We’ve added a large number of devices to our platform over the past few months, which has revealed some bottlenecks in the backend. One of them is how often we process device metrics (once every 10 seconds). We’ve also observed system degradation under these conditions, caused by reconnect storms from hundreds of thousands of balena Supervisors running in the field after a backend crash or restart.
We’ve deployed stability fixes that limit how often metrics data points are stored in the database to once every 60 seconds. We’ve also started investigating options for scaling out our read DB workloads and, beyond that, for sharding our backend databases to allow smooth scaling towards much larger device fleets.
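To make the 60-second limit concrete, here is a minimal sketch of one way such a throttle could work. It is not our actual backend code: the `MetricsThrottle` class, the `should_store` method, the `device-a` ID, and the choice to apply the limit per device are all assumptions made for illustration.

```python
from typing import Dict

# Assumed minimum gap between stored data points for a single device.
STORE_INTERVAL_SECONDS = 60.0


class MetricsThrottle:
    """Hypothetical helper that decides whether a metrics report gets stored."""

    def __init__(self, interval: float = STORE_INTERVAL_SECONDS) -> None:
        self.interval = interval
        # Timestamp of the last stored data point, keyed by device ID.
        self._last_stored: Dict[str, float] = {}

    def should_store(self, device_id: str, timestamp: float) -> bool:
        last = self._last_stored.get(device_id)
        if last is not None and timestamp - last < self.interval:
            # Too soon after the previous stored point: skip the write.
            return False
        self._last_stored[device_id] = timestamp
        return True


# A device reporting every 10 seconds, as described above.
throttle = MetricsThrottle()
reports = [0, 10, 20, 30, 40, 50, 60, 70]
stored = [t for t in reports if throttle.should_store("device-a", float(t))]
print(stored)
```

In this sketch the device still reports every 10 seconds, but only the reports at 0 and 60 seconds are kept, so database writes drop to roughly one sixth.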