Elevated API Errors
Incident Report for Balena.io
Postmortem

A large number of devices have been added to our platform over the past few months, which has revealed some bottlenecks in the backend, one of which is the frequency at which we process device metrics (once every 10 seconds). We’ve also observed undesirable system degradation under these conditions, caused by reconnect storms from hundreds of thousands of balena Supervisors running in the field after a backend crash or restart.

We’ve deployed stability fixes that limit the metrics data points stored in the database to one every 60 seconds. We’ve also started investigating options for scaling out our read DB workloads and, beyond that, for sharding our backend databases to allow smooth scaling towards much larger device fleets.
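
As an illustration of the 60-second limit described above, here is a minimal sketch of a throttle in front of the metrics write path. It is not the actual balena API code: the `MetricsThrottle` class, its `should_store` method, and the choice to apply the limit per device and keep state in memory are all assumptions made for this example.

```python
import time
from typing import Callable, Dict


class MetricsThrottle:
    """Hypothetical helper: admit at most one metrics data point
    per device per interval (assumed to be 60 seconds)."""

    def __init__(
        self,
        interval_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.interval_seconds = interval_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._last_stored: Dict[str, float] = {}

    def should_store(self, device_id: str) -> bool:
        """Return True if this report should be written to the database."""
        now = self._clock()
        last = self._last_stored.get(device_id)
        if last is not None and now - last < self.interval_seconds:
            # A data point for this device was stored too recently: drop it.
            return False
        self._last_stored[device_id] = now
        return True
```

With devices still reporting every 10 seconds, a filter like this would pass roughly one in six reports per device through to the database, which is where the reduction in write load would come from. A multi-instance backend would need shared state rather than an in-process dictionary.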

Posted Oct 19, 2021 - 22:30 UTC

Resolved
This incident has been resolved.
Posted Oct 19, 2021 - 21:52 UTC
Monitoring
We've deployed a new API release containing fixes for DB connection pool management, and have temporarily disabled reported device metrics to reduce the load on the backend.
Posted Oct 19, 2021 - 20:41 UTC
Update
Requests are failing because of a bug in the database connection pooling. We are working on a fix that will resolve this problem.
Posted Oct 19, 2021 - 19:32 UTC
Update
A change introduced in the API to reduce load spikes caused a snowball effect that generated more queries into the backend. A fix has been released and we are now processing the backlog of requests.
Posted Oct 19, 2021 - 17:28 UTC
Update
We are continuing to work on a fix for this issue.
Posted Oct 19, 2021 - 17:24 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Oct 19, 2021 - 15:00 UTC
Investigating
We're experiencing an elevated level of API errors and are currently looking into the issue. The database is handling a large volume of device requests at the moment. We are working to recover the system.
Posted Oct 19, 2021 - 14:28 UTC
This incident affected: API.