Degraded Agent Dispatch and API performance
Incident Report for Buildkite
Postmortem

Service Impact

On 2022-01-17, between 2:06 UTC and 2:59 UTC, Pipelines notification workers, job dispatch, and the Agent API had degraded performance.

Incident Timeline

This incident began at 1:52 UTC, when an error in the Test Analytics application caused a large number of sidekiq retries that flooded our Redis server. Our investigation found that the resulting Redis timeouts were also affecting the Pipelines product, delaying job dispatch by up to 20 seconds.
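
As a rough illustration of why a failing job that keeps retrying can flood a shared queue store, the sketch below counts how many executions a batch of failing jobs generates and shows a capped exponential backoff schedule. The function names, job counts, and backoff parameters are hypothetical and chosen only for illustration; they are not the actual sidekiq configuration or the figures from this incident.

```python
def backoff_delays(max_retries, base_seconds=1.0, cap_seconds=300.0):
    """Return the wait before each retry: doubling each attempt, capped."""
    return [
        min(cap_seconds, base_seconds * (2 ** attempt))
        for attempt in range(max_retries)
    ]


def total_executions(failing_jobs, max_retries):
    """Each failing job runs once, then once more for every retry."""
    return failing_jobs * (1 + max_retries)


if __name__ == "__main__":
    # Hypothetical numbers: 10,000 failing jobs, each allowed 25 retries.
    print(total_executions(10_000, 25))  # 260000
    print(backoff_delays(5))             # [1.0, 2.0, 4.0, 8.0, 16.0]
```

Every one of those executions reads from and writes to the queue store, so lowering the retry limit or lengthening the backoff reduces the load a single bug can place on infrastructure shared with other products.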

The deployment that caused the error was closely monitored and a fix was merged at 2:01 UTC.

Our monitoring system notified us of the incident at 2:06 UTC. No customer impact was reported.

Due to the degraded state of job dispatching, our fix failed to deploy through our usual deployment pipeline. In order to mitigate the impact and deploy the fix, Test Analytics was placed into maintenance mode at 2:14 UTC.

An incident was posted to statuspage at 2:20 UTC.

An attempt was made to terminate the broken sidekiq jobs; however, our production console was unavailable due to the same Redis timeout issue. As load began to decrease, we were able to gain access and terminate the failing jobs at 2:38 UTC. A manual deployment rollback was performed shortly afterwards.

System performance returned to normal at 2:55 UTC.

At around 3:00 UTC we performed a manual deployment of a fix.

The incident was marked as resolved at 3:12 UTC.

Changes we’re making

As part of our reliability review and the release of Test Analytics, we’re allocating additional dedicated infrastructure for Test Analytics in the form of sidekiq, Redis, and ActionCable. This will reduce the chance of an outage in Test Analytics cross-impacting Pipelines, and vice versa. Test Analytics already has a dedicated database server. Work on the dedicated infrastructure commenced in December 2021, prior to this incident, and is due to complete before the General Access release of Test Analytics.
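
The isolation described above can be pictured as each product resolving its own queue backend, so that a flood of retries in one product cannot exhaust the other's Redis. This is a minimal sketch under stated assumptions: the `QueueBackend` type, the `backend_for` helper, and the Redis URLs are all hypothetical and do not describe the real deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueBackend:
    """Connection settings for one product's job queue (hypothetical)."""
    redis_url: str


# Hypothetical hosts: one Redis instance per product, nothing shared.
BACKENDS = {
    "pipelines": QueueBackend("redis://pipelines-redis.example.internal:6379/0"),
    "test_analytics": QueueBackend("redis://analytics-redis.example.internal:6379/0"),
}


def backend_for(product):
    """Look up the dedicated backend for a product; fail loudly if unknown."""
    try:
        return BACKENDS[product]
    except KeyError:
        raise KeyError(f"no queue backend configured for {product!r}") from None
```

Failing on an unknown product, rather than falling back to a default instance, avoids quietly routing new workloads onto a shared Redis.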

Posted Jan 21, 2022 - 02:00 UTC

Resolved
This incident has been resolved.
Posted Jan 17, 2022 - 03:12 UTC
Update
System performance is returning to normal. We are continuing to monitor results.
Posted Jan 17, 2022 - 02:59 UTC
Monitoring
The fix has been deployed and we are monitoring the results.
Posted Jan 17, 2022 - 02:38 UTC
Identified
We have identified a problem and are deploying a fix.
Posted Jan 17, 2022 - 02:24 UTC
Investigating
We are currently investigating slow agent dispatch for running builds and API performance.
Posted Jan 17, 2022 - 02:20 UTC
This incident affected: Agent API, REST API, and Job Queue.