From 22:20 UTC on May 1, 2025, to 02:00 UTC on May 2, 2025, CircleCI customers experienced delays in starting most jobs. The affected jobs were limited to the following resource classes: Docker large, Docker medium, Docker small, and Linux large. During this time, customers may have also experienced delays in receiving status checks.
(all times UTC)
At approximately 22:05 on May 1, 2025, we initiated a database upgrade for the service that dispatches jobs. We used a blue/green deployment to stand up a second database running the upgraded version, with logical replication keeping the data across the two databases in sync. We had been running the blue (old version) and the green (new version) databases without issues for a couple of days, and replication was confirmed to be in sync when we triggered the cutover from blue to green. Upon completion of the cutover, we noticed application errors for jobs, which indicated that the application pods had failed to automatically pick up the new DNS route. A rolling restart of the pods was performed, and all pods were back online with no further application errors as of 22:17.
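For illustration, here is a minimal sketch of how a blue/green database pair can be kept in sync with logical replication. It assumes a PostgreSQL-style setup and the psycopg2 driver; the connection strings, publication, and subscription names are illustrative placeholders rather than our actual configuration.

```python
# Illustrative only: stand up logical replication from the blue (old) database
# to the green (upgraded) database. Assumes PostgreSQL and psycopg2; all names
# and connection strings are hypothetical.
import psycopg2

BLUE_DSN = "host=blue-db dbname=dispatch user=admin"    # old version
GREEN_DSN = "host=green-db dbname=dispatch user=admin"  # upgraded version

def setup_blue_green_replication():
    # Publish every table on the blue (source) database.
    blue = psycopg2.connect(BLUE_DSN)
    blue.autocommit = True
    blue.cursor().execute("CREATE PUBLICATION blue_pub FOR ALL TABLES")
    blue.close()

    # Subscribe the green (upgraded) database to that publication so changes
    # stream across continuously until the cutover is triggered.
    green = psycopg2.connect(GREEN_DSN)
    green.autocommit = True  # CREATE SUBSCRIPTION cannot run in a transaction
    green.cursor().execute(
        "CREATE SUBSCRIPTION green_sub CONNECTION %s PUBLICATION blue_pub",
        (BLUE_DSN,),
    )
    green.close()
```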
At 22:40, teams were alerted that Docker jobs were backing up. They initially investigated whether the pod restarts had left fewer processing nodes online and began manually scaling up the nodes. At 23:47, it was confirmed that only a small number of jobs were making it through to the processing pods, which explained the backlog and ruled out an infrastructure issue. It was determined that jobs in the following resource classes were not executing: Docker large, Docker medium, Docker small, and Linux large.
At 00:40 on May 2, 2025, orphaned task records for the above-mentioned resource classes were identified. An orphaned task record is an item of work with no associated job; when the service picked up one of these records, it failed in a way that prevented the next record from being picked up. The team updated the status of these tasks to “completed” and immediately saw more jobs processing as the backlog dropped. By 00:45, the backlog of jobs had completely cleared and the issue was thought to be remediated.
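To illustrate the remediation pattern, the self-contained sketch below uses SQLite with a hypothetical tasks/jobs schema: it finds task records with no associated job and marks them “completed” so they stop blocking dispatch. The schema, table names, and status values are illustrative, not our actual data model.

```python
# Illustrative only: identify "orphaned" task records (tasks with no
# associated job) and mark them completed so the dispatcher can move on.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE jobs  (id INTEGER PRIMARY KEY, task_id INTEGER REFERENCES tasks(id));

    INSERT INTO tasks VALUES (1, 'pending'), (2, 'pending'), (3, 'pending');
    INSERT INTO jobs  VALUES (10, 1), (11, 3);   -- task 2 has no job: orphaned
""")

# Mark tasks with no associated job as completed so they stop blocking dispatch.
conn.execute("""
    UPDATE tasks
       SET status = 'completed'
     WHERE status = 'pending'
       AND NOT EXISTS (SELECT 1 FROM jobs WHERE jobs.task_id = tasks.id)
""")
conn.commit()

print(conn.execute("SELECT id, status FROM tasks ORDER BY id").fetchall())
# -> [(1, 'pending'), (2, 'completed'), (3, 'pending')]
```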
At 00:56, an alert triggered, warning of a backlog of jobs once again. Upon investigation, it was determined that only the Docker large, medium, and small resource classes were affected; all other resource classes, including Linux jobs, were operating as expected. Further investigation determined that additional orphaned task records had been written to the database after 00:40. Logical replication was manually disabled, and the orphaned task records were updated at 01:55. By 02:10, the backlog of jobs had once again cleared. The team continued to monitor over the following hour, observed no additional occurrences of orphaned tasks, and declared the incident closed at 03:39.
Post-incident, the team continued to investigate. The root cause was determined to be a race condition between the application and logical replication when the application pods were restarted. A task event was rerun and written to the green (new) database before the original task event's status had been replicated from the blue (old) database. This caused a unique constraint violation that broke replication. Because logical replication does not respect foreign key constraints, task records older than those already present were replicated into the green database, creating the orphaned task records seen during the incident. The issue resurfaced immediately after the job queue was drained, as the failed replication task attempted to restart.
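The following self-contained sketch reproduces the shape of that race using SQLite in place of the real databases: the restarted application writes the rerun task event to the green database first, and the original event arriving later via replication then violates a unique constraint, which is what halted the replication stream. The schema and values are illustrative.

```python
# Illustrative only: reproduce the shape of the race that broke replication.
# "green" stands in for the upgraded database; the replicated row arriving
# late from "blue" collides with the row the restarted application already
# wrote.
import sqlite3

green = sqlite3.connect(":memory:")
green.execute("""
    CREATE TABLE task_events (
        task_id INTEGER PRIMARY KEY,   -- uniqueness on the task event
        status  TEXT
    )
""")

# 1. After the pod restart, the application reruns the task event and writes
#    it directly to green.
green.execute("INSERT INTO task_events VALUES (42, 'rerun')")

# 2. Logical replication then tries to apply the original event from blue,
#    which had not yet been delivered when the application wrote.
try:
    green.execute("INSERT INTO task_events VALUES (42, 'completed')")
except sqlite3.IntegrityError as err:
    # In PostgreSQL-style logical replication, an error like this stops the
    # subscription's apply worker, which is how replication "broke".
    print(f"replication apply failed: {err}")
```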
This incident exposed the need for further controls on database writes during upgrades that use logical replication. Even when replication is reported as in sync, the few milliseconds of network delay incurred in transferring data can be enough to trigger this scenario.
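One possible control of this kind, sketched under assumptions rather than as our final implementation: fence application writes, wait until the publisher reports that the subscriber has flushed all outstanding WAL, and only then cut over. This assumes PostgreSQL with a logical replication slot named green_sub; the write-fence and DNS-switch helpers are hypothetical stubs.

```python
# Illustrative only: hold application writes, wait until the blue (publisher)
# database reports that the green subscriber has flushed all WAL, then cut
# over. Assumes PostgreSQL and psycopg2; names and helpers are hypothetical.
import time
import psycopg2

BLUE_DSN = "host=blue-db dbname=dispatch user=admin"   # hypothetical

LAG_QUERY = """
    SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
      FROM pg_replication_slots
     WHERE slot_name = %s
"""

def pause_dispatch_writes():   # hypothetical write fence
    print("dispatch writes paused")

def resume_dispatch_writes():  # hypothetical write fence
    print("dispatch writes resumed")

def switch_dns_to_green():     # hypothetical cutover step
    print("DNS now points at green")

def wait_for_zero_lag(slot_name="green_sub", poll_seconds=0.5):
    """Block until the subscriber has confirmed every byte the publisher wrote."""
    with psycopg2.connect(BLUE_DSN) as conn, conn.cursor() as cur:
        while True:
            cur.execute(LAG_QUERY, (slot_name,))
            (lag_bytes,) = cur.fetchone()
            if lag_bytes == 0:
                return
            time.sleep(poll_seconds)

def cut_over():
    pause_dispatch_writes()
    try:
        wait_for_zero_lag()     # no in-flight changes left to replicate
        switch_dns_to_green()
    finally:
        resume_dispatch_writes()
```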
We sincerely apologize for the disruption this incident caused to your ability to build on our platform. We understand the critical role CircleCI plays in your development workflow and take any service disruption seriously. We're committed to learning from this experience and have already implemented several measures to prevent similar occurrences in the future.
Thank you for your patience and continued trust in CircleCI.