Delays in starting some jobs

Incident Report for CircleCI

Postmortem

Summary

On May 1, 2025, from 22:20 UTC to May 2, 2025, 02:00 UTC, CircleCI customers experienced delays in starting most jobs. The affected jobs were limited to the following resource classes: Docker large, Docker medium, Docker small, and Linux large. During this time, customers may also have experienced delays in receiving status checks.

What Happened

(all times UTC)

At approximately 22:05 on May 1, 2025, we initiated a database upgrade for the service that dispatches jobs. We used a blue/green deployment to stand up a second database running the upgraded version, with logical replication keeping the data across the two databases in sync. We had been running the blue (old version) and the green (new version) databases without issue for a couple of days, and replication was confirmed to be in sync when we triggered the cutover from blue to green. Upon completion of the cutover, we noticed application errors for jobs, which indicated that the application pods had failed to automatically pick up the new DNS route. A rolling restart of the pods was performed, and all pods were back online with no further application errors as of 22:17.
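
For context, the sketch below shows roughly what this kind of setup can look like when the databases are PostgreSQL and the sync is handled with built-in logical replication. The connection strings, database name, and publication/subscription names are illustrative assumptions, not our actual configuration.

    # Hypothetical sketch of the replication setup for a blue/green database
    # upgrade, assuming PostgreSQL logical replication. All names and
    # connection strings are invented for illustration.
    import psycopg2

    BLUE_DSN = "host=blue-db dbname=dispatch user=admin"    # old version (hypothetical)
    GREEN_DSN = "host=green-db dbname=dispatch user=admin"  # new version (hypothetical)

    # On the blue (source) database: publish the tables that need to stay in sync.
    blue = psycopg2.connect(BLUE_DSN)
    blue.autocommit = True
    with blue.cursor() as cur:
        cur.execute("CREATE PUBLICATION dispatch_upgrade FOR ALL TABLES;")
    blue.close()

    # On the green (target) database: subscribe, so writes made on blue keep
    # streaming to green until the cutover is triggered.
    # CREATE SUBSCRIPTION cannot run inside a transaction, hence autocommit.
    green = psycopg2.connect(GREEN_DSN)
    green.autocommit = True
    with green.cursor() as cur:
        cur.execute(
            "CREATE SUBSCRIPTION dispatch_upgrade_sub "
            "CONNECTION 'host=blue-db dbname=dispatch user=replicator' "
            "PUBLICATION dispatch_upgrade;"
        )
    green.close()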

At 22:40, teams were alerted that Docker jobs were backing up. They initially investigated whether the pod restarts had caused fewer processing nodes to be online, and began manually scaling up the nodes. At 23:47, it was confirmed that only a small number of jobs were making it through to the processing pods, which was causing the backlog and ruled out an infrastructure issue. It was determined that jobs in the following resource classes were not executing: Docker large, Docker medium, Docker small, and Linux large.

At 00:40 on May 2, 2025, orphaned task records for the above-mentioned resource classes were identified. An orphaned task record is an item of work with no associated job; when one of these records was picked up by the service, it caused a failure that prevented the next record from being picked up. The team updated the status of these tasks to “completed” and immediately saw more jobs processing, and the backlog of jobs dropped. By 00:45, the backlog of jobs had completely cleared and the issue was thought to be remediated.
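
As an illustration only, the cleanup resembled the sketch below; the tasks/jobs table names, columns, and status values are hypothetical stand-ins rather than the dispatch service's actual schema.

    # Hypothetical remediation sketch: find task records with no associated
    # job and mark them "completed" so they stop blocking the dispatch queue.
    import psycopg2

    GREEN_DSN = "host=green-db dbname=dispatch user=admin"  # hypothetical connection

    with psycopg2.connect(GREEN_DSN) as conn:
        with conn.cursor() as cur:
            # Tasks with no matching job are the orphaned records; completing
            # them lets the dispatcher move on to the next record.
            cur.execute(
                """
                UPDATE tasks
                   SET status = 'completed'
                 WHERE status <> 'completed'
                   AND NOT EXISTS (
                       SELECT 1 FROM jobs WHERE jobs.task_id = tasks.id
                   );
                """
            )
            print(f"cleared {cur.rowcount} orphaned task records")
    # leaving the with-block commits the transaction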

At 00:56 UTC, an alert triggered, warning of a backlog of jobs once again. Upon investigation, it was determined that only some Docker resource classes were affected: large, medium, and small. All other resource classes, including Linux jobs, were operating as expected. An investigation determined that additional orphaned task records had been written to the database after 00:40. Logical replication was manually disabled and the orphaned task records were updated at 01:55. At 02:10 the backlog of jobs had once again cleared. The team continued to monitor over the following hour with no additional occurrences of orphaned tasks and declared the incident closed at 03:39.
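
In a PostgreSQL-style setup, manually disabling logical replication is a single statement; the sketch below assumes the same hypothetical subscription name used above.

    # Hypothetical sketch of disabling the logical replication subscription on
    # the green database so no further conflicting or out-of-order rows arrive
    # while the orphaned records are cleaned up. Names are illustrative only.
    import psycopg2

    GREEN_DSN = "host=green-db dbname=dispatch user=admin"  # hypothetical connection

    conn = psycopg2.connect(GREEN_DSN)
    conn.autocommit = True
    with conn.cursor() as cur:
        # Stop applying changes from the blue database.
        cur.execute("ALTER SUBSCRIPTION dispatch_upgrade_sub DISABLE;")
    conn.close()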

Post-incident, the team continued to investigate. The root cause was determined to be a race condition between the application and logical replication when the application pods were restarted. A task event was rerun and written to the green (new) database before the original task event status had been replicated from the blue (old) database. This created a unique constraint error that broke replication. Because logical replication does not respect foreign key constraints, task records older than those already present were replicated to the green database, creating the orphaned task records seen during the incident. The issue resurfaced immediately after the job queue was drained, when the failed replication task tried to restart.
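
The ordering problem can be illustrated without a database at all. The toy simulation below uses invented names and structures purely to show how a rerun write landing before the replicated original produces a duplicate key and halts the apply worker.

    # Illustrative (non-database) simulation of the race described above;
    # this is not the dispatch service's real data model.
    green_task_events = {}  # primary key -> row, standing in for the green database
    replication_backlog = [
        # The original event, committed on blue but not yet replicated to green.
        {"id": 101, "task": "docker-large-build", "status": "dispatched"},
    ]

    # 1. After the cutover and pod restart, the application reruns the task and
    #    writes the event straight to green, reusing the same primary key.
    green_task_events[101] = {"id": 101, "task": "docker-large-build", "status": "rerun"}

    # 2. Logical replication then tries to apply the original row from blue.
    for row in replication_backlog:
        if row["id"] in green_task_events:
            # The equivalent of a unique-constraint violation: the apply worker
            # errors out and replication halts at this row.
            print(f"duplicate key on id={row['id']}; replication stops here")
            break
        green_task_events[row["id"]] = row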

Future Prevention and Process Improvement

The incident exposed the need to implement further controls on database writes during the upgrade process while using logical replication. Even when replication is in sync, the milliseconds of network delay incurred in transferring the data can be enough to trigger this scenario.

  1. We will update the upgrade procedure to limit writes to the database for a short period of time while logical replication writes the final updates from the old database version to the new version.
  2. A second data replication verification test will be added to the procedure before turning writes on for the new version (see the sketch after this list).
  3. Once replication is confirmed to be in sync, replication will be disabled to avoid any possibility of conflicts.
  4. We will implement a more in-depth review process between the database and service owner teams to review the upgrade process and its risks prior to performing the change.
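
As a rough illustration of items 1-3, assuming a PostgreSQL-style setup, the extra verification could look something like the sketch below; the slot name, connection string, and wait loop are assumptions made for illustration, not the final procedure.

    # Hedged sketch: pause application writes, wait until the subscriber has
    # confirmed flushing everything the publisher has written, then disable
    # replication and allow writes against the new version.
    import time
    import psycopg2

    BLUE_DSN = "host=blue-db dbname=dispatch user=admin"  # publisher (hypothetical)
    SLOT_NAME = "dispatch_upgrade_sub"  # a subscription's slot defaults to its name

    def replication_caught_up(conn):
        """True once the subscriber has confirmed the publisher's current WAL position."""
        with conn.cursor() as cur:
            cur.execute(
                "SELECT pg_current_wal_lsn() <= confirmed_flush_lsn "
                "FROM pg_replication_slots WHERE slot_name = %s;",
                (SLOT_NAME,),
            )
            row = cur.fetchone()
            return bool(row and row[0])

    blue = psycopg2.connect(BLUE_DSN)
    blue.autocommit = True

    # (1) Pause application writes here, e.g. via a brief read-only window (not shown).
    # (2) Wait for the final updates to drain from blue to green.
    while not replication_caught_up(blue):
        time.sleep(1)

    # (3) Only now disable replication and cut writes over to the new version.
    print("replication confirmed in sync; safe to disable it and enable writes on green")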

We sincerely apologize for the disruption this incident caused to your ability to build on our platform. We understand the critical role CircleCI plays in your development workflow and take any service disruption seriously. We're committed to learning from this experience and have already implemented several measures to prevent similar occurrences in the future.

Thank you for your patience and continued trust in CircleCI.

Posted May 22, 2025 - 16:00 UTC

Resolved

All jobs are now running normally. Thank you for your patience whilst we resolved the issue.
Posted May 02, 2025 - 00:45 UTC

Update

We are continuing to monitor for any further issues.
Posted May 02, 2025 - 00:36 UTC

Update

Jobs in the following resource classes will have suffered significant delays in running; these will be processed over the next X minutes.

* Docker Large, Medium and Small
* Linux Large

Those jobs will start within the next 15 minutes; you should not need to retry them. We thank you for your patience whilst we resolve this issue.
Posted May 02, 2025 - 00:29 UTC

Update

We're continuing to monitor the delays with starting Docker jobs. Thank you for your patience.
Posted May 01, 2025 - 23:43 UTC

Update

Docker jobs have not recovered as expected, and customers may continue to see delays for Docker jobs starting. We are working to increase capacity and thank you for your patience.
Posted May 01, 2025 - 23:05 UTC

Update

This incident impacted final result delivery between 22:06 and 22:17 UTC. Customers may experience delays starting Docker Large jobs as the system recovers. We will continue to monitor recovery and thank you for your patience.
Posted May 01, 2025 - 22:49 UTC

Monitoring

This also impacts status checks, which may not have been sent to GitHub.
Posted May 01, 2025 - 22:40 UTC
This incident affected: Docker Jobs, Machine Jobs, macOS Jobs, and Windows Jobs.