

Hanging status updates
Incident Report for Coveralls
Postmortem

Postmortem:

We want to share a postmortem on this incident since it took us an unusually long time to identify its root cause and resolve it, and since it affected an unusually large number of users throughout its course.

Summary:

The cause of this incident was a failure to allocate sufficient resources to, or put sufficient monitoring in place for, an existing background job queue after assigning a new background job to it. To avoid incidents of this type in the future, we have implemented a pre-deploy process for features that introduce new background jobs, something we’ve done less and less frequently over the past several years as our codebase and infrastructure have matured.

Cause of incident:

  • Early last week (Mon, Apr 1), we deployed an optimization meant to address Gateway Timeout errors experienced by a small number of customers with massively parallel builds (builds with hundreds of parallel jobs).
  • As part of this optimization, we moved a common process, “Job creation,” to a new background job and, treating the change as an experiment, assigned it to a readily available (i.e. traffic-free) queue: our default queue. We released it to production and watched it for a day and a half with good results. The change resolved the issue we aimed to fix, and all looked good from the standpoint of error tracking and performance.
  • Unfortunately, while we considered traffic when selecting a queue during the initial implementation, we did not consider the need to create a permanent, dedicated queue for the new background job (which also represented a new class of background job). Nor, after seeing good performance on Mon-Tue, did we evaluate whether our default queue’s configuration needed to change. That queue turned out to be not only insufficiently resourced, but also insufficiently monitored.
  • As a result, when we entered our busiest period later in the week (Wed-Thu), the newly utilized queue backed up, and we didn’t know it because we had no visibility into it (see the queue-visibility sketch after this list). Because the new background job (Job creation) precedes a full series of subsequent jobs, the backup acted as a gateway, artificially limiting traffic to the downstream queues we were monitoring, where everything looked healthy across all of our metrics.
  • By the time we realized what was going on, we had 35K jobs stuck in the newly utilized queue.
  • At that point, the issue was easy to fix: first by scaling up, and then by allocating proper resources to the new queue going forward. But for most of the day we did not understand what was going on, and, as backed-up jobs accrued, the backlog affected a growing number of users.
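For context, a backup like this is visible directly through Sidekiq’s queue API once you know which queue to watch. The snippet below is a minimal sketch of the kind of check that would have surfaced the problem early; the queue name and thresholds are hypothetical, not our actual configuration.

  # Hypothetical queue-depth/latency check using Sidekiq's public API.
  # The queue name and thresholds are illustrative only.
  require "sidekiq/api"

  queue   = Sidekiq::Queue.new("default")  # the queue the new job was assigned to
  depth   = queue.size                     # jobs currently waiting in the queue
  latency = queue.latency                  # seconds the oldest job has been waiting

  if depth > 10_000 || latency > 300
    warn "Queue 'default' is backing up: #{depth} jobs, #{latency.round}s latency"
    # In practice this would feed a dashboard or page on-call rather than just warn.
  end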

Actions taken to avoid future incidents of this type:

Hindsight being 20/20, we clearly could have avoided this incident with a little more process around deploys of certain types of features, in particular features that entail the creation of new background jobs (something we had not done in any significant way for over a year prior).

As avoidable as the initial misstep was, its impact was significant: it led us to miss the true underlying issue for most of an 18-hour period, which is just not acceptable in a production environment.

In response to this incident, we have added the following new step to our deployment process:

  • Prior to deployment, if changes entail the creation of any new background jobs, or modification of any existing background jobs, we must evaluate the need to update our Sidekiq configuration, including the creation of any new workers or worker groups (see the sketch below for the kind of change this covers).
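To illustrate the kind of change this step is meant to catch, here is a minimal sketch of routing a job class to its own dedicated Sidekiq queue rather than the default queue. The class name, queue name, and options are hypothetical, and it assumes a recent Sidekiq version where Sidekiq::Job is available (older versions use Sidekiq::Worker).

  # Hypothetical example: give a new class of background job its own queue
  # instead of piggybacking on the default queue. Names are illustrative.
  require "sidekiq"

  class JobCreationWorker
    include Sidekiq::Job
    sidekiq_options queue: "job_creation", retry: 5

    def perform(build_id)
      # ... create the downstream jobs for a parallel build ...
    end
  end

  # The dedicated queue also needs to be listed in the Sidekiq configuration
  # (e.g. config/sidekiq.yml) so a worker process actually pulls from it:
  #
  #   :queues:
  #     - [job_creation, 2]
  #     - [default, 1]

Dedicating a queue (and, where warranted, a worker group) to a new job class is what makes the resourcing and monitoring decisions explicit at deploy time rather than an afterthought.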

We’ve been operating Coveralls.io for over 13 years now, but we are, of course, far from perfect in doing so, and, clearly, we still make mistakes. While mistakes are probably unavoidable, our main goal in addressing them is to try not to make the same mistake twice. This was a new one for us (or at least new in recent years for our current team), and it has caused us to shore up our SOPs around deploys in a way that should reduce this type of incident in the future.

Posted Apr 08, 2024 - 12:30 PDT

Resolved
All queues are cleared. As a result, all builds and status updates previously reported as delayed should now be complete and received.

We are not seeing any further backups in any queues, but we will continue monitoring into the morning, when our usage increases.

If you are still experiencing any unfinished builds, or delayed status updates, please reach out and let us know at support@coveralls.io.
Posted Apr 04, 2024 - 21:11 PDT
Update
All backed-up queues are fully drained. There is now a flurry of activity in some associated queues, which are completing the processing and notifications of previously delayed builds. Those are processing quickly, and we expect all builds and notifications previously reported as delayed today to be complete in the next 30-45 minutes.

Our fix has been fully deployed and we will be monitoring for any further backups.
Posted Apr 04, 2024 - 20:48 PDT
Update
The backed up queue affecting all users has drained by 75%. Our fix is still being deployed across all servers, but should start taking effect in the next 15-20 min.
Posted Apr 04, 2024 - 20:38 PDT
Monitoring
We have scaled up processes on clogged background queues and they are draining. We have also implemented a fix we hope will avoid further backups and are monitoring for effects.
Posted Apr 04, 2024 - 20:08 PDT
Identified
We have identified the root cause of the delayed status updates reported today for some repos: backups in several queues that process background jobs for aggregate coverage calculations on new builds. Those jobs precede the sending of notifications and are therefore delaying them.

However, we have not yet identified a pattern behind these spikes or the delays in processing these queues, since none of our usual performance alerts had been triggered (until recently, when a queue that affects all users triggered an alarm).

We are scaling up server processes to clear that backup, but since we are not seeing degraded performance metrics from servers, we are continuing to investigate other causes for delayed processing.
Posted Apr 04, 2024 - 19:39 PDT
Investigating
Several customers have reported long delays receiving status updates for new builds at GitHub, or status updates that have hung and never arrived. We are investigating the issue.

If you are experiencing this issue, please reach out and let us know at support@coveralls.io so we can include your cases in our investigation.

Note that GitHub experienced some incidents affecting API requests in the last 24 hrs, per this status update from GitHub:
https://www.githubstatus.com/incidents/gqj5jrvzjb5h

We are evaluating reported cases against this timeframe to understand whether they align with the GitHub incident period.
Posted Apr 04, 2024 - 13:00 PDT
This incident affected: Coveralls.io Web, Coveralls.io API, and GitHub.