What happened?
At 17:30 UTC on June 9, 2021, we experienced a spike in CPU usage on our primary database servers. A bulk action in billing groups caused the primary database server to lock up. Users who were logged in at the time had their sessions terminated, and no new login requests could be processed.
Details and corrective actions
At 17:32 UTC we initiated a fail-over to a new primary database server. This was completed at 17:46 UTC, at which time full service was restored with no data loss.
What happened next?
At 23:33 UTC we experienced a similar set of events and the database on the new primary database server became unresponsive.
Details and corrective actions
At 23:42 UTC we initiated a fail-over to another new primary database server. This was completed at 23:56 UTC, at which time full service was restored with no data loss.
Summary
The first outage resulted in 16 minutes of downtime, with another 23 minutes of downtime for the second outage.
What we’ve done to avoid this happening again
Following the outage on May 19, we performed a series of maintenance operations over the weekend of May 22-23 to significantly improve our database performance. This reduced the likelihood of a similar outage, though it did not eliminate the risk entirely.
As an immediate remedial action, we have temporarily disabled the bulk change functionality in billing groups (an Enterprise Network feature). This functionality is being refactored and will be re-released as soon as possible.
We are also conducting a comprehensive review of bulk operations within the MURAL platform to ensure this cannot happen again.
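One common way to make bulk operations like this safer is to split them into small batches with short transactions, so no single statement holds database locks long enough to stall other queries. The sketch below illustrates that pattern; it is not MURAL's actual code, and the table, column names, and batch size are all hypothetical. SQLite is used as a stand-in for the production database.

```python
import sqlite3

BATCH_SIZE = 100  # hypothetical tuning parameter


def apply_bulk_change(conn, member_ids, new_group):
    """Update members in fixed-size batches, committing after each batch
    so locks are released between batches instead of held for the whole run."""
    for start in range(0, len(member_ids), BATCH_SIZE):
        batch = member_ids[start:start + BATCH_SIZE]
        placeholders = ",".join("?" for _ in batch)
        conn.execute(
            f"UPDATE members SET billing_group = ? WHERE id IN ({placeholders})",
            [new_group, *batch],
        )
        conn.commit()  # short transaction: other sessions can proceed


# Demo with an in-memory database and hypothetical schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (id INTEGER PRIMARY KEY, billing_group TEXT)")
conn.executemany(
    "INSERT INTO members VALUES (?, ?)", [(i, "old") for i in range(250)]
)
conn.commit()

apply_bulk_change(conn, list(range(250)), "new")
moved = conn.execute(
    "SELECT COUNT(*) FROM members WHERE billing_group = 'new'"
).fetchone()[0]
print(moved)  # 250
```

The trade-off is that the change is no longer atomic: a failure mid-run leaves some batches applied, so batched bulk operations generally need to be idempotent or resumable.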