What happened?
At 17:30 UTC on June 9, 2021, we experienced a spike in CPU usage on our primary database servers. A bulk action in billing groups caused the primary database server to lock up. Users who were logged in at the time had their sessions terminated, and no new login requests could be processed.
Details and corrective actions
At 17:32 UTC we initiated a fail-over to a new primary database server. This was completed at 17:46 UTC, at which time full service was restored with no data loss.
What happened next?
At 23:33 UTC we experienced a similar set of events and the database on the new primary database server became unresponsive.
Details and corrective actions
At 23:42 UTC we initiated a fail-over to another new primary database server. This was completed at 23:56 UTC, at which time full service was restored with no data loss.
Summary
The first outage resulted in 16 minutes of downtime, with another 23 minutes of downtime for the second outage.
What we’ve done to avoid this happening again
Following the outage on May 19, we performed a series of maintenance operations over the weekend of May 22-23 to significantly improve our database performance. This reduced the likelihood of a similar outage, though it did not eliminate the risk entirely.
As an immediate remedial action, we have temporarily disabled the bulk change functionality in billing groups (an Enterprise Network feature). This functionality is being refactored and will be re-released as soon as possible.
We are also conducting a comprehensive review of bulk operations within the MURAL platform to ensure this cannot happen again.
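One common way to make bulk operations like this safer is to split them into small batches with short transactions, so no single statement holds database locks long enough to stall other queries. The sketch below illustrates that pattern; it is not MURAL's actual code, and the table, column names, and batch size are all hypothetical. SQLite is used as a stand-in for the production database.

```python
import sqlite3

BATCH_SIZE = 100  # hypothetical tuning parameter


def apply_bulk_change(conn, member_ids, new_group):
    """Update members in fixed-size batches, committing after each batch
    so locks are released between batches instead of held for the whole run."""
    for start in range(0, len(member_ids), BATCH_SIZE):
        batch = member_ids[start:start + BATCH_SIZE]
        placeholders = ",".join("?" for _ in batch)
        conn.execute(
            f"UPDATE members SET billing_group = ? WHERE id IN ({placeholders})",
            [new_group, *batch],
        )
        conn.commit()  # short transaction: other sessions can proceed


# Demo with an in-memory database and hypothetical schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (id INTEGER PRIMARY KEY, billing_group TEXT)")
conn.executemany(
    "INSERT INTO members VALUES (?, ?)", [(i, "old") for i in range(250)]
)
conn.commit()

apply_bulk_change(conn, list(range(250)), "new")
moved = conn.execute(
    "SELECT COUNT(*) FROM members WHERE billing_group = 'new'"
).fetchone()[0]
print(moved)  # 250
```

The trade-off is that the change is no longer atomic: a failure mid-run leaves some batches applied, so batched bulk operations generally need to be idempotent or resumable.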