Degraded Performance
Incident Report for Pantheon Operations
Postmortem

On March 27th at 05:45 UTC, a sudden system failure made CMS databases unavailable, causing a number of sites to experience degraded service. Content cached on our Global CDN and static file access were unaffected.

Our on-call team identified and applied the appropriate response to remediate the failure. Our accepted median time to restore service is typically 15 minutes, with data loss limited to uncommitted transactions. In this instance, the failure mode was new, which increased the time to restore service by an order of magnitude. Furthermore, some customers experienced database corruption during the remediation, which extended their data loss from the expected uncommitted transactions to all changes made since their most recent backup (either automated or manual).

We have conducted our first internal review of the incident and have notified affected customers. We will be investigating further. We identified improvements that should be made in both our documentation and tooling. We will be engaging our reliability engineering processes to determine our next steps in improving both the median time between failures and the median time to recovery, so that we can provide our customers the worry-free experience they expect with Pantheon.

We recognize the criticality of this service and we appreciate your patience as we improve our systems.

Posted Apr 06, 2021 - 12:07 PDT

Resolved
This incident has been resolved. Please open a support chat or ticket if you continue to experience any issues.
Posted Mar 27, 2021 - 17:21 PDT
Update
We are continuing to work on a fix for this issue.
Posted Mar 27, 2021 - 17:06 PDT
Update
We are continuing to work on a fix for this issue.
Posted Mar 27, 2021 - 16:09 PDT
Update
We are continuing to work on a fix for this issue.
Posted Mar 27, 2021 - 15:09 PDT
Update
We are continuing to work on a fix for this issue.
Posted Mar 27, 2021 - 14:01 PDT
Update
We are continuing to work on a fix for this issue.
Posted Mar 27, 2021 - 13:00 PDT
Update
We are continuing to work on a fix for this issue.
Posted Mar 27, 2021 - 12:00 PDT
Update
We are continuing to work on a fix for this issue.
Posted Mar 27, 2021 - 11:00 PDT
Update
We are continuing to work on a fix for this issue.
Posted Mar 27, 2021 - 10:00 PDT
Update
We are continuing to work on a fix for this issue.
Posted Mar 27, 2021 - 09:00 PDT
Update
A fix is currently being rolled out.
Posted Mar 27, 2021 - 08:00 PDT
Update
We are still implementing a fix on the database endpoint.
Posted Mar 27, 2021 - 07:03 PDT
Update
We are still implementing a fix on the database endpoint.
Posted Mar 27, 2021 - 06:12 PDT
Update
We are still implementing a fix on the database endpoint.
Posted Mar 27, 2021 - 04:42 PDT
Update
We are still implementing a fix on the database endpoint.
Posted Mar 27, 2021 - 03:36 PDT
Identified
The issue has been identified and a fix is being implemented.
Posted Mar 27, 2021 - 02:28 PDT
Update
We are still investigating a failed database endpoint.
Posted Mar 27, 2021 - 01:26 PDT
Investigating
We are investigating a failed database endpoint.
Posted Mar 26, 2021 - 23:54 PDT
This incident affected: Customer Sites.