On March 27th at 5:45 UTC, a sudden system failure made CMS databases unavailable, causing a number of sites to experience degraded service. Content cached on our Global CDN and access to static files were unaffected.
Our on-call team identified and applied the appropriate response to remediate the failure. Our accepted median time to restore service is typically 15 minutes, with data loss limited to uncommitted transactions. In this instance, the failure mode was new, which increased the time to restore service by an order of magnitude. Furthermore, some customers experienced database corruption during the remediation, which extended their data loss from the expected uncommitted transactions to all changes made since their most recent backup (either automated or manual).
We have conducted our first internal review of the incident and have notified affected customers. We will investigate further. We have identified improvements needed in both our documentation and our tooling. We will engage our reliability engineering processes to determine next steps for improving both the median time between failures and the median time to recovery, so that we provide the worry-free experience our customers expect from Pantheon.
We recognize the criticality of this service and we appreciate your patience as we improve our systems.