On September 6th, 2022, Test Analytics experienced a database outage lasting 6 hours and 33 minutes, which prevented access to the Test Analytics application and stopped our ingestion of test executions.
We have been working closely with our partners at AWS to fully understand the root cause of this issue and to put in place multiple layers of mitigation to prevent incidents like this from happening in the future.
The Test Analytics Aurora cluster consists of a single writer instance and multiple reader instances, which provide load balancing and failover targets in the event of an incident.
At 3:59 am UTC, the team was paged due to a “Database Restarted” event. Upon further inspection, the team found that after the restart, errors while fetching critical configuration information prevented the database servers, including our main instance, from restarting cleanly.
We attempted to get the application back online as soon as possible by promoting other replicas to writers, but each promoted instance exhibited the same behavior and the entire database remained unavailable, which told us that whatever had caused the issue had replicated across instances.
After noticing this, the team executed a Point-In-Time Restore (PITR) from around 20 seconds before the incident occurred to minimize data loss. Creating an entirely new cluster and restoring a database snapshot to a specific point in time is a much more time-consuming process than promoting a reader instance, but at the time we believed it would be the fastest way to get back online.
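For illustration, a point-in-time restore of an Aurora cluster through the AWS API looks roughly like the sketch below. The cluster names, region, instance class, and exact timestamp are hypothetical, and this is not the exact procedure we ran.

```python
from datetime import datetime, timezone

import boto3

rds = boto3.client("rds", region_name="us-east-1")  # region is an assumption

# Restore a new Aurora cluster from the source cluster's backups,
# targeting a timestamp shortly before the incident began.
rds.restore_db_cluster_to_point_in_time(
    SourceDBClusterIdentifier="test-analytics-primary",  # hypothetical name
    DBClusterIdentifier="test-analytics-pitr",           # hypothetical name
    RestoreType="full-copy",
    RestoreToTime=datetime(2022, 9, 6, 3, 58, 40, tzinfo=timezone.utc),
)

# The restored cluster starts with no instances; at least one writer
# instance has to be created before it can serve traffic.
rds.create_db_instance(
    DBInstanceIdentifier="test-analytics-pitr-writer",   # hypothetical name
    DBClusterIdentifier="test-analytics-pitr",
    DBInstanceClass="db.r6g.2xlarge",                    # hypothetical class
    Engine="aurora-postgresql",
)
```

The restore and instance creation are asynchronous, which is part of why this path is so much slower than promoting an existing reader.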
Unfortunately, the new database cluster experienced the same restart loop as the malfunctioning cluster. At the request of AWS Support, this forced us to restore from a much earlier backup (at around 1:18am UTC) to ensure that the new cluster would not exhibit the same issues. Once the restore was complete, we were able to connect to our database, delete the corrupted index, recreate a healthy one, and get the application back online.
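Assuming the affected index was an ordinary unique B-tree index, the drop-and-rebuild step looks roughly like this; the connection string, index, table, and column names are placeholders, not our actual schema, and in practice this was done under AWS Support's guidance.

```python
import psycopg2

# CONCURRENTLY operations cannot run inside a transaction block,
# so autocommit is required.
conn = psycopg2.connect("dbname=test_analytics")  # placeholder DSN
conn.autocommit = True

with conn.cursor() as cur:
    # Drop the corrupted unique index (hypothetical name) ...
    cur.execute("DROP INDEX CONCURRENTLY IF EXISTS executions_unique_key_idx;")
    # ... and rebuild it from the now-consistent table data.
    cur.execute(
        "CREATE UNIQUE INDEX CONCURRENTLY executions_unique_key_idx "
        "ON test_executions (suite_id, run_id, identifier);"  # hypothetical columns
    )
```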
However, this also meant that data from 1:18am UTC until around 3:59am UTC was not included in the restored cluster. This data is currently not accessible through the Test Analytics Web UI, but we have recovered it and will attempt to add it back into our archives in the future.
The Test Analytics application supports multibyte UTF-8 characters (such as Emoji or Kanji). PostgreSQL relies on a system-level library for collation-related functions such as sorting and comparing multibyte strings. Before the incident, an automated OS-level update to the GNU C Library (glibc) changed the collation ordering of a small set of multibyte characters, including some kanji. As a result, if those characters were indexed, traversing the B-tree index could take a different path before and after the upgrade.
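To make the dependency concrete, the snippet below (a simplified illustration, not taken from our systems) shows how string ordering in a program can change purely because the OS collation tables it links against change. PostgreSQL's B-tree indexes assume this ordering stays fixed once rows are indexed.

```python
import locale

words = ["ふじ", "フジ", "fuji", "藤"]

# Byte-wise ("C" locale) ordering: stable, but not linguistically aware.
locale.setlocale(locale.LC_COLLATE, "C")
print(sorted(words, key=locale.strxfrm))

# Locale-aware ordering comes from the OS collation tables (glibc on Linux).
# If a glibc upgrade changes those tables, the sort order below can change
# even though the application code and the data have not.
locale.setlocale(locale.LC_COLLATE, "ja_JP.UTF-8")  # assumes this locale is installed
print(sorted(words, key=locale.strxfrm))
```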
A combination of these factors created a situation where an INSERT operation that would normally have been rejected by the database engine for violating a uniqueness constraint succeeded. Multiple duplicate records were inserted, corrupting the unique B-tree index on both the writer instance and all of the readers. Instead of restoring from a healthy snapshot, the team was effectively recreating instances and clusters whose indices were already corrupted.
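As a toy model of this failure mode (assuming nothing about our actual schema): a unique index is an ordered structure searched with whatever comparison rules are in force at lookup time, so if those rules change between when a key was placed and when a new insert probes for it, the probe can miss the existing key and admit a duplicate.

```python
from bisect import bisect_left

def collate_old(s):
    # Stand-in for "glibc before the upgrade": plain byte-wise order.
    return s.encode("utf-8")

def collate_new(s):
    # Stand-in for "glibc after the upgrade": a deliberately different order.
    return s.swapcase().encode("utf-8")

def unique_insert(index, value, collate):
    """Insert `value` into the sorted `index`, rejecting duplicates,
    using `collate` for all comparisons (as a B-tree does)."""
    keys = [collate(v) for v in index]
    pos = bisect_left(keys, collate(value))
    if pos < len(index) and index[pos] == value:
        raise ValueError(f"duplicate key: {value!r}")
    index.insert(pos, value)

index = []
unique_insert(index, "Fuji", collate_old)
unique_insert(index, "apple", collate_old)
# unique_insert(index, "Fuji", collate_old) would correctly raise "duplicate key".

# After the "upgrade", the search uses the new ordering against entries that
# were placed under the old one, so the probe for "Fuji" lands in the wrong
# spot, the duplicate check passes, and the "unique" index now holds a duplicate.
unique_insert(index, "Fuji", collate_new)
print(index)  # ['Fuji', 'apple', 'Fuji'] -- two copies of the same key
```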
With the support of AWS, we have upgraded the underlying PostgreSQL version to one that prevents inserting bad data into the B-tree index if doing so would cause corruption. The open-source PostgreSQL codebase contains this logic as an assertion that is only active when running in debug mode.
Related to this, AWS is working with the PostgreSQL community to decouple index collation and sorting from OS-level libraries. There are already conversations around this issue on the PostgreSQL mailing list.
Finally, we’re using pg_amcheck to check all of our indices across Buildkite for corruption, and we have corrective mechanisms in place for any indices it flags as at risk of corruption.
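As a sketch of what such a check can look like (the host, role, and alerting path are assumptions, not our actual setup), a periodic job along these lines fails loudly if pg_amcheck reports problems:

```python
import subprocess

# pg_amcheck ships with PostgreSQL 14+; --heapallindexed additionally verifies
# that every heap tuple is present in the indexes it should appear in.
result = subprocess.run(
    [
        "pg_amcheck",
        "--host", "test-analytics.example.internal",  # hypothetical host
        "--username", "amcheck_monitor",               # hypothetical role
        "--all",                                       # every database on the server
        "--heapallindexed",
        "--jobs", "4",
    ],
    capture_output=True,
    text=True,
)

if result.returncode != 0:
    # Surface the report to the on-call engineer / alerting pipeline.
    raise RuntimeError(f"pg_amcheck found problems:\n{result.stdout}\n{result.stderr}")
```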
We’re beyond sorry for the disruption this may have caused. We want Test Analytics to be a tool that is reliably integrated into your workflow and your day-to-day engineering, which is why we take situations like this incredibly seriously. We would also like to take this opportunity to thank AWS for their support during this incident, and their diligent involvement with us and the community to improve resilience.