Starting at ~19:00 UTC on June 6th, 2023, new LKE clusters were unable to be created across all regions. Existing clusters were not affected and continued to run normally during this time. After an initial investigation to determine if this situation was linked to another issue, our incident response procedures were formally initiated to treat this as a separate event at 20:34 UTC. A status page was created at 20:58 UTC once the scope of the impact on customers had been identified.
During the following investigation, it was determined that the event was triggered by internal testing of new infrastructure, which resulted in a cascading failure across multiple data centers. Once the cause was determined, the offending infrastructure was turned down, and services were considered to be restored by 21:36 UTC. After further monitoring, the incident was moved to a resolved status at 0:03 UTC on June 7th.
In response to this event, our administrators are implementing steps to improve internal processes for testing new environments. This will contribute greatly to preventing failures of this fashion in the future.