At 14:58 UTC on 14th February 2023, a pending infrastructure upgrade was triggered in the EU region. The change was still undergoing testing and was applied to production due to human error. Shortly after, we began failing over to a secondary cluster; the failover encountered unexpected problems, which extended the period of instability.
API and SDK endpoints in the EU region were intermittently unavailable between 15:20 and 17:50 UTC. As a result, approximately 30% fewer Checks were performed than normal during that period.
A pending infrastructure upgrade was scheduled for testing in pre-production. Due to human error, the operation was instead triggered in production.
The nature of the upgrade did not allow the change to be rolled back, forcing a failover to a secondary cluster.
Unexpected problems during the failover, related to inaccurate replication of cluster configuration between the primary and secondary clusters, extended the incident.
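The inaccurate replication of cluster configuration described above is the kind of drift a pre-failover parity check can surface before traffic moves. The sketch below is a minimal illustration of such a check, not the actual tooling involved in the incident; the endpoints and names (CLUSTERS, fetch_cluster_config, diff_configs) are all hypothetical.

```python
# Minimal sketch of a pre-failover configuration parity check.
# All endpoints and names here are hypothetical; the real clusters
# and config sources from the incident are not public.
import json
import sys
from urllib.request import urlopen

# Hypothetical endpoints exposing each cluster's effective configuration.
CLUSTERS = {
    "eu-primary": "https://eu-primary.internal.example/config",
    "eu-secondary": "https://eu-secondary.internal.example/config",
}


def fetch_cluster_config(url: str) -> dict:
    """Fetch a cluster's effective configuration as a JSON document."""
    with urlopen(url) as resp:
        return json.load(resp)


def diff_configs(primary: dict, secondary: dict) -> list[str]:
    """Return the keys whose values differ between the two clusters."""
    keys = set(primary) | set(secondary)
    return sorted(k for k in keys if primary.get(k) != secondary.get(k))


if __name__ == "__main__":
    primary = fetch_cluster_config(CLUSTERS["eu-primary"])
    secondary = fetch_cluster_config(CLUSTERS["eu-secondary"])
    drift = diff_configs(primary, secondary)
    if drift:
        print(f"Refusing failover: configuration drift in {drift}")
        sys.exit(1)
    print("Secondary configuration matches primary; safe to fail over.")
```

Running a check like this as a gate in the failover procedure turns silent configuration drift into an explicit, pre-traffic failure.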
14:58 UTC: An unplanned infrastructure upgrade was triggered in our primary EU cluster. This resulted in intermittent issues routing external traffic to the cluster.
15:07 UTC: We began to progressively fail over to a secondary cluster.
15:20 UTC: Error rates increased, leading to a period of service instability.
15:30-16:25 UTC: We took corrective actions to address issues in the failover process and stabilise the service.
16:25-16:55 UTC: Service stability improved, with some residual degradation. Monitoring continued.
16:55 UTC: We identified residual issues specific to manual processing.
17:55 UTC: Fixes completed to resolve residual issues. Service stable.
Amend administrator access permissions to further restrict who can trigger production infrastructure upgrades (in progress; see the sketch after this list).
Enhance cluster failover process to ensure reliable failover under exceptional conditions (ETA: March 2023).
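As an illustration of the first remediation item, the sketch below shows one possible shape for the restriction: triggering an upgrade against a production environment requires an explicitly granted role, separate from ordinary administrator access. The Operator type, role strings, and environment names are hypothetical, not drawn from the actual access-control system.

```python
# Hypothetical sketch of a two-tier permission check for upgrades:
# ordinary admin access suffices for pre-production, but production
# upgrades require a separate, narrowly granted role.
from dataclasses import dataclass


PRODUCTION_ENVS = {"eu-production", "us-production"}


@dataclass
class Operator:
    name: str
    roles: set[str]


def can_trigger_upgrade(operator: Operator, environment: str) -> bool:
    """Allow production upgrades only for holders of 'prod-upgrade';
    'infra-admin' alone is enough for pre-production environments."""
    if environment in PRODUCTION_ENVS:
        return "prod-upgrade" in operator.roles
    return "infra-admin" in operator.roles


# Example: an administrator without the production role is blocked.
alice = Operator("alice", {"infra-admin"})
assert can_trigger_upgrade(alice, "eu-staging")
assert not can_trigger_upgrade(alice, "eu-production")
```

Separating the production-upgrade role from day-to-day administration means a command mistakenly aimed at production fails closed instead of executing.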