Executive Summary
At 21:21 PST on 2021-11-22, Squiz monitoring systems detected timeouts for all customers and services hosted in our Sacramento Data Centre (DC) in the US. This prompted an incident response from Squiz.
Investigations by our Platform team indicated issues with one of the routers in use in the Sacramento DC. The team manually routed network traffic through another router, which restored services at 21:45 PST on 2021-11-22.
Customer Impact
All services and applications hosted in the Sacramento DC were unavailable for 24 minutes, between 21:21 PST and 21:45 PST on 2021-11-22.
Root Cause
A standard configuration deployment to a router that forms part of the network in the Sacramento DC caused the router's management subsystem to become unresponsive, which stopped all network traffic from traversing the router. This change is a low-risk, standard operation that is carried out multiple times a day without incident, and it is applied with an automatic rollback process in case of error. On this occasion the automatic rollback failed, and attempts to revert the change manually also failed, so traffic was manually re-routed through another router to resolve the issue. Squiz are working with our hardware and software partners to investigate why a routine operation resulted in an outage.
Residual Impact
There is no residual impact or risk of recurrence known at this time. The affected hardware has been brought back into operation and is being monitored by Squiz teams.
If you require a PDF copy of this post-incident report, please contact your Squiz Customer Support Manager.
Kind regards,
Squiz Customer Care