Squiz Cloud Incident - 23rd November 2021
Incident Report for Squiz
Postmortem

Executive Summary 

At 21:21 PST 2021-11-22 Squiz monitoring systems detected timeouts for all customers and services hosted in our Sacramento Data Centre (DC) in the US. This prompted an incident response from Squiz. 

Investigations performed by our Platform team indicated issues with one of the routers in use in Sacramento DC. Our Platform team manually routed the network traffic through another router, which restored services at 21:45 PST 2021-11-22.

Customer Impact 

All services and applications hosted in the Sacramento DC were unavailable for 24 minutes, between 21:21 PST 2021-11-22 and 21:45 PST 2021-11-22.

Root Cause 

A standard configuration deployment to a router that forms part of the network in the Sacramento DC caused the router's management subsystem to become unresponsive, which stopped all network traffic from traversing the router. This change is a low-risk standard operation that is carried out multiple times a day without incident, and it is applied with an automatic rollback process in case of error. In this instance the automatic rollback process failed, and attempts to manually revert the change also failed, so the Platform team resolved the issue by manually re-routing traffic through another router. Squiz are working with our hardware and software partners to investigate why a routine operation resulted in an outage.

Residual Impact 

There is no residual impact or risk of recurrence known at this time. The affected hardware has been brought back into operation and is being monitored by Squiz teams.

If you require a PDF copy of this post-incident report, please contact your Squiz Customer Support Manager.

Kind regards,

Squiz Customer Care

Posted Nov 24, 2021 - 13:45 AEDT

Resolved
Squiz teams have deployed a fix for the current Major Incident, which has restored service for all Squiz Cloud hosted customers in our Sacramento Data Centre (DC). We apologise for this degradation of service and thank you for your patience while we worked on the resolution.

A postmortem will be provided via https://status.squiz.cloud within 24 hours.
Posted Nov 23, 2021 - 16:00 AEDT
Identified
Squiz teams have identified the cause of the current degradation of service Major Incident and are working to resolve the issue.

We will provide a further update in ~15 minutes, or once the incident is resolved.
Posted Nov 23, 2021 - 15:54 AEDT
Update
Squiz monitoring has detected a degradation of service incident that is affecting Squiz Cloud customers hosted in the USA ONLY. Multiple Squiz teams are currently investigating.

A further update will be provided in ~15 minutes.
Posted Nov 23, 2021 - 15:43 AEDT
Update
We are continuing to investigate this issue.
Posted Nov 23, 2021 - 15:37 AEDT
Investigating
Squiz monitoring has detected a degradation of service incident that is affecting Squiz Cloud customers. Multiple Squiz teams are currently investigating.

A further update will be provided via https://status.squiz.cloud in 15 minutes, or earlier if the situation or information changes.
Posted Nov 23, 2021 - 15:36 AEDT
This incident affected: Squiz Cloud Hosted Instances.