Major Incident - All Squiz Cloud Customers 26th April 2022
Incident Report for Squiz
Postmortem

Executive Summary
On the 26th of April 2022 at 12:53 AEST, Squiz monitoring systems detected a degradation of service affecting some Squiz Edge customers.

Initial investigation by Squiz teams indicated that a large number of HTTP 500 errors were being served to end users by the Squiz Edge CDN. Detailed investigations by our Data Centre team determined that a very large number of valid requests were being sent to systems hosted on Squiz Edge infrastructure in a very short period of time, resulting in a degradation of service for content served by the Sydney Edge Node. Our Data Centre team took measures to restore service. The number of HTTP 500 errors served from the Sydney Edge Node decreased to normal operational error rates by 13:36 AEST, and after a period of monitoring the incident was closed at 14:00 AEST.

Customer Impact
Impact was to Squiz Edge requests served by the Sydney Edge Node between 12:53 AEST and 14:00 AEST on the 26th of April 2022. During this time some end users may have received HTTP 500 error messages or network timeouts.

Root Cause
The root cause was an extremely high rate of requests for specific content hosted in the Squiz Cloud over a very short period of time. The Squiz Edge CDN can handle bursts of requests at this rate. In this case, however, it was business hours in Australia and New Zealand, so the Sydney Edge Node was also serving the majority of Squiz Edge traffic at the time. The burst of traffic caused a component of the Edge system in the Sydney Node to suffer performance degradation. This degradation resulted in approximately 7.5% of all requests to the Sydney Edge Node being served HTTP 500 error messages or network timeouts between 12:53 AEST and 14:00 AEST. Once the component was identified, the Data Centre team took remedial action to restore service for the component. When this was completed, the number of HTTP 500 errors served by Edge decreased to within normal operational rates.

Mitigation and Follow-up Actions
In response to this incident, our Data Centre team has taken the following actions:

  • Additional monitoring and alerting for certain Edge system components has been deployed;
  • Additional monitoring and alerting for both traffic and error rates at Squiz Edge has been deployed.
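
As an illustration only, error-rate alerting of the kind listed above can be sketched as a sliding-window check on HTTP status codes. This is not a description of the monitoring Squiz deployed: the class name `ErrorRateMonitor`, the window size, and the 5% threshold are all hypothetical values chosen for the example.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the share of HTTP 5xx responses among the most recent requests.

    Hypothetical sketch: the window size and threshold are assumptions,
    not values taken from the incident report.
    """

    def __init__(self, window_size=1000, threshold=0.05):
        self.threshold = threshold
        # Each entry is True for a 5xx response and False otherwise;
        # the deque drops the oldest entry once window_size is reached.
        self._window = deque(maxlen=window_size)

    def record(self, status_code):
        """Record the status code of one served request."""
        self._window.append(500 <= status_code <= 599)

    def error_rate(self):
        """Return the fraction of 5xx responses in the current window."""
        if not self._window:
            return 0.0
        return sum(self._window) / len(self._window)

    def should_alert(self):
        """Return True when the error rate exceeds the threshold."""
        return self.error_rate() > self.threshold


# Usage: 15 errors in 200 requests is a 7.5% error rate,
# which is above the assumed 5% threshold.
monitor = ErrorRateMonitor(window_size=200, threshold=0.05)
for _ in range(185):
    monitor.record(200)
for _ in range(15):
    monitor.record(500)
print(monitor.error_rate(), monitor.should_alert())
```

A fixed-size window keeps the check cheap and makes the rate reflect recent traffic only, so a short burst of errors shows up quickly and also ages out once service recovers.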

If you require a PDF copy of this post-incident report, please contact your Squiz Service Experience Manager or Squiz Customer Care.

Mat Walker
Major Incident Manager

Squiz Customer Care

Posted Apr 29, 2022 - 10:26 AEST

Resolved
Squiz teams have deployed a fix for the current Major Incident which has restored service for affected Squiz Cloud hosted customers. We apologise for this degradation of service and thank you for your patience while we worked on the resolution.

A postmortem will be provided via https://status.squiz.cloud within 72 hours.
Posted Apr 26, 2022 - 14:00 AEST
Monitoring
Squiz are continuing investigations into reports of intermittent HTTP 500 errors for some Squiz Edge customers. Root cause has been identified and the number of HTTP 500 errors is dropping. Squiz teams are actively monitoring the Edge systems.

A further update will be provided in ~15 minutes.
Posted Apr 26, 2022 - 13:38 AEST
Update
Squiz are continuing investigations into reports of intermittent HTTP 500 errors for some Squiz Edge customers.
Multiple Squiz teams are investigating. Root cause and ETR are unknown at this time.

A further update will be provided in ~15 minutes.
Posted Apr 26, 2022 - 13:36 AEST
Update
Squiz are investigating reports of intermittent HTTP 500 errors for some Squiz Edge customers. Multiple Squiz teams are investigating. Root cause and ETR are unknown at this time.

A further update will be provided in ~15 minutes.
Posted Apr 26, 2022 - 13:18 AEST
Investigating
Squiz monitoring has detected a degradation of service incident that is affecting Squiz Cloud customers. Multiple Squiz teams are currently investigating.

A further update will be provided in ~15 minutes.
Posted Apr 26, 2022 - 13:17 AEST
This incident affected: Squiz Cloud Hosted Instances.