Major Incident - SaaS Customers - 16 June 2022
Incident Report for Squiz
Postmortem

Executive Summary

On 16 June 2022 at ~09:17 AEST, Squiz monitoring systems detected a degradation of service affecting some customers hosted on our SaaS platform. Users may have received an error page from the Cloudflare Content Delivery Network (CDN) indicating that the origin DNS could not be resolved. This occurred for all requests for uncached content.

Investigations by our Platform team found that a few custom hostname records were pointing to an incorrect Application Load Balancer (ALB) instance. The affected hostname records were manually updated to point to the correct staging domains, and service recovered at 10:12 AEST.

Customer impact

For the duration of the incident, users may have received a Cloudflare Content Delivery Network (CDN) error page (Error 1016, served with an HTTP 530 status code) indicating that the origin DNS could not be resolved. This occurred for all requests for uncached content.
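
As an illustration of how this symptom could be detected automatically, the following is a minimal Python sketch, not a description of Squiz tooling. It assumes the error page body contains the text "1016", which is a heuristic rather than a documented contract, and the function names `is_origin_dns_error` and `probe` are hypothetical. The probed URL would need to be one that is not served from cache, since cached content was unaffected.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def is_origin_dns_error(status: int, body: str) -> bool:
    """Return True when a response looks like Cloudflare Error 1016.

    Assumption: matching the literal "1016" in the body is a heuristic
    chosen for this sketch, not a guaranteed feature of the error page.
    """
    return status == 530 and "1016" in body


def probe(url: str, timeout: float = 10.0) -> bool:
    """Fetch url and report whether it shows the origin DNS error.

    Network failures other than HTTP error responses (for example a
    timeout) are left to propagate to the caller.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", "replace")
            return is_origin_dns_error(response.status, body)
    except HTTPError as error:
        # urllib raises HTTPError for non-2xx responses such as 530.
        body = error.read().decode("utf-8", "replace")
        return is_origin_dns_error(error.code, body)
```

Keeping the classification separate from the network call means the rule can be checked without making any requests.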

Root cause

A deployment to the production instance created corrupted custom hostname records in Cloudflare. The Squiz Platform team identified the affected hostnames and manually updated them to point at the correct domains.

Mitigation and Follow-up actions

In response to this Incident, the Squiz Platform team will undertake the following actions:

  • Set up additional synthetic monitoring for custom hostnames from multiple locations, which will provide alerting if a similar issue occurs again.
  • Change the release mechanism to add further checks for custom hostname synchronization during code releases.
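
A release-time synchronization check of the kind described above could take roughly the following shape. This is a sketch under stated assumptions, not the actual Squiz release mechanism: it assumes the intended and the currently configured hostname-to-origin mappings are both available as plain dictionaries, and the function name `find_unsynced_hostnames` and the example hostnames are hypothetical.

```python
from __future__ import annotations


def _normalise(name: str) -> str:
    """DNS names are case-insensitive; also ignore a trailing dot."""
    return name.lower().rstrip(".")


def find_unsynced_hostnames(
    expected: dict[str, str],
    actual: dict[str, str],
) -> dict[str, tuple[str, str | None]]:
    """Map each mismatched hostname to (expected origin, configured origin).

    The configured origin is None when the hostname has no record at all.
    """
    mismatches: dict[str, tuple[str, str | None]] = {}
    for hostname, origin in expected.items():
        configured = actual.get(hostname)
        if configured is None or _normalise(configured) != _normalise(origin):
            mismatches[hostname] = (origin, configured)
    return mismatches


if __name__ == "__main__":
    # Hypothetical data for illustration only.
    expected = {"www.example.com": "alb-a.example.net"}
    actual = {"www.example.com": "alb-b.example.net"}
    problems = find_unsynced_hostnames(expected, actual)
    if problems:
        raise SystemExit(f"Custom hostnames out of sync: {problems}")
```

Failing the release when the returned mapping is non-empty would stop a deployment before mismatched records reach customers.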
Posted Jun 17, 2022 - 11:33 AEST

Resolved
Squiz teams have deployed a fix for the current Major Incident, which has restored service for all impacted customers. We apologise for this degradation of service and thank you for your patience while we worked on the resolution.

A postmortem will be provided via https://status.squiz.cloud .
Posted Jun 16, 2022 - 10:28 AEST
Identified
Squiz teams have identified the cause of the current Major Incident (degradation of service) and are working to resolve the issue.

We will provide a further update in ~15 minutes, or once the incident is resolved.
Posted Jun 16, 2022 - 10:19 AEST
Update
Squiz continue to investigate a degradation of service for our SaaS customers.

A further update will be provided in ~15 minutes.
Posted Jun 16, 2022 - 10:13 AEST
Update
Squiz monitoring has detected a degradation of service incident that is affecting Squiz SaaS customers. Multiple Squiz teams are currently investigating.

A further update will be provided in ~15 minutes.
Posted Jun 16, 2022 - 10:01 AEST
Investigating
Squiz monitoring has detected a degradation of service incident that is affecting Squiz Cloud customers. Multiple Squiz teams are currently investigating.

A further update will be provided via https://status.squiz.cloud in 15 minutes, or earlier if the situation or information changes.
Posted Jun 16, 2022 - 09:50 AEST
This incident affected: Squiz Cloud Hosted Instances.