System Outage
Incident Report for Customer.io Status
Postmortem

Incident Summary

On Monday, January 31st, between 19:15 and 20:20 UTC, a small subset of our services in both our US and EU datacenters experienced domain name resolution issues. As a result, some of our services were unable to communicate with each other. Initially, there was a low rate of intermittent failures in our UI. Beginning at 19:50 UTC, our outbound messaging service experienced an outage. We resolved the issue by 20:10 UTC in our US datacenter and by 20:20 UTC in our EU datacenter.

This issue did not affect inbound data ingestion to our system. Once the issue was resolved, outbound message delivery resumed. No messages were lost, because they were queued during the incident.

Customer.io would like to apologize for the impact of this outage. We are committed to learning from this event and using it to drive improvement across our services.

Root Cause

A recent DNS change did not propagate correctly across our environment. As a result, a subset of our services was misconfigured and could not resolve names, which prevented network communication between some services.

Resolution and Recovery

Once the DNS issue was identified, we updated the DNS settings on the affected services, which restored network communication between them. Queued outbound messages were then delivered, and the UI issues were resolved.

Corrective and Preventative Measures

We have identified why the DNS settings did not propagate to all of our services, and we have implemented a code fix to prevent this issue from recurring. We are also working to improve our monitoring thresholds so that we can respond faster.
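As an illustration only, the kind of name-resolution check that monitoring thresholds could be built on might look like the minimal Python sketch below. It is not Customer.io's actual fix or monitoring code. The function names, the example hostnames, and the threshold value are all hypothetical; the only real library call is the standard `socket.getaddrinfo`.

```python
import socket
from typing import Callable, Iterable, List


def find_unresolvable(
    hostnames: Iterable[str],
    resolve: Callable[[str], object] = lambda h: socket.getaddrinfo(h, None),
) -> List[str]:
    """Return the hostnames that fail to resolve.

    `resolve` is injectable so the check can be exercised without
    real DNS lookups; by default it uses the system resolver.
    """
    failed = []
    for host in hostnames:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed


def should_alert(failed: List[str], total: int, threshold: float = 0.0) -> bool:
    """Alert when the share of failed lookups exceeds `threshold`.

    The default of 0.0 (hypothetical) alerts on any single failure.
    """
    if total == 0:
        return False
    return len(failed) / total > threshold
```

A check like this could be run periodically from each service against the internal hostnames it depends on (for example, hypothetical names such as `queue.internal`), so that a partially propagated DNS change surfaces as an alert rather than as intermittent failures.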

Posted Feb 04, 2022 - 12:15 UTC

Resolved
Everything is back to normal.

We plan to release a postmortem of this incident soon. If you'd like to be notified when this update is published, you can subscribe to our status updates via the "subscribe to updates" feature of our status page: https://status.customerio.com
Posted Jan 31, 2022 - 20:40 UTC
Monitoring
We've identified and fixed a DNS issue across all our services. We will continue to monitor the situation as the resulting backlog clears. There was no loss of data.
Posted Jan 31, 2022 - 20:27 UTC
Update
The US data center is back up and we are monitoring the situation there. Investigation is still underway for the EU data center.
Posted Jan 31, 2022 - 20:22 UTC
Update
We are continuing to investigate this issue.
Posted Jan 31, 2022 - 20:11 UTC
Investigating
We are currently investigating an outage. We will provide more information via this status page as we have more to share.
Posted Jan 31, 2022 - 20:11 UTC
This incident affected: Data Collection, Data Processing, and Management Interface.