On Monday, January 31st, between 19:15 and 20:20 UTC, a small subset of our services in both our US and EU datacenters experienced domain name resolution issues. As a result, some of our services were unable to communicate with each other. Initially, there was a low rate of intermittent failures in our UI. Beginning at 19:50 UTC, our outbound messaging service experienced an outage. We resolved the issue by 20:10 UTC in our US datacenter and by 20:20 UTC in our EU datacenter.
This issue did not affect inbound data ingestion to our system. Outbound messages were queued during the incident, so no messages were lost, and delivery resumed once the issue was resolved.
Customer.io would like to apologize for the impact of this outage. We are committed to learning from this event and using it to drive improvement across our services.
A recent DNS change did not propagate correctly across our environment. This left a subset of our services misconfigured, and failing name resolution prevented some services from communicating with each other over the network.
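For readers who want to see what a check for this class of failure might look like, here is a minimal sketch in Python. It is not our internal tooling. The function name `find_unresolvable` and the injectable `resolve` parameter are hypothetical, chosen only for illustration. It tries to resolve each hostname in a list and reports the ones that fail.

```python
import socket
from typing import Callable, Iterable, List, Optional


def find_unresolvable(
    hostnames: Iterable[str],
    resolve: Optional[Callable[[str], object]] = None,
) -> List[str]:
    """Return the hostnames that fail name resolution.

    `resolve` is a hypothetical injection point so the check can be
    exercised without real DNS; by default it uses socket.getaddrinfo.
    """
    if resolve is None:
        resolve = lambda host: socket.getaddrinfo(host, None)

    failed = []
    for host in hostnames:
        try:
            resolve(host)
        except OSError:
            # socket.gaierror (resolution failure) is a subclass of OSError.
            failed.append(host)
    return failed
```

Running a check like this from each service after a DNS change, against the names that service depends on, is one way to detect incomplete propagation before it affects traffic.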
Once the DNS issue was identified, we updated the affected DNS settings, which restored network communication between the affected services. Queued outbound messages were then delivered and the UI issues were resolved.
We have identified why the DNS settings did not propagate to all of our services, and we have implemented a code fix to prevent this issue from recurring. We are also working to improve our monitoring thresholds to accelerate response times.
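To illustrate the idea behind a monitoring threshold, here is a minimal Python sketch of a rolling failure-rate check. It does not describe our actual monitoring system. The class name `FailureRateMonitor`, the window size, and the threshold value are all assumptions made for the example. It keeps the most recent results in a fixed-size window and signals an alert when the share of failures exceeds the threshold.

```python
from collections import deque


class FailureRateMonitor:
    """Track recent request outcomes and flag an elevated failure rate.

    The window size and threshold defaults are illustrative assumptions,
    not values from any real configuration.
    """

    def __init__(self, window_size: int = 100, threshold: float = 0.05):
        # deque with maxlen discards the oldest result once the window is full.
        self.results = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, success: bool) -> None:
        self.results.append(success)

    def failure_rate(self) -> float:
        if not self.results:
            return 0.0
        failures = sum(1 for ok in self.results if not ok)
        return failures / len(self.results)

    def should_alert(self) -> bool:
        return self.failure_rate() > self.threshold
```

A lower threshold surfaces a low rate of intermittent failures sooner, at the cost of more false alarms. Tuning that trade-off is the kind of work the threshold improvements involve.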