One of our clients, who integrates with us through Segment, began sending us a large volume of anonymous event data on June 17th at about 13:35 UTC. The increase in traffic was so sudden, and the volume so high, that the dedicated infrastructure that handles traffic from Segment could not scale up fast enough to cope. As a result, Customer.io began failing to serve requests originating from Segment, causing delays in data processing that affected all Customer.io clients who use Segment to send data to us.
We worked directly with Segment to mitigate the effects of this event spike and restore normal operations. In addition, we made infrastructure changes that increased our overall capacity and our ability to respond to sudden spikes in traffic.
These two actions stabilised our infrastructure at 20:50 UTC on June 17th. By 22:20 UTC we had successfully processed all backlogged data, ending the delays and closing the incident.
The sudden introduction of a high volume of traffic outpaced our infrastructure's ability to scale up. The problem was made worse by Segment repeatedly retrying failed requests: each retry was sent on top of the new incoming traffic, amplifying the load on an already saturated system.
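To illustrate the amplification dynamic (a simplified sketch, not Segment's actual retry logic; the base delay, cap, and attempt counts below are hypothetical), retrying failed requests at a fixed, short interval causes retries to stack on top of new traffic in synchronized waves. Capped exponential backoff with jitter spreads retries out over time instead:

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
	"time"
)

// retryDelay returns a capped exponential backoff with full jitter for the
// given attempt number. The base delay and cap here are illustrative values.
func retryDelay(attempt int) time.Duration {
	base := 500 * time.Millisecond
	cap := 30 * time.Second
	backoff := time.Duration(float64(base) * math.Pow(2, float64(attempt)))
	if backoff > cap {
		backoff = cap
	}
	// Full jitter: pick a random delay in [0, backoff) so retries from many
	// senders do not arrive at the backend in synchronized waves.
	return time.Duration(rand.Int63n(int64(backoff)))
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Printf("attempt %d: retry after %v\n", attempt, retryDelay(attempt))
	}
}
```

Full jitter trades a slightly longer average delay for decorrelated retries, which avoids the kind of synchronized retry pressure that amplified the load in this incident.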
Segment's suspension of the traffic that was causing the high load, combined with the changes we introduced to increase our capacity, allowed us to recover, process all traffic, and restore service.
We have increased the capacity of our infrastructure so that it can better cope with sudden bursts of traffic, and we will continue to improve our ability to respond to similar issues.