One of our clients, who integrates with us through Segment, began sending us a large volume of anonymous event data on June 17th at about 13:35 UTC. The increase in traffic was so sudden, and the volume so high, that the dedicated infrastructure that handles traffic from Segment could not scale up fast enough to cope. As a result, Customer.io began failing to serve requests originating from Segment, causing delays in data processing that affected all Customer.io clients who use Segment to send data to us.
We worked directly with Segment to mitigate the effects of this event spike and restore normal operations. In addition, we made infrastructure changes that increased our overall capacity and our ability to respond to sudden spikes in traffic.
These two actions stabilised our infrastructure at 20:50 UTC on June 17th. By 22:20 UTC we had successfully processed all backlogged data, ending the delays and closing the incident.
The sudden introduction of a high volume of traffic outpaced our infrastructure's ability to scale up. The problem was made worse by Segment repeatedly retrying failed requests: each retry was sent on top of the new incoming traffic, amplifying the load on an already saturated system.
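To illustrate the amplification dynamic (a simplified sketch, not Segment's actual retry logic; the base delay, cap, and attempt counts below are hypothetical), retrying failed requests at a fixed, short interval causes retries to stack on top of new traffic in synchronized waves. Capped exponential backoff with jitter spreads retries out over time instead:

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
	"time"
)

// retryDelay returns a capped exponential backoff with full jitter for the
// given attempt number. The base delay and cap here are illustrative values.
func retryDelay(attempt int) time.Duration {
	base := 500 * time.Millisecond
	cap := 30 * time.Second
	backoff := time.Duration(float64(base) * math.Pow(2, float64(attempt)))
	if backoff > cap {
		backoff = cap
	}
	// Full jitter: pick a random delay in [0, backoff) so retries from many
	// senders do not arrive at the backend in synchronized waves.
	return time.Duration(rand.Int63n(int64(backoff)))
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Printf("attempt %d: retry after %v\n", attempt, retryDelay(attempt))
	}
}
```

Full jitter trades a slightly longer average delay for decorrelated retries, which avoids the kind of synchronized retry pressure that amplified the load in this incident.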
Segment's suspension of the traffic that was causing the high load, combined with the changes we introduced to increase our capacity, allowed us to recover, process all traffic, and restore service.
We have increased the capacity of our infrastructure so that it can better cope with sudden bursts of traffic, and we will continue to improve our ability to respond to similar issues.