Increased Transmissions API latency & error rates, email delivery delays and event webhook delays for some US customers
Incident Report for SparkPost
Postmortem

Incident Impact Period: November 23 2020 16:53 PT - 23:38 PT 

Customers provisioned in the US region experienced delays in receiving raw event data via webhooks during a 7-hour period. No data was lost as part of this incident, and the standard delivery framework was executed (with retries) for all posts to customers’ webhook endpoints.

The root cause was a networking issue with our cloud service provider: a network endpoint used to manage and distribute event batches was not fully operational. We are working with our cloud service provider to fully understand this incident and to best protect against this networking problem in the future.

Posted Nov 30, 2020 - 14:00 EST

Resolved
This incident has been resolved.
Posted Nov 24, 2020 - 02:38 EST
Monitoring
The cause was identified and fixed. All API errors are resolved. Event webhook backups are now clearing at a rapid pace. We will give an update when the queues are clear.
Posted Nov 24, 2020 - 01:14 EST
Update
We are actively working this incident.
- Webhook delivery of event data is still delayed - we are several hours behind
- We continue to see slightly elevated error rates on calls to Transmissions API (0.1 - 0.5%)
- A small percentage of engagement events (opens, clicks) are delayed for Events API and Metrics API and in the SparkPost app reports.

We appreciate your patience.
NOTE: This incident does not impact our customers hosted in the EU region.
Posted Nov 24, 2020 - 00:31 EST
Update
Webhook data is still delayed. There is no risk of webhook data being lost to timeouts, because the first delivery attempt has not happened yet.
We are continuing to work with our service provider to resolve this issue.

Thank you for your patience.
Posted Nov 24, 2020 - 00:19 EST
Update
Webhook data is catching up; however, it is about 60 minutes behind. We continue to work with our service provider on this issue.
Engagement data delay rates are very low; most data is available in Events API and Metrics API as usual.
Thank you for your patience.
Posted Nov 23, 2020 - 23:18 EST
Update
We continue to work with our service provider on this issue. Thank you for your patience.
Posted Nov 23, 2020 - 22:28 EST
Update
Engagement event (opens, clicks) data is also delayed for Events API and Metrics API and in the SparkPost app reports.

We are continuing to work with our service provider on this issue. Thank you for your patience.
Posted Nov 23, 2020 - 21:51 EST
Update
We still see low levels of 5xx errors with the Transmissions API. Please retry all 5xx errors. We continue to work with our service provider on this issue.
Posted Nov 23, 2020 - 21:18 EST
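
For readers acting on the request above to retry 5xx errors, the following is a minimal sketch of one common approach: retrying with exponential backoff and jitter. It is not SparkPost's client code. The function name `retry_on_5xx`, its parameters, and the default attempt count and delay are illustrative assumptions, and `send` stands in for whatever call your application makes to the Transmissions API.

```python
import random
import time
from typing import Callable


def retry_on_5xx(
    send: Callable[[], int],
    max_attempts: int = 5,       # assumed limit, tune for your workload
    base_delay: float = 0.5,     # assumed starting delay in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-5xx status or attempts run out.

    send is a hypothetical callable that performs one API request and
    returns the HTTP status code. The last status seen is returned.
    """
    status = send()
    for attempt in range(1, max_attempts):
        if not 500 <= status <= 599:
            break  # success or a client error: retrying will not help
        # Exponential backoff with full jitter: wait a random time
        # between 0 and base_delay * 2^(attempt - 1) seconds.
        sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
        status = send()
    return status
```

Only 5xx responses are retried here; a 4xx response usually indicates a problem with the request itself, so it is returned to the caller unchanged.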
Update
We are continuing to work on a fix for this issue.
Posted Nov 23, 2020 - 21:05 EST
Identified
Transmissions API latency and error rates have returned to normal levels.
Posted Nov 23, 2020 - 20:47 EST
Update
We are continuing to investigate this issue.
Posted Nov 23, 2020 - 20:19 EST
Update
We are investigating increases in Transmissions API latency and error rates, and in delivery latency for some outbound messages. Message injection and outbound message delivery are not impacted - all messages are flowing as expected.
Our webhook data delivery services are running behind and some customers may see a delay in data streamed to their webhook endpoints.
(NOTE: This does not impact our customers hosted in the EU.)

We are working with our service provider to resolve this issue.
Posted Nov 23, 2020 - 20:12 EST
Investigating
We are currently investigating this issue.
Posted Nov 23, 2020 - 19:55 EST
This incident affected: Metrics API (Metrics API - USA), Transmissions API (Transmissions API - USA), Events API (Events API - USA), SparkPost Application (WebUI) (SparkPost Application - USA), SMTP Delivery (Outbound Message Delivery) (SMTP Delivery - USA), and Event Webhook Delivery Service (Event Webhooks - USA).