Incident Summary
From 17:13 UTC on Thursday, July 8th to 21:24 UTC on Friday, July 9th, and again from 19:38 to 23:13 UTC on Monday, July 19th, our streaming data service cluster in the NYM datacenter was down after a faulty configuration was introduced. Initial efforts to reconnect to the cluster and remediate the issue were unsuccessful because of the volume of concurrent reconnection requests.
Scope of Impact
During the incident windows, customers may have experienced some or all of the following:
Timeline (UTC)
2021-07-08 17:13: Incident Started
2021-07-08 17:38: Incident Ticket Created
2021-07-08 19:32: Incident Ticket Escalated
2021-07-08 20:44: First attempt to execute a configuration change to the data streaming service
2021-07-08 21:02: Second attempt to execute a configuration change to the data streaming service
2021-07-08 22:36: Third attempt to execute a configuration change to the data streaming service
2021-07-08 23:16: Fourth attempt to execute a configuration change to the data streaming service
2021-07-09 03:26: Traffic filtered to mitigate surge of reconnection requests to data streaming service
2021-07-09 21:24: Incident Resolved: Whitelisting of all applications completed.
2021-07-19 19:38: Incident Re-opened
2021-07-19 19:57: Engineering recovers and brings data streaming service back online
2021-07-19 23:13: Incident Resolved: Streaming service running and serving all client requests.
Cause Analysis
The root cause was a faulty configuration introduced to our data streaming service cluster in the NYM datacenter, which caused the cluster to crash. Subsequent efforts to reconnect to the cluster and remediate the issue were unsuccessful because the volume of concurrent reconnection requests overwhelmed the recovering service.
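This failure mode, many disconnected clients retrying at once and keeping the service from recovering, is often mitigated on the client side with jittered exponential backoff. The sketch below is illustrative only (the function name and parameters are hypothetical, not our actual client code):

```python
import random

def reconnect_delay(attempt, base=1.0, cap=60.0):
    """Return a jittered delay (seconds) before reconnection attempt N.

    Full jitter: pick uniformly from [0, min(cap, base * 2**attempt)],
    which spreads a burst of simultaneous retries across the window
    instead of letting all clients hammer the service at the same instant.
    """
    window = min(cap, base * (2 ** attempt))
    return random.uniform(0, window)
```

Without jitter, every client computes the same deterministic delay and the reconnection surge simply repeats on each retry cycle; the randomization is what breaks up the herd.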
Resolution Steps
Our engineers resolved the issue by preventing a surge of concurrent reconnection requests: a whitelisting process admitted small pools of clients to the data streaming service cluster in stages, until all applications were reconnected.
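Conceptually, the staged whitelisting amounts to partitioning the client population into small pools and admitting one pool at a time. A minimal sketch, assuming a simple batch-based admission scheme (names are hypothetical, not our production tooling):

```python
def allowlist_batches(clients, pool_size):
    """Yield successive small pools of clients to re-admit.

    Capping the number of clients reconnecting at once keeps the
    recovering cluster from being overwhelmed by a reconnection surge.
    """
    for i in range(0, len(clients), pool_size):
        yield clients[i:i + pool_size]

# Usage: admit each pool, wait for connections to stabilize, then continue.
# for pool in allowlist_batches(all_clients, pool_size=50):
#     admit(pool)
#     wait_until_stable()
```

In practice each pool would be admitted only after the previous pool's connections stabilized, so load on the cluster ramps up gradually rather than all at once.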
Next Step(s)
The incident has been fully resolved. We have patched the faulty configuration and are monitoring our systems closely. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.