UI/API Login Delays and Errors
Incident Report for Xandr
Postmortem

Incident Summary

From approximately 11:00 to 18:09 UTC on Thursday, January 28, 2021, users were unable to access the Console UI and API due to an unexpected increase in database load.

Scope of Impact

During the incident window, users were unable to authenticate to Console UI and API.

Timeline (in UTC)

2021-01-28 11:00: Incident started: spike in traffic from multiple sources

2021-01-28 14:05: Internal incident alerts are triggered

2021-01-28 14:06: Incident is escalated to engineering

2021-01-28 16:40: Key sources of high authentication traffic have been shut down, and authentication endpoints return to normal service. Engineers continue to monitor

2021-01-28 18:09: Incident fully resolved. Console UI and API fully restored

Cause Analysis

Increased authentication traffic overloaded the related databases. This degraded the systems used to shed load in times of high usage, and caused authentication traffic to be either dropped or severely slowed down.

Resolution Steps

Our engineers and support teams worked to identify the sources of extreme authentication traffic and to temporarily stop that traffic. Once the load dropped, database optimizations were put in place to prevent slowdowns in the load-shedding systems.
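As an illustration of the load-shedding idea described above, the following is a minimal sketch of a per-source token bucket whose state lives in process memory, so the decision to shed a request does not itself need a database query. It is not Xandr's implementation: the class names `TokenBucket` and `AuthLoadShedder`, the `rate` and `capacity` values, and the idea of keying on a source identifier are all assumptions made for the example.

```python
import time


class TokenBucket:
    """Allows `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so normal clients are not penalized
        self.updated = now

    def allow(self, now):
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class AuthLoadShedder:
    """Hypothetical in-memory limiter: one bucket per traffic source."""

    def __init__(self, rate=5.0, capacity=10.0, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the behavior can be tested
        self.buckets = {}

    def allow(self, source):
        now = self.clock()
        bucket = self.buckets.get(source)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[source] = bucket
        return bucket.allow(now)
```

In this sketch, an authentication handler would call `AuthLoadShedder().allow(source_id)` before doing any database work and reject the request (for example with an HTTP 429 response) when it returns `False`. A production limiter would also need to evict idle buckets and share state across hosts, both of which are omitted here.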

Next Steps

Identification of other potential performance hotspots in authentication

Improved ability to mitigate and shed load from sources of unusually high authentication traffic

Expanding metrics and monitoring to better identify large sources of traffic to authentication
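The monitoring item above can be pictured with a small sketch that counts authentication requests per source over a sliding time window and reports the heaviest sources. This is an assumption-laden example rather than a description of Xandr's tooling: `AuthTrafficMonitor`, the 60-second window, and the source labels are hypothetical.

```python
from collections import Counter, deque


class AuthTrafficMonitor:
    """Hypothetical sliding-window counter of authentication requests per source."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, source), in arrival order
        self.counts = Counter()  # requests per source inside the window

    def record(self, source, now):
        self.events.append((now, source))
        self.counts[source] += 1
        self._expire(now)

    def _expire(self, now):
        # Drop events that have aged out of the window and adjust the counts.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_source = self.events.popleft()
            self.counts[old_source] -= 1
            if self.counts[old_source] == 0:
                del self.counts[old_source]

    def top_sources(self, now, n=3):
        self._expire(now)
        return self.counts.most_common(n)
```

Feeding each authentication attempt to `record(source, now)` and periodically exporting `top_sources(now)` as a metric would make a single dominant source visible without querying the authentication database.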

Posted Feb 09, 2021 - 22:14 UTC

Resolved

The incident has been fully resolved. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

Posted Jan 28, 2021 - 16:57 UTC

Monitoring

We have patched the issue and are monitoring our systems closely. We will provide an update as soon as the issue has been fully resolved.

Posted Jan 28, 2021 - 16:51 UTC

Identified

We have identified the cause of the issue, and our engineers are actively working towards a resolution. We will provide an update as soon as possible. Thank you for your patience.

Posted Jan 28, 2021 - 16:38 UTC

Investigating

We are currently investigating the following issue:

  • Component(s): Console API, Console UI
  • Impact(s):
    • Latency, timeouts and errors in API
    • Some users unable to log in
  • Severity: Major Outage
  • Datacenter(s): Global

We will provide an update as soon as more information is available. Thank you for your patience.

Posted Jan 28, 2021 - 14:23 UTC