API Calls Intermittently Returning 502, 503, and 504 Errors
Incident Report for Xandr
Postmortem

Incident Summary
On Friday, November 26th, there was a spike in traffic in the AMS1 data center. By Sunday, November 28th, traffic reached a level that caused an increase in intermittent 502, 503, and 504 error responses for API requests to some services.

Scope of Impact
During the incident window, some API services returned intermittent 502, 503, and 504 errors. These errors disrupted workflows and required some API calls to be retried multiple times before they returned a 200 response.
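For API clients affected by errors like these, a retry loop with exponential backoff and jitter is one common way to ride out intermittent gateway failures. The following is a minimal sketch, not Xandr's recommended client code: `call_with_retry`, the `send` callable, and all delay values are hypothetical names and defaults chosen for illustration.

```python
import random
import time
from typing import Callable, Tuple

# Status codes treated as transient gateway errors (the ones seen in this incident).
RETRYABLE_STATUSES = {502, 503, 504}


def call_with_retry(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call `send` until it returns a non-retryable status or attempts run out.

    `send` is a hypothetical zero-argument callable that performs one API
    request and returns (status_code, body). The last response is returned
    even if it is still a 502, 503, or 504.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE_STATUSES:
            return status, body
        if attempt < max_attempts - 1:
            # Exponential backoff capped at max_delay, with full jitter so that
            # many clients retrying at once do not synchronize their requests.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
    return status, body
```

Retrying is only safe for requests that are idempotent; a write that may have succeeded before the gateway error should be verified rather than blindly repeated.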

Timeline (UTC)
2021-11-26 22:00:00: Incident Started
2021-11-28 10:00:00: Issue was reported and investigation began; the issue was originally thought to be related to another open incident
2021-11-28 15:52:00: Issue was determined to be unrelated to the other incident and a new incident was created
2021-11-28 16:09:00: Internal escalation to engineering
2021-11-29 02:36:00: Cause of incident discovered and addressed
2021-11-29 09:00:00: Incident resolved

Cause Analysis
The traffic increase was caused by a recurring workflow being run by 14 user seats instead of the single instance used previously. The resulting overload produced the intermittent 502, 503, and 504 errors.

Resolution Steps
The problem workflow was reverted, and a new load-balancing solution was implemented to handle the increased traffic.

Next Steps

  • Add new alerts for 502, 503, and 504 errors to detect the problem earlier in the future
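An alert of this kind is often built as an error-rate check over a sliding time window. The sketch below illustrates the idea only; it does not describe the alerting actually deployed. The class name `GatewayErrorAlert`, the 5-minute window, the 5% threshold, and the minimum request count are all assumptions.

```python
from collections import deque
from typing import Deque, Tuple

# Status codes that count toward the alert (the ones named in this report).
ALERT_STATUSES = {502, 503, 504}


class GatewayErrorAlert:
    """Tracks the share of 502/503/504 responses over a sliding time window."""

    def __init__(
        self,
        window_seconds: float = 300.0,  # assumed 5-minute window
        threshold: float = 0.05,        # assumed 5% error-rate threshold
        min_requests: int = 20,         # avoid alerting on tiny samples
    ) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_requests = min_requests
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, is_error)
        self._errors = 0

    def record(self, timestamp: float, status: int) -> None:
        """Record one response; timestamps are expected in non-decreasing order."""
        is_error = status in ALERT_STATUSES
        self._events.append((timestamp, is_error))
        self._errors += int(is_error)
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop responses that have fallen out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            _, was_error = self._events.popleft()
            self._errors -= int(was_error)

    def should_alert(self, now: float) -> bool:
        """Return True when the windowed error rate meets the threshold."""
        self._evict(now)
        total = len(self._events)
        if total < self.min_requests:
            return False
        return self._errors / total >= self.threshold
```

Measuring a rate rather than a raw count keeps the alert meaningful as traffic grows, and the minimum request count prevents a single failed call during a quiet period from paging anyone.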
Posted Dec 17, 2021 - 22:08 UTC

Resolved

The incident has been fully resolved. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

Posted Dec 01, 2021 - 17:10 UTC
Monitoring

We have patched the issue and are monitoring our systems closely. We will provide an update as soon as the issue has been fully resolved.

Posted Nov 29, 2021 - 15:58 UTC
Investigating

We are currently investigating the following issue:

  • Component(s): API
  • Impact(s):
    • Latency, timeouts and errors in API
  • Severity: Partially Degraded
  • Datacenter(s): FRA1

We will provide an update as soon as more information is available. Thank you for your patience.

Posted Nov 28, 2021 - 17:14 UTC