API Calls Intermittently Returning 403 Errors
Incident Report for Xandr
Postmortem

Incident Summary

On Tuesday, Nov 23, 2021 at approximately 16:43 UTC, a load balancing issue combined with a bug in authorization handling caused intermittent 403 errors. As a result, API calls to fetch brand data failed, disrupting user workflows.

Scope of Impact

Users may have experienced intermittent 403 errors on API calls that retrieve brand data.

Timeline (UTC)

2021-11-23 16:43 Incident Started: Customers report receiving 403s when calling the /brand endpoint
2021-11-24 14:48 Incident Ticket Created
2021-11-24 16:48 Escalated to Engineering
2021-11-24 16:58 Load balancing issue identified
2021-11-24 17:29 Mitigation of unbalanced traffic completed
2021-11-24 23:08 Bug in logic handling database cache refresh identified
2021-11-26 19:07 Load balancing issues determined to be the underlying cause. Additional hardware brought online to address them
2021-11-30 14:11 Additional client reports of intermittent errors
2021-12-03 17:27 Incident Resolved: Intermittent 403s mitigated

Cause Analysis

The incident was caused by an API bug impacting authorization in conjunction with a load balancing issue.

Resolution Steps

Our engineers resolved the issue by implementing a fix to the authorization service. Increasing bandwidth in the AMS datacenter helped mitigate the issue further. Research into properly load balancing services is ongoing.

Next Steps

  • Continue monitoring the issue
  • Implement more robust alerting for 40* errors from internal applications
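
As an illustration of the alerting item above, the following is a minimal sketch of a sliding-window error-rate check. It is not the alerting that was or will be deployed. The class name `ErrorRateAlert`, the window size, the threshold, and the reading of "40*" as status codes 400 through 409 are all assumptions made for this example.

```python
from collections import deque


class ErrorRateAlert:
    """Hypothetical sliding-window check over the most recent responses."""

    def __init__(self, window=100, threshold=0.05):
        # window and threshold are illustrative values, not production settings.
        self.threshold = threshold
        self.statuses = deque(maxlen=window)

    def record(self, status_code):
        """Record one response status; return True if the alert should fire."""
        self.statuses.append(status_code)
        # Wait for a full window so a single early error does not trigger an alert.
        if len(self.statuses) < self.statuses.maxlen:
            return False
        # Assumption: "40*" means HTTP status codes 400 through 409.
        errors = sum(1 for s in self.statuses if 400 <= s <= 409)
        return errors / len(self.statuses) > self.threshold
```

A check of this kind would be fed the status code of each response from an internal application and would fire once the share of 40* responses in the window exceeds the threshold.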
Posted Dec 17, 2021 - 18:08 UTC

Resolved

The incident has been fully resolved. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

Posted Dec 06, 2021 - 09:35 UTC
Identified

We have identified the cause of the issue, and our engineers are actively working towards a resolution. We will provide an update as soon as possible. Thank you for your patience.

Posted Nov 28, 2021 - 17:15 UTC
Investigating

We are currently investigating the following issue:

  • Component(s): API
  • Impact(s):
    • Latency, timeouts and errors in API
  • Severity: Partially Degraded
  • Datacenter(s): FRA1

We will provide an update as soon as more information is available. Thank you for your patience.

Posted Nov 28, 2021 - 16:12 UTC