Spreedly API Errors
Incident Report for Spreedly
Postmortem

November 5, 2021: Intermittent 500 responses from Core due to an internal dependency failing AWS ELB health checks

From 1:00 AM UTC until 1:10 AM UTC on 2021-11-05, requests to the Spreedly Core API intermittently returned 500 error responses across all API request types. Approximately 10,000 requests received a 500 error response during this time.

What Happened

On 2021-11-05 at 1:02 AM UTC, Spreedly’s internal monitoring detected an elevated number of error responses from the Spreedly Core API. Engineers were paged and began investigating. They identified an internal system dependency that had been partially unavailable since 1:00 AM UTC. An automated antivirus scan running on that dependency constrained resources on a subset of its hosts. An automated health check process then marked those hosts as “unhealthy” and removed them from service. New hosts were automatically brought into service, and normal operations resumed at approximately 1:08 AM UTC.
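
The health check behavior described above can be illustrated with a minimal sketch. It is a simplified simulation, not Spreedly's or AWS's actual implementation: the host names, the timeout, the failure threshold, and the probe timings are all hypothetical values chosen for illustration. A host whose probes exceed the timeout on consecutive checks is taken out of service, while hosts that respond in time stay in rotation.

```python
from dataclasses import dataclass

# Hypothetical values for illustration; not Spreedly's or AWS's actual settings.
CHECK_TIMEOUT_S = 5.0      # a probe slower than this counts as a failure
UNHEALTHY_THRESHOLD = 2    # consecutive failures before removal from service


@dataclass
class Host:
    name: str
    consecutive_failures: int = 0
    in_service: bool = True


def record_probe(host: Host, response_time_s: float) -> None:
    """Update a host's state from the result of one health check probe."""
    if response_time_s <= CHECK_TIMEOUT_S:
        # A timely response resets the failure streak.
        host.consecutive_failures = 0
        return
    host.consecutive_failures += 1
    if host.consecutive_failures >= UNHEALTHY_THRESHOLD:
        host.in_service = False


def simulate() -> dict:
    """Run three rounds of probes and report which hosts remain in service."""
    hosts = {name: Host(name) for name in ("host-a", "host-b", "host-c")}
    # Simulated probe response times in seconds; host-b is slowed by a scan.
    rounds = [
        {"host-a": 0.2, "host-b": 0.3, "host-c": 0.2},
        {"host-a": 0.2, "host-b": 9.0, "host-c": 0.3},
        {"host-a": 0.3, "host-b": 12.0, "host-c": 0.2},
    ]
    for probes in rounds:
        for name, seconds in probes.items():
            if hosts[name].in_service:
                record_probe(hosts[name], seconds)
    return {name: host.in_service for name, host in hosts.items()}


if __name__ == "__main__":
    print(simulate())
```

In this sketch only the resource-constrained host is removed. Requests routed to such a host before its removal are the ones that would be expected to fail, which matches the intermittent nature of the errors.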

Next Steps

Spreedly engineers have made changes to mitigate the effects of the automated antivirus scan such that it should no longer cause the system to become unresponsive.

Posted Nov 10, 2021 - 16:24 EST

Resolved
After deploying the fix, all systems appear to be stabilized and functioning. The incident is being considered resolved.

We are still investigating the specific causes of the incident and any residual impact. A post-incident review will be published.

We apologize for any inconvenience and disruption to service.
Posted Nov 05, 2021 - 22:51 EDT
Investigating
We have identified an issue causing intermittent 500 errors on Spreedly's Core API.

This began impacting a small subset of transactions and requests to Spreedly's API at 0:00 UTC.

The 500 errors caused a brief service degradation between 0:00 UTC and 0:07 UTC.

We are actively investigating the root cause.

Updates will be provided as they become available.
Posted Nov 05, 2021 - 21:50 EDT
This incident affected: Core Transactional API.