At 09:54 UTC on 2021-09-18, Spreedly detected an increase in 500 response codes being returned to customers. Spreedly immediately investigated the issue and identified that a portion of requests to one instance of the Core API service were returning 500 errors due to DNS lookup failures to our service provider. The DNS service issues for this host were resolved by 10:18 UTC, and the service resumed normal operation.
At 09:54 UTC on 2021-09-18, internal monitoring detected an unusual number of errors being returned by the Core API service to customers; engineers were paged and began investigating. The issue was isolated to a subset of requests to a single host in the cluster and resolved without intervention at 10:18 UTC. Over the course of this 24-minute degradation of service, approximately 7% of all requests were affected by DNS lookup failures to Spreedly’s upstream DNS service.
Spreedly will pursue additional monitoring and resiliency improvements to DNS services within the API environment, with the goal of reducing recovery time in the face of upstream service failures.