Elevated Render Errors
Incident Report for imgix
Postmortem

What happened?

On June 10, 2021, between 1:50 UTC and 2:15 UTC, the rendering API returned a significant number of errors for uncached derivative images. The issue was identified and fixed, though a small percentage (<0.01%) of renders continued to return errors until another fix was pushed out at 2:54 UTC.

The incident was marked as fully resolved at 4:10 UTC.

How were customers impacted?

On June 10 between 1:50 UTC and 2:15 UTC, a significant number of requests for uncached derivative images returned 503 errors. At the peak, 6% of all requests to imgix returned an error.

We began implementing a fix at 2:10 UTC, and it was fully rolled out by 2:15 UTC. After the fix, error rates returned to almost normal levels (<0.01%). A later patch restored the entire service to normal at 2:54 UTC.

What went wrong during the incident?

Our engineers were alerted to an increasing number of error responses from an internal service. While investigating, they identified that a misconfiguration during routine network maintenance had caused a DNS-related failure within our infrastructure. We also found that our failover systems had not mitigated the issue as expected.

Our engineers immediately corrected the misconfiguration and restored DNS, which brought back the majority of service. They then detected rendering instability affecting a very small percentage of images. Our engineering team continued to investigate and pushed out a fix by 2:54 UTC.

What will imgix do to prevent this in the future?

We will revisit current workflows and standard operating procedures and perform an architectural review of system dependencies. In addition, imgix plans to improve coordination around scheduled maintenance to avoid service disruptions related to network changes.

Posted Jun 22, 2021 - 16:20 PDT

Resolved
No new errors have been reported since the fixes were applied. This incident has been completely resolved.
Posted Jun 09, 2021 - 21:09 PDT
Monitoring
The additional fixes have been implemented and service has returned to normal. We will continue to monitor the situation.
Posted Jun 09, 2021 - 19:54 PDT
Identified
Error rates are slowly increasing. Our engineers are implementing additional fixes.
Posted Jun 09, 2021 - 19:42 PDT
Monitoring
A fix has been implemented and service has returned to normal. We will continue to monitor the situation.
Posted Jun 09, 2021 - 19:16 PDT
Identified
The issue has been identified and a fix is being implemented.
Posted Jun 09, 2021 - 19:08 PDT
Investigating
We are currently investigating elevated render error rates for images. We will update once we obtain more information.
Posted Jun 09, 2021 - 18:58 PDT
This incident affected: Rendering Infrastructure.