Elevated rendering errors
Incident Report for imgix
Postmortem

What happened?

On December 17, 2021, at 05:06 UTC, some uncached requests to the imgix service began to return a 503 response.

By 05:36 UTC the issue had been completely resolved.

How were customers impacted?

Between 05:01 UTC and 05:36 UTC, some requests for non-cached derivative images returned a 503 error, with the error rate peaking at 10% for parts of the incident.

At 05:07 UTC, error rates began to decrease slowly, though a 5% error rate persisted until a fix was pushed at 05:36 UTC, which completely restored the service.

What went wrong during the incident?

Large, unexpected traffic patterns triggered a problematic interaction with newly built internal automation, causing the initial incident.

Our team pushed mitigations early in the incident, but those mitigations had further unexpected interactions with the newly built automation. While the service did begin to recover, recovery was slower than expected because of these interactions.

Once the interaction was identified, another manual change was made which completely restored the service.

What will imgix do to prevent this in the future?

We will be adding tooling that enables us to identify proximate causes more quickly during incidents. We will also internally document the interactions and behaviors of our existing automation and mitigation runbooks to ensure smoother recoveries in the future. We also identified improvement opportunities in some of our existing automation, and that fine-tuning has been completed.

Posted Dec 22, 2021 - 11:19 PST

Resolved
This incident has been resolved.
Posted Dec 16, 2021 - 21:58 PST
Monitoring
A fix has been implemented and error rates have returned to normal. We are monitoring the situation.
Posted Dec 16, 2021 - 21:42 PST
Investigating
We are currently investigating elevated render error rates for uncached derivative images. We will update once we obtain more information.

Previously cached derivatives are not impacted.
Posted Dec 16, 2021 - 21:33 PST
This incident affected: Rendering Infrastructure.