Elevated rendering errors
Incident Report for imgix
Postmortem

What happened?

On November 11 at 00:55 UTC (November 10, 16:55 PST), the imgix rendering service experienced a major outage affecting uncached image renders. Our engineers traced the issue to unexpected side effects from a scheduled update to our infrastructure, which we then rolled back. The service was restored by 01:58 UTC, although a very small percentage of customers experienced slightly higher latency. The incident was marked as resolved at 02:17 UTC.

How were customers impacted?

During this incident, requests for some uncached derivative images received a 503 response. We estimate that 10% of all requests to the imgix service returned a 503 error during this period.

What went wrong during the incident?

A scheduled update to our infrastructure triggered the incident, and we began a rollback to restore the service. The nature of the change was such that the issues did not manifest until it had been rolled out across our entire fleet. During the rollback, we ran into tooling limits for interacting with specific clusters in the imgix stack and for initiating batch changes, which slowed rollback progress. The service was restored immediately after the rollback was complete.

What will imgix do to prevent this in the future?

We will be revisiting our deployment patterns to ensure more consistent updates, and implementing better internal monitoring for deployments. We are also upgrading our tooling, which will improve our ability to quickly push updates across our infrastructure and provide additional failover functionality for redundancy.

Posted Nov 13, 2020 - 14:26 PST

Resolved
This incident has been resolved.
Posted Nov 10, 2020 - 18:17 PST
Monitoring
A fix has been rolled out and errors have returned to normal levels.
Posted Nov 10, 2020 - 17:58 PST
Update
We are continuing to roll out a fix. We will provide an additional update in the next 30 minutes if not sooner.
Posted Nov 10, 2020 - 17:45 PST
Identified
The issue has been identified and a fix is being implemented. We will provide an additional update in the next 30 minutes if not sooner.
Posted Nov 10, 2020 - 17:15 PST
Investigating
We are currently investigating elevated render error rates for uncached derivative images. We will update once we obtain more information.

Previously cached derivatives are not impacted.
Posted Nov 10, 2020 - 17:00 PST
This incident affected: Rendering Infrastructure.