On November 10 at 00:55 UTC, the imgix rendering service experienced a major outage affecting uncached image renders. Our engineers traced the issue to unexpected side effects from a scheduled update to our infrastructure, which we then rolled back. The service was restored by 01:58 UTC, with a very small percentage of customers experiencing slightly higher latency. The incident was marked as resolved at 02:17 UTC.
During this incident, requests for some uncached derivative images received a 503 response. We estimate that 10% of all requests to the imgix service returned a 503 error during this time.
A scheduled update to our infrastructure triggered the incident, and in response we began implementing a rollback strategy to restore the service. The nature of the change was such that the issues did not manifest until it had been rolled out across our entire fleet. During the rollback, we ran into tooling limits for interacting with specific clusters in the imgix stack and for initiating batch changes, which slowed rollback progress. The service was restored immediately after the rollback was complete.
We will be revisiting our deployment patterns to ensure more consistent updates, and we will be implementing better internal monitoring for deployments. We are also upgrading our tooling, which will improve our ability to quickly push updates across our infrastructure and provide additional failover functionality for redundancy.