Elevated rendering errors
Incident Report for imgix
Postmortem

What happened?

On March 2, 2021, at 19:50 UTC, the imgix rendering service experienced network instability, which triggered an outage affecting some uncached image renders. Mitigations were implemented, which enabled the service to begin recovery by 20:10 UTC.

How were customers impacted?

During this incident, requests for some uncached derivative images received error responses. Approximately 3.5% of requests returned an error during the peak of the incident, between 19:50 UTC and 20:10 UTC, with service fully restored for the majority of customers by 20:11 UTC. The incident was marked as fully resolved at 22:47 UTC.

What went wrong during the incident?

Our engineers were alerted to an increasing number of errors generated by our rendering stack. The cause was a brief spate of network instability, which eventually culminated in cascading failures across our origin cache. Our engineers then identified the cause of the failures and applied mitigations using new tooling, which minimized the duration and effect of system failures on imgix traffic.

What will imgix do to prevent this in the future?

While our recovery was swift thanks to newly implemented tooling, we will be making a few improvements to our incident runbooks and processes to shorten response times to incident alerts. We will also improve monitoring of network connectivity and implement tooling that lets us rapidly shift traffic to alternate paths in the event of network instability.

Posted Mar 05, 2021 - 13:48 PST

Resolved
Service has been completely restored.
Posted Mar 02, 2021 - 14:47 PST
Monitoring
Our engineering team has applied a fix, restoring services for all customers. We are currently monitoring the situation.
Posted Mar 02, 2021 - 14:05 PST
Update
Service has been restored for most affected customers. Our engineering team is continuing to investigate individual cases of service degradation.

Previously cached derivatives are not impacted.
Posted Mar 02, 2021 - 12:54 PST
Update
Errors are significantly down but not yet back to normal levels. Our engineering team is continuing to deploy mitigations.
Posted Mar 02, 2021 - 12:51 PST
Identified
The issue has been identified and our engineering team is deploying a fix.
Posted Mar 02, 2021 - 12:13 PST
Investigating
We are currently investigating elevated render error rates for uncached derivative images. We will update once we obtain more information.

Previously cached derivatives are not impacted.
Posted Mar 02, 2021 - 11:54 PST
This incident affected: Rendering Infrastructure.