Slow renders
Incident Report for imgix
Postmortem

What happened?

On May 18, 2021, at 12:54 UTC, the imgix service experienced a disruption caused by long-running processes within our origin cache. Our engineers identified the issue and began remediation by 13:07 UTC. Error rates began subsiding by 13:27 UTC, and service was fully restored by 14:20 UTC.

Hours later, on May 19, 2021, at 2:24 UTC, imgix experienced a second issue involving slow renders and timeouts. Ongoing work from the earlier incident interrupted the service’s typical automatic recovery, so manual intervention was required. Recovery began at 3:15 UTC, and service was fully restored by 3:55 UTC.

How were customers impacted?

During the first incident, customers may have seen errors returned for some uncached derivative images.

During the subsequent incident, some uncached derivatives took longer than normal to render, with some requests timing out.

In both events, cached derivative images were not impacted and continued to be served as normal. 

What went wrong during the incident?

Our engineers were alerted to an increasing number of error responses from our service. While investigating, they identified a bottleneck in our origin cache. They isolated the issue and implemented limits to prevent the service from stalling again after recovery.

After this incident subsided, we detected slowness in image rendering, which eventually culminated in timeouts for some requests. This later incident required manual intervention to restart some components. A combination of rate limiting and component restarts aided service recovery.

What will imgix do to prevent this in the future?

We will continue to fine-tune our tooling to detect and isolate problems before they can trigger larger failures. We will also implement changes that address the unexpected origin behavior observed during the incident.

Posted May 26, 2021 - 12:39 PDT

Resolved
This incident has been resolved.
Posted May 18, 2021 - 22:22 PDT
Update
The fix for slow renders is continuing to roll out and we are seeing response times improve. We will continue the rollout and update again once we have more information.
Posted May 18, 2021 - 21:45 PDT
Update
Error rates have returned to normal levels. The engineering team is continuing to work on slow renders. Previously cached derivatives are not impacted.
Posted May 18, 2021 - 21:07 PDT
Identified
The issue has been identified and our engineering team is developing a fix.
Posted May 18, 2021 - 20:18 PDT
This incident affected: Rendering Infrastructure.