On May 15, 2020 at 02:25 UTC, the imgix service saw elevated latency when retrieving images via our origin cache. This caused an increase in error rates for uncached derivative images. The imgix engineering team implemented remediations that restored normal service by 03:05 UTC.
During the incident, customers may have noticed some uncached derivative images returning an error. We saw up to 15% of requests fail.
Cached derivative images were not impacted and continued to be served as normal.
The issue was quickly identified as slow origins impacting the service. Remediations that had already been put into place were expected to enable the system to recover on its own, and we have seen the system do so in similar circumstances.
This time, however, an older configuration had been loaded onto some servers, which left those servers unable to recover on their own. The engineering team had to manually intervene to restore each server to a good state and allow the system to recover.
What will imgix do to prevent this in the future?
This incident exposed a weak spot in our infrastructure: a component that did not already have instant rollbacks. We will be addressing this immediately.
Automated slow origin detection and rate limiting have already been deployed to isolate the impact, and additional capacity has been added to accommodate general traffic increases.
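
For readers unfamiliar with the idea, slow origin detection combined with rate limiting can be sketched roughly as follows. This is an illustrative sketch only and does not describe imgix's actual implementation: the class name, the thresholds, and the choice of a moving average of latency with a concurrency cap are all assumptions made for the example.

```python
from collections import defaultdict


class SlowOriginLimiter:
    """Hypothetical sketch: track per-origin latency, cap concurrency for slow origins."""

    def __init__(self, slow_threshold_s=2.0, alpha=0.2, slow_max_in_flight=2):
        self.slow_threshold_s = slow_threshold_s      # assumed cutoff for "slow"
        self.alpha = alpha                            # smoothing factor for the moving average
        self.slow_max_in_flight = slow_max_in_flight  # assumed cap for slow origins
        self.avg_latency = {}                         # origin -> smoothed latency (seconds)
        self.in_flight = defaultdict(int)             # origin -> fetches currently running

    def is_slow(self, origin):
        # An origin with no recorded fetches is treated as fast.
        return self.avg_latency.get(origin, 0.0) > self.slow_threshold_s

    def try_acquire(self, origin):
        """Return True if a fetch to this origin may start now."""
        if self.is_slow(origin) and self.in_flight[origin] >= self.slow_max_in_flight:
            return False  # reject early instead of tying up another worker
        self.in_flight[origin] += 1
        return True

    def release(self, origin, latency_s):
        """Record a finished fetch and update the origin's smoothed latency."""
        self.in_flight[origin] -= 1
        prev = self.avg_latency.get(origin)
        if prev is None:
            self.avg_latency[origin] = latency_s
        else:
            self.avg_latency[origin] = self.alpha * latency_s + (1 - self.alpha) * prev
```

In this sketch, a caller would check try_acquire before fetching from an origin and call release with the observed latency afterwards. Requests to an origin flagged as slow are rejected once its cap is reached, so one slow origin cannot consume capacity needed by others.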