Delay in Dashboard Analytics
Incident Report for imgix
Postmortem

What happened?

On September 01, 2021, at 15:33 UTC, analytics and logging for imgix usage abruptly stopped. During this time, no customer analytics were recorded. This includes data related to image bandwidth, Origin Image counts, and other usage data typically generated from image requests. The issue went unnoticed until the next day, September 02, at 15:44 UTC, when a fix was pushed that immediately resumed logging.

How were customers impacted?

Customers lost approximately 24 hours of imgix analytics data, though we were able to completely recover Origin Image counts. The affected time range for missing analytics spans from September 01, 2021, 15:33 UTC to September 02, 2021, 15:44 UTC.

In the dashboard, this appears as dramatically lower bandwidth counts for 09/01/2021 and 09/02/2021. All other analytics data (such as network usage, audience analytics, and network health) will also show missing data for that time period.

What went wrong during the incident?

On September 01, at 15:33 UTC, our engineering team deployed a breaking change that affected data logging in imgix. This change had been tested before being pushed to production, but we lacked monitoring on key measurements that would have let us catch the issue before going live with the change. Consequently, the issue went unnoticed until the next day, when one of our staff members noticed that analytics was not reporting any data in the dashboard.

Once the issue was identified, our engineers rolled back the change to restore logging functionality. While we were able to recover Origin Image counts, most of the other analytics data (bandwidth, audience analytics, network logs) was lost during the logging outage.

What will imgix do to prevent this in the future?

On the monitoring side, we will track metrics such as bandwidth and usage data and trigger internal alerts when those metrics deviate greatly from their usual levels. These changes will be implemented across all applicable systems.
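
As a rough illustration of the kind of deviation check described above, the following Python sketch compares a recent window of a metric against a baseline window and flags a large relative change. It is not imgix's actual monitoring; the function name `should_alert`, the 50% threshold, and the sample values are all assumptions made for this example.

```python
from statistics import mean


def should_alert(recent, baseline, threshold=0.5):
    """Return True when the recent average deviates from the baseline
    average by more than `threshold` (a relative fraction).

    Hypothetical helper: the name, the threshold, and the windowing
    are illustrative assumptions, not a description of a real system.
    """
    # No samples at all is treated as an alert, since a logging outage
    # can show up as missing data rather than as low values.
    if not recent or not baseline:
        return True

    baseline_avg = mean(baseline)
    recent_avg = mean(recent)

    # Avoid dividing by zero: with a zero baseline, any non-zero
    # recent value counts as a deviation.
    if baseline_avg == 0:
        return recent_avg != 0

    relative_change = abs(recent_avg - baseline_avg) / baseline_avg
    return relative_change > threshold


# Example: bandwidth samples drop to zero after a bad deploy.
print(should_alert(recent=[0, 0, 0], baseline=[100, 110, 90]))
# Example: normal fluctuation around the baseline.
print(should_alert(recent=[95, 105], baseline=[100, 110, 90]))
```

A check like this would flag both a sudden drop to zero and a complete absence of samples, which are the two ways a logging outage of this kind could surface in usage metrics.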

We will also be updating our tooling to allow us to recover and replay data in the event that usage logging is disrupted.

Posted Sep 20, 2021 - 09:33 PDT

Resolved
Dashboard analytics are now correctly reflecting usage. Analytics numbers during the degraded performance period may not fully reflect actual usage.
Posted Sep 02, 2021 - 16:37 PDT
Update
New analytics are now populating in Dashboard analytics. We are still working on recovering missing analytics.
Posted Sep 02, 2021 - 14:20 PDT
Identified
The issue has been identified and a fix is being implemented.
Posted Sep 02, 2021 - 10:44 PDT
Investigating
We are currently investigating reports of analytics in user Dashboards being delayed for Sept 2. We will update when we have more information.

The rendering service is not impacted.
Posted Sep 02, 2021 - 10:33 PDT
This incident affected: Web Administration Tools.