Root Cause Analysis
Duration of incident
Discovered: May 14, 2023, 01:35 UTC
Resolved: May 15, 2023, 05:00 UTC
A change to improve consolidation of devices in Auvik disrupted Auvik’s ability to process data, taking device identification processing offline on Cluster US2. The additional processing load took the internal system for this service offline, creating a backlog of up to 22,000 devices that could not be processed until the system was brought back online and the underlying issue remediated.
Timeline (all times UTC)
15:06 - The on-call engineer was alerted to low disk space on Cluster US2 for device identification processing.
00:45 - On-call engineering was alerted to a crash loop in the services that process device identification on Cluster US2.
01:27 - An internal alert was triggered for an excessive backlog in device identification processing on Cluster US2.
01:35 - An incident was declared, and additional engineering resources were called in.
01:35 - 04:00 - Engineering identified the cause of the delayed identification processing on Cluster US2 and created a workaround to drive the identification lag down.
05:00 - Identification lag on Cluster US2 was resolved, and device identification processing on Cluster US2 returned to normal. The incident was marked as resolved.