Service Disruption - Device Discovery/Identification
Incident Report for Auvik Networks Inc.
Postmortem

Service Disruption - Delayed Device Identification on Cluster US2

Root Cause Analysis

Duration of incident

Discovered: May 15, 2023, 01:35 - UTC
Resolved: May 15, 2023, 05:00 - UTC

Cause

An improvement to device consolidation in Auvik added processing load that impaired Auvik’s ability to process data and took device identification processing offline on Cluster US2.

Effect

The additional processing caused the internal system for this service to go offline. As a result, a backlog of up to 22,000 devices could not be processed until the system was brought back online and the underlying issue was remediated.

Action taken

All times UTC

05/14/2023

15:06 - The on-call engineer was alerted to low disk space on Cluster US2 for device identification processing.

05/15/2023

00:45 - On-call engineering was alerted to a crash loop in the services that process device identification on Cluster US2.

01:27 - An internal alert was triggered for an excessive backlog in device identification processing on Cluster US2.

01:35 - An incident was declared, and additional engineering resources were called in.

01:35 - 04:00 - Engineering identified the cause of the delayed identification processing on Cluster US2 and created a workaround to drive identification lag down.

05:00 - Identification lag on Cluster US2 was resolved. Device identification on Cluster US2 is processing as normal. The incident was marked as resolved.
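
To make the gaps between the timeline entries above easier to read, the following minimal sketch computes the time elapsed from the first alert to each later event. The timestamps are copied from the timeline; the short event labels and the function names are illustrative assumptions, not anything taken from Auvik's tooling.

```python
from datetime import datetime, timezone

# Timeline events copied from the report (all times UTC).
# The labels are paraphrases added for this sketch.
TIMELINE = [
    ("low disk space alert", datetime(2023, 5, 14, 15, 6, tzinfo=timezone.utc)),
    ("crash loop alert", datetime(2023, 5, 15, 0, 45, tzinfo=timezone.utc)),
    ("backlog alert", datetime(2023, 5, 15, 1, 27, tzinfo=timezone.utc)),
    ("incident declared", datetime(2023, 5, 15, 1, 35, tzinfo=timezone.utc)),
    ("lag resolved", datetime(2023, 5, 15, 5, 0, tzinfo=timezone.utc)),
]


def elapsed_since_first(timeline):
    """Return (label, timedelta since the first event) for each event."""
    start = timeline[0][1]
    return [(label, when - start) for label, when in timeline]


def format_delta(delta):
    """Render a timedelta as hours and minutes, for example '10h29m'."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h{minutes % 60:02d}m"


for label, delta in elapsed_since_first(TIMELINE):
    print(f"{format_delta(delta):>7}  {label}")
```

Under these assumptions, the incident was declared 10h29m after the first low disk space alert, and identification lag was resolved 13h54m after it.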

Future consideration(s)

  • Auvik will improve the handling of identification processing failures to shorten on-call engineers' response time.
  • Auvik will improve documentation in “run books” (repair code manuals) to shorten incident resolution time.
Posted May 25, 2023 - 10:07 EDT

Resolved
The fix for delayed Device Discovery has been implemented. The source of the disruption has been resolved, and services have been fully restored.

A Root Cause Analysis (RCA) will follow after a full review has been completed.
Posted May 15, 2023 - 01:07 EDT
Monitoring
We’ve identified the source of the service disruption and are monitoring the situation.
Posted May 15, 2023 - 00:12 EDT
Identified
We’ve identified the source of the service disruption with Device Discovery. New device discovery may be delayed on US2. We are working to restore service as quickly as possible.
Posted May 14, 2023 - 23:12 EDT
This incident affected: Network Mgmt (us2.my.auvik.com).
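
The status updates above are stamped in EDT, while the Root Cause Analysis timeline uses UTC. As a rough aid for cross-referencing the two, this sketch converts the posted times to UTC. It assumes EDT is a fixed UTC-4 offset (true for these May dates); the names `EDT`, `POSTS`, and `to_utc` are introduced here for illustration only.

```python
from datetime import datetime, timedelta, timezone

# EDT is UTC-4; a fixed offset avoids needing a time zone database.
EDT = timezone(timedelta(hours=-4), "EDT")

# Status update times as posted (EDT), copied from the updates above.
POSTS = [
    ("Identified", datetime(2023, 5, 14, 23, 12, tzinfo=EDT)),
    ("Monitoring", datetime(2023, 5, 15, 0, 12, tzinfo=EDT)),
    ("Resolved", datetime(2023, 5, 15, 1, 7, tzinfo=EDT)),
]


def to_utc(when):
    """Convert a timezone-aware datetime to UTC."""
    return when.astimezone(timezone.utc)


for status, posted in POSTS:
    print(f"{status:<10} {to_utc(posted):%Y-%m-%d %H:%M} UTC")
```

With that offset, the "Identified" update corresponds to 03:12 UTC on May 15 and the "Resolved" update to 05:07 UTC on May 15, shortly after the 05:00 resolution recorded in the timeline.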