TrafficInsights processing delayed in ca1.my
Incident Report for Auvik Networks Inc.
Postmortem

Service Disruption - TrafficInsights not available on CA1 cluster

Root Cause Analysis

Duration of incident

Discovered: Oct 29, 2021 Time - 16:47 UTC
Resolved: Oct 29, 2021 Time - 21:30 UTC

Cause

A problematic flow from a device using the CA1 cluster for TrafficInsights caused the parser to crash and restart repeatedly.

Effect

TrafficInsights data processing was delayed on the CA1 cluster, resulting in stale information being shown on the dashboard. The service disruption did not result in any data loss. All other services were unaffected.

Action taken

10/29/2021 - All times in UTC

16:42 TrafficInsights is enabled for a new device, and the device sends a problematic flow.
16:57 The Auvik engineering team is made aware of the issue.
17:40 The Auvik engineering team restarts the TrafficInsights service to see whether it will bypass the problematic data causing the crash. The issue is not resolved.
18:00 The Auvik engineering team determines the cause of the issue and begins identifying the offending device.
19:05 Auvik identifies the specific device causing the service interruption.
19:40 Auvik finishes a code change to ignore the device.
20:10 Auvik finishes testing the code change and deploys it to production.
20:16 Auvik scales up data processing to work through the backlog quickly.
20:42 Auvik confirms data is being processed correctly.
21:30 Auvik confirms the backlogged data has been fully processed with no data loss.
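
The mitigation described at 19:40 (ignoring the offending device) could look roughly like the sketch below. This is an illustration only, not Auvik's actual code: the `device_id` field, the `IGNORED_DEVICES` set, and the `filter_flows` function are all hypothetical names.

```python
from typing import AbstractSet, Iterable, Iterator

# Hypothetical denylist of device identifiers whose flows are skipped.
IGNORED_DEVICES = frozenset({"device-123"})


def filter_flows(
    flows: Iterable[dict],
    ignored: AbstractSet[str] = IGNORED_DEVICES,
) -> Iterator[dict]:
    """Yield only the flows whose source device is not on the denylist."""
    for flow in flows:
        # "device_id" is an assumed field name for this illustration.
        if flow.get("device_id") in ignored:
            continue
        yield flow
```

A filter like this is a stopgap: it keeps the pipeline running but drops all data from the listed device until the underlying parsing problem is addressed.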

Future consideration(s)

Auvik will improve its ability to handle unexpected data or behavior gracefully.
Auvik plans to develop internal tooling to identify offending devices more quickly and efficiently.
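
As a minimal sketch of what these two considerations might mean in practice, the example below isolates each flow's parsing so that one malformed record is set aside instead of crashing the whole process, and counts failures per device so the offending device stands out. Everything here is assumed for illustration (the `parse_flow` validation rule, the `device_id` and `bytes` fields, and the `process_batch` helper); it does not describe Auvik's implementation.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def parse_flow(raw: dict) -> dict:
    """Hypothetical parser: raises on malformed input."""
    if not isinstance(raw.get("bytes"), int) or raw["bytes"] < 0:
        raise ValueError(f"invalid byte count: {raw.get('bytes')!r}")
    return {"device_id": raw["device_id"], "bytes": raw["bytes"]}


def process_batch(
    raw_flows: Iterable[object],
    parser: Callable[[dict], dict] = parse_flow,
) -> Tuple[List[dict], List[Tuple[object, str]], Counter]:
    """Parse each flow independently; quarantine failures instead of crashing."""
    parsed: List[dict] = []
    quarantined: List[Tuple[object, str]] = []
    failures: Counter = Counter()  # failure count per device
    for raw in raw_flows:
        try:
            parsed.append(parser(raw))
        except Exception as exc:  # broad on purpose: one bad flow must not stop the batch
            device = raw.get("device_id", "unknown") if isinstance(raw, dict) else "unknown"
            failures[device] += 1
            quarantined.append((raw, str(exc)))
    return parsed, quarantined, failures
```

Calling `failures.most_common()` on the returned counter ranks devices by failure count, which is one simple way to surface an offending device without manual log inspection.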

Posted Nov 24, 2021 - 14:17 EST

Resolved
The backlog in ca1 has been processed and the service is operating normally. TrafficInsights dashboards should be up to date.
Posted Oct 29, 2021 - 17:27 EDT
Monitoring
A fix has been implemented. TrafficInsights has resumed processing data. We are monitoring its progress through the backlog.
Posted Oct 29, 2021 - 16:41 EDT
Update
The fix has been developed; we are deploying it to production now. We expect to resume processing data shortly.
Posted Oct 29, 2021 - 16:30 EDT
Identified
We have identified the issue and are developing a fix. If the fix is successful, we hope to resume processing data in the next hour.
Posted Oct 29, 2021 - 15:32 EDT
Investigating
Processing of TrafficInsights data has been delayed in ca1.my.auvik.com since 12:45pm Eastern Time. We are investigating.
Posted Oct 29, 2021 - 14:35 EDT
This incident affected: Auvik TrafficInsights.