TrafficInsights processing delayed in US1, US3, US4
Incident Report for Auvik Networks Inc.
Postmortem

This document contains two Root Cause Analyses (RCAs).

The first one is for the data delays in LNX, US1, US3, and US4.

The second is for the data delay and data loss on US4.

Service Disruption - Traffic Insights (TI) Data Delay LNX, US1, US3, and US4 Clusters

RCA

Duration of incident
Discovered: Apr 18, 2023 11:54 - UTC
Resolved: April 20, 2023 02:00 - UTC

Cause
A failure occurred with data processing for Traffic Insights (TI) data on Auvik clusters LNX, US1, US3 and US4.

Effect
There was a delay in processing TrafficInsights data while we intervened to restore the service. Significant amounts of data resulted in a longer restore time than anticipated.

Action taken
All times in UTC

04/18/2023

11:54 - Internal alerting notifies Auvik Engineering of excess TI data lag on the LNX cluster.

12:19 - Engineering begins troubleshooting issues with TI data lag in LNX.

12:35 - Engineering restarts TI data flow on LNX. Validates that data appears to be catching up as expected.

14:04 - 14:08 - Another internal alert notifies Engineering about excess TI data lag on LNX, US1, US3 and US4.

14:27 - Engineering begins troubleshooting issues with TI data lag.

15:58 - Auvik posts that Auvik is experiencing an Incident with TI data flow.

16:00 - 23:59 - Engineering resets the data flow, then observes until it starts to catch up to the current TI data flow on the affected clusters.

04/19/2023

00:00 - 03:00 - Engineering continues to watch the delayed TI data catch up to the current TI data flow in clusters LNX, US1, US3 and US4.

03:04 - TI data in the LNX cluster catches up to current TI data. All data lag on LNX has stopped.

03:05 - The remaining TI data lag in US1, US3 and US4 is not shrinking fast enough for those clusters to become current in a timely manner.

03:10 - The data offset is altered on the restore points of the remaining TI data flows on US1, US3 and US4 clusters to remove redundant data from the restore process.

05:00 - 08:00 - Engineering adds resources to the data flow restore process so that the older TI data can be added without overextending resources on each cluster.

08:34 - All TI data lag in US3 is consumed. All TI data is current.

10:00 - All TI data lag in US1 is consumed. All TI data is current.

10:00 - TI data lag on the US4 cluster remains considerable. The remainder of the incident is recorded under Service Disruption - Traffic Insights (TI) Data Delay and Data Loss on Cluster US4.

Future consideration(s)

  • Auvik will upgrade the technology that supports high availability for TrafficInsights data processing
  • Auvik has improved its documentation on how to recover from data processing failures with TrafficInsights
  • Auvik has improved its monitoring of high availability systems for TrafficInsights
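
The 03:05 observation in the timeline (lag not shrinking fast enough) comes down to simple arithmetic: a backlog only drains when the processing rate exceeds the ingest rate. The following is a minimal sketch of that reasoning, not Auvik's actual monitoring. The function names, the rates, and the 30-minute alert threshold are all hypothetical.

```python
from typing import Optional


def estimate_catchup_hours(backlog_records: float,
                           ingest_per_hour: float,
                           process_per_hour: float) -> Optional[float]:
    """Estimate hours until a backlog is drained (hypothetical helper).

    Returns None when processing is not faster than ingest, because
    the backlog would then never shrink.
    """
    net_drain_per_hour = process_per_hour - ingest_per_hour
    if net_drain_per_hour <= 0:
        return None
    return backlog_records / net_drain_per_hour


def lag_is_excessive(lag_minutes: float,
                     threshold_minutes: float = 30.0) -> bool:
    """Flag lag above an assumed alert threshold (30 minutes is illustrative)."""
    return lag_minutes > threshold_minutes


if __name__ == "__main__":
    # Illustrative numbers only: 1,000 queued records, 100 arriving per
    # hour, 300 processed per hour, so the net drain is 200 per hour.
    print(estimate_catchup_hours(1000, 100, 300))
    print(lag_is_excessive(90))
```

Adding processing resources raises `process_per_hour`, and skipping redundant data lowers `backlog_records`; either change shortens the estimate.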

**********************************************************************************************************************************

Service Disruption - Traffic Insights (TI) Data Delay and Data Loss on Cluster US4

RCA

Duration of incident
Discovered: Apr 18, 2023, 14:05 - UTC
Resolved: Apr 26, 2023, 13:06 - UTC
Cause
Traffic Insights (TI) data processing failure occurred on US4 cluster.
Effect
Despite automatic and manual remediation, TrafficInsights data processing continued to fail and fall behind on the US4 cluster. To restore service, processing was rolled back to a point two days prior. Most of the data was queued for processing; however, data from 22:00 UTC (6:00 PM EDT) 04/20/2023 to 10:00 UTC (6:00 AM EDT) the following day was lost and remains irretrievable.
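
The loss window is quoted in both UTC and EDT, so the conversion is worth checking. The short sketch below redoes that arithmetic with a fixed UTC-4 offset for EDT, which applies on these April dates. The variable and function names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# EDT is UTC-4; a fixed offset is enough for these April 2023 dates.
EDT = timezone(timedelta(hours=-4), "EDT")

# Loss window as stated in the report, in UTC.
loss_start_utc = datetime(2023, 4, 20, 22, 0, tzinfo=timezone.utc)
loss_end_utc = datetime(2023, 4, 21, 10, 0, tzinfo=timezone.utc)


def to_edt(ts: datetime) -> datetime:
    """Convert a timezone-aware timestamp to EDT."""
    return ts.astimezone(EDT)


if __name__ == "__main__":
    fmt = "%Y-%m-%d %H:%M %Z"
    print(to_edt(loss_start_utc).strftime(fmt))  # 2023-04-20 18:00 EDT
    print(to_edt(loss_end_utc).strftime(fmt))    # 2023-04-21 06:00 EDT
    print(loss_end_utc - loss_start_utc)         # 12:00:00
```

The window runs from 6:00 PM EDT on April 20 to 6:00 AM EDT on April 21, a span of 12 hours.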

Action taken
04/18/2023 - 04/20/2023

11:54 - 10:00 - See Service Disruption - Traffic Insights (TI) Data Delay LNX, US1, US3 and US4 Clusters for information.

04/20/2023

10:00 - Data flow for TI fails again for US4 cluster.

10:05 - Data flow resumes for TI data on US4 cluster.

11:45 - It is decided to alternate the current TI data flow with the insertion of older TI data to complete the TI data catchup.

22:00 - Processing of current TI data on the US4 cluster is placed on hold. Older TI data is processed into Auvik.

23:30 - The flow of TI data is impacted by disk performance and slows to the point where data insertion fails. The cause of the slowdown is unknown at this time.

04/21/2023

02:00 - The cause of the disk performance failure is identified and steps are taken to mitigate the issue.

06:00 - The TI data inserter again fails and TI data processing is automatically set back to an earlier time.

10:00 - Processing of the old data is discontinued to allow the current data flow lag to catch up to actual time. Engineering decides to regroup on how better to process the interrupted data flow. Plans are made to resume the restart efforts on Monday, after using the weekend to validate the stability of the TI services on the US4 cluster.

04/21/2023 - 04/24/2023

10:00 - 12:00 - The current TI data flow on US4 is monitored over the weekend and engineering meets to see how best to reinsert data from the 12 hours lost in the UI from alternating the old and current TI data flows.

04/24/2023

12:00 - It is discovered that the data retention flag for the data stored outside of the Auvik platform from 22:00 UTC (6:00 PM EDT) 04/20/2023 to 10:00 UTC (6:00 AM EDT) the following day was not set to be extended.

12:00 - 2:00 - All efforts to recover the overwritten TI data fail. The TI data was found to be overwritten and irretrievable by Auvik.
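
The retention finding above can be expressed as a simple check: buffered data is at risk once its age approaches the retention window of the store that holds it. The sketch below is a hypothetical illustration of that check. The 72-hour retention value and the one-hour safety margin are assumptions for the example, not Auvik's actual settings.

```python
from datetime import datetime, timedelta, timezone


def at_risk_of_overwrite(oldest_unprocessed: datetime,
                         now: datetime,
                         retention: timedelta,
                         safety_margin: timedelta = timedelta(hours=1)) -> bool:
    """Return True when the oldest unprocessed data is within
    `safety_margin` of ageing out of the retention window."""
    age = now - oldest_unprocessed
    return age + safety_margin >= retention


if __name__ == "__main__":
    # Hypothetical retention window; the report does not state the real value.
    retention = timedelta(hours=72)
    oldest = datetime(2023, 4, 20, 22, 0, tzinfo=timezone.utc)

    # 12 hours after the window opened: still well inside retention.
    print(at_risk_of_overwrite(
        oldest, datetime(2023, 4, 21, 10, 0, tzinfo=timezone.utc), retention))

    # By 04/24 12:00 UTC the data is 86 hours old, past a 72-hour window.
    print(at_risk_of_overwrite(
        oldest, datetime(2023, 4, 24, 12, 0, tzinfo=timezone.utc), retention))
```

Running a check like this whenever processing is paused would show when retention needs to be extended before data is overwritten.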

04/24/2023 - 04/26/2023

3:00 - Auvik stakeholders meet internally and formulate a plan to notify affected customers of the TI data loss and proceed to follow up with communication.

04/26/2023

13:06 - Auvik officially closes the incident on Auvik’s status page.

Future consideration(s)

  • Auvik has improved the existing monitoring of its database and created additional monitoring for TrafficInsights.
  • Auvik has improved its monitoring of the underlying services for Traffic Insights.
  • Auvik will improve its storage disk types for TrafficInsights to improve performance.
  • Auvik will investigate better distribution of client tenants across Auvik clusters.
  • Auvik is reviewing on-call procedures to address deficiencies under long-running incidents.
Posted May 15, 2023 - 09:32 EDT

Resolved
After further investigation, we have determined that a portion of the Traffic Insights (TI) data for the US4 cluster has become irretrievable. The TI data loss was limited to the period between 6 PM EDT Thursday, April 20, and 6 AM EDT Friday, April 21. While impactful, this loss was limited to a 12-hour time frame that took place outside of typical business hours. This issue only affected accounts that had TI turned on at the time of the outage and had no other operational impact on the core functions of Auvik deployments. Please contact your Customer Success Manager if your account was affected and you would like additional detail on the outage.

A full Root Cause Analysis (RCA) will be completed and made available in the coming weeks.
Posted Apr 26, 2023 - 09:30 EDT
Update
We have identified the underlying root cause of the instability that has caused the delay in the results of processing data in TrafficInsights (TI) in US4. We continue to work through the issue to bring all TI flow data to customers. Current TI flow data is available in the UI. We appreciate your patience as we continue to work through this issue. We will keep you posted on a resolution. The next update will be by April 25th, 18:00 UTC.
Posted Apr 24, 2023 - 13:57 EDT
Update
We have identified the underlying root cause of the instability that has caused the delay in the results of processing data in TrafficInsights (TI) in US4. We continue to work through the issue to bring all TI flow data to customers. Current TI flow data is available in the UI. We appreciate your patience as we continue to work through this issue. We will keep you posted on a resolution. The next update will be by 18:00 UTC.
Posted Apr 24, 2023 - 06:34 EDT
Update
We have identified the underlying root cause of the instability that has caused the delay in the results of processing data in TrafficInsights (TI) in US4. We continue to work through the issue to bring all TI flow data to customers. Current TI flow data is available in the UI. We appreciate your patience as we continue to work through this issue. We will keep you posted on a resolution. The next update is scheduled for Monday, April 24, 2023, at 14:00 UTC.
Posted Apr 21, 2023 - 11:22 EDT
Update
We have identified the underlying root cause of the instability that has caused the delay in the results of processing data in TrafficInsights (TI) in US4. We continue to work through the issue to bring all TI flow data to customers. Current TI flow data is available in the UI at this time. We appreciate your patience as we continue to work through this issue. We will keep you posted on a resolution. The next update will be by 16:00 UTC.
Posted Apr 21, 2023 - 08:01 EDT
Update
We have identified the underlying root cause of the instability that has caused the delay in the results of processing data in TrafficInsights in US4. We continue to work through the issue to bring data current for all customers. Some customers may already see current data, but these results are still inconsistent across the cluster for all customers. We still anticipate having full functionality restored to the TI setup at some point tomorrow afternoon. We appreciate your patience as we continue to work through this large delay in full functionality. We will keep you posted on a resolution. The next update will be tomorrow morning.
Posted Apr 20, 2023 - 15:56 EDT
Update
We have identified the underlying root cause of the instability that has caused the delay in the results of processing data in TrafficInsights in US4. We continue to work through the issue to bring data current for all customers. Some customers may already see current data, but these results are still inconsistent across the cluster for all customers. We anticipate having full functionality restored at some point tomorrow afternoon. We appreciate your patience as we continue to work through this delay. We will keep you posted on a resolution. Next update by 21:00 UTC.
Posted Apr 20, 2023 - 09:28 EDT
Update
We have identified the underlying root cause of the instability that has caused the delay in the results of processing data in TrafficInsights in US4. We are taking action to address the issue. Some customers may already see up to date data, but it will be inconsistent across the cluster. We continue to monitor and work toward a successful conclusion. Overnight progress was good and continues into this morning. We appreciate your patience as we continue to work through this delay. We will keep you posted on a resolution. Next update by 14:00 UTC
Posted Apr 20, 2023 - 06:14 EDT
Update
We have identified the underlying root cause of the instability that has caused the delay in the results of processing data in TrafficInsights in US4. We are taking action to address the issue. Some customers may already see up to date data, but it will be inconsistent across the cluster. We continue to monitor and work toward a successful conclusion overnight. We expect our next update tomorrow morning based on progress overnight. We appreciate your patience as we continue to work through this delay. We will keep you posted on a resolution.
Posted Apr 19, 2023 - 20:21 EDT
Update
We have identified the underlying root cause of the instability that has caused the delay in the results of processing data in TrafficInsights in US4. We are taking action to address the issue. Some customers may already see up to date data, but it will be inconsistent across the cluster. We continue to monitor the situation and work toward a successful conclusion. We appreciate your patience as we continue to work through this delay. We will keep you posted on a resolution.
Posted Apr 19, 2023 - 16:15 EDT
Update
We are continuing to monitor the situation as we work to resolve the delayed results in TrafficInsights in US4. Some customers may already see up to date data, but we continue to work to restore data across all tenants. We appreciate your patience as we continue to work through this delay. We will post another update by 21:00 UTC
Posted Apr 19, 2023 - 12:59 EDT
Update
We are continuing to monitor the situation as we work to resolve the delayed results in TrafficInsights in US4. Some customers may already see up to date data, but we continue to work to restore data across all tenants. We appreciate your patience as we continue to work through this delay. We will post another update by 17:00 UTC
Posted Apr 19, 2023 - 10:09 EDT
Update
We are continuing to monitor the situation as we work to resolve the delayed results in TrafficInsights in US4. Some customers may already see up to date data, but we continue to work to restore data across all tenants. We appreciate your patience as we continue to work through this delay. We will keep you posted on a resolution.
Posted Apr 19, 2023 - 07:05 EDT
Update
We are continuing to monitor the situation as we resolve the delayed results in TrafficInsights in US4. Some customers may already see up to date data but we continue to work to restore data across all tenants.
Posted Apr 19, 2023 - 03:21 EDT
Update
We are continuing to monitor the situation as we resolve the delayed results in TrafficInsights in US4. Some customers may already see up to date data but we continue to work to restore data across all tenants.
Posted Apr 18, 2023 - 23:25 EDT
Update
Issues with US1 and US3 are resolved. We are continuing to monitor US4 as it recovers.
Posted Apr 18, 2023 - 18:20 EDT
Update
Issues with TrafficInsights in US3 are now resolved. We expect that issues in US1 will be resolved within 1 hr. US4 is continuing to improve but we are still refining our estimate for when it will be fully functioning. We'll continue to keep you posted as this progresses.
Posted Apr 18, 2023 - 17:03 EDT
Update
We’ve identified the source of the service disruption with TrafficInsights and continue to monitor the situation. We are now starting to see data process in US4 and expect further improvements in US1 & US3 shortly. We expect to return to fully functioning across all clusters by end of day. We’ll keep you posted on a resolution.
Posted Apr 18, 2023 - 13:42 EDT
Monitoring
Starting at 10:04 AM ET, a delay in processing TrafficInsights data occurred in US1, US3, and US4. No data has been lost, but results will be delayed by about 90 minutes. A fix has already been deployed, but it will take some time to process the backlog of messages. We are monitoring results as the system recovers.
Posted Apr 18, 2023 - 11:58 EDT
This incident affected: Auvik TrafficInsights.