Duration of incident
Discovered: Apr 18, 2023 11:54 - UTC
Resolved: Apr 20, 2023 02:00 - UTC
Cause
A failure occurred with data processing for Traffic Insights (TI) data on Auvik clusters LNX, US1, US3 and US4.
Effect
There was a delay in processing Traffic Insights data while we intervened to restore the service. The large volume of data resulted in a longer restore time than anticipated.
Action taken
All times in UTC
04/18/2023
11:54 - Internal alerting notifies Auvik Engineering of excess TI data lag on the LNX cluster.
12:19 - Engineering begins troubleshooting issues with TI data lag in LNX.
12:35 - Engineering restarts TI data flow on LNX. Validates that data appears to be catching up as expected.
14:04 - 14:08 - Another internal alert notifies Engineering about excess TI data lag on LNX, US1, US3 and US4.
14:27 - Engineering begins troubleshooting issues with TI data lag.
15:58 - Auvik posts that it is experiencing an incident with TI data flow.
16:00 - 23:59 - Engineering resets the data flow, then waits and observes until it starts to catch up to the current TI data flow on the affected clusters.
04/19/2023
00:00 - 03:00 - Engineering continues to watch the TI data flow catch up to the current TI data flow on clusters LNX, US1, US3 and US4.
03:04 - TI data in the LNX cluster catches up to current TI data. All data lag on LNX has stopped.
03:05 - The remaining TI data lag in US1, US3 and US4 does not appear to be shrinking fast enough to become current in a timely manner.
03:10 - The data offset is altered on the restore points of the remaining TI data flows on the US1, US3 and US4 clusters to remove redundant data from the restore process.
05:00 - 08:00 - Engineering adds resources to the data flow restore process to allow the older TI data to be added without overextending resources on each cluster.
08:34 - All TI data lag in US3 is consumed. All TI data is current.
10:00 - All TI data lag in US1 is consumed. All TI data is current.
10:00 - TI data lag on the US4 cluster remains considerable. The remainder of the incident is recorded under Service Disruption - Traffic Insights (TI) Data Delay and Loss Cluster US4.
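The 03:05 to 08:00 entries above describe a backlog that was not shrinking fast enough, followed by two mitigations: trimming redundant data and adding processing resources. The sketch below is only an illustration of that reasoning, not Auvik's tooling. The function name and every number in it are hypothetical. It shows that lag shrinks only when the processing rate exceeds the incoming rate, and that catch-up time falls when the backlog is reduced or the processing rate is raised.

```python
from typing import Optional


def estimate_catch_up_hours(backlog_records: float,
                            incoming_per_sec: float,
                            processed_per_sec: float) -> Optional[float]:
    """Estimate hours until a backlog is fully consumed.

    Returns None when processing does not outpace incoming data,
    because in that case the lag never shrinks.
    """
    net_rate = processed_per_sec - incoming_per_sec
    if net_rate <= 0:
        return None
    return backlog_records / net_rate / 3600


# Hypothetical figures, chosen only to illustrate the effect of each step.
baseline = estimate_catch_up_hours(3_600_000, 1000, 1100)         # slow catch-up
after_trim = estimate_catch_up_hours(1_800_000, 1000, 1100)       # smaller backlog
after_resources = estimate_catch_up_hours(1_800_000, 1000, 1500)  # faster processing

print(baseline, after_trim, after_resources)
```

With these made-up inputs, halving the backlog halves the estimate, and raising the processing rate shortens it further.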
Future consideration(s)
**********************************************************************************************************************************
Duration of incident
Discovered: Apr 18, 2023 14:05 - UTC
Resolved: Apr 26, 2023 13:06 - UTC
Cause
A Traffic Insights (TI) data processing failure occurred on the US4 cluster.
Effect
Despite automatic and manual remediation, Traffic Insights data processing continued to fail and fall behind on the US4 cluster. To restore services, a rollback to two days prior was performed. Most of the data was queued for processing; however, data from 22:00 UTC (6:00 PM EDT) on 04/20/2023 to 10:00 UTC (6:00 AM EDT) the following day was lost and remains irretrievable.
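The affected window can be checked with a short calculation. The sketch below is illustrative only. It assumes EDT is a fixed UTC-4 offset and uses the window boundaries stated above, converting both to EDT and computing the length of the gap.

```python
from datetime import datetime, timedelta, timezone

# Assumption: EDT is modeled as a fixed UTC-4 offset.
EDT = timezone(timedelta(hours=-4), "EDT")

# Window boundaries as stated in this report, in UTC.
loss_start = datetime(2023, 4, 20, 22, 0, tzinfo=timezone.utc)
loss_end = datetime(2023, 4, 21, 10, 0, tzinfo=timezone.utc)


def describe(ts: datetime) -> str:
    """Format a UTC timestamp alongside its EDT equivalent."""
    local = ts.astimezone(EDT)
    return f"{ts:%m/%d/%Y %H:%M} UTC = {local:%m/%d/%Y %I:%M %p} EDT"


# Length of the lost-data window in hours.
window_hours = (loss_end - loss_start).total_seconds() / 3600

print(describe(loss_start))
print(describe(loss_end))
print(f"Window length: {window_hours:.0f} hours")
```

The result matches the 12 hours of missing data referenced in the timeline.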
Action taken
04/18/2023 - 04/20/2023
11:54 - 10:00 - See Service Disruption - Traffic Insights (TI) Data Delay LNX, US1, US3 and US4 Clusters for information.
04/20/2023
10:00 - Data flow for TI fails again for the US4 cluster.
10:05 - Data flow resumes for TI data on the US4 cluster.
11:45 - Engineering decides to alternate the current TI data flow with the insertion of older TI data to complete the TI data catch-up.
22:00 - Processing of current TI data on the US4 cluster is placed on hold. Older TI data is processed into Auvik.
23:30 - Data flow of TI data is impacted by disk performance and slows to the point where data insertion fails. The cause of the slowdown is unknown at this point.
04/21/2023
02:00 - The cause of the disk performance failure is identified and steps are taken to mitigate the issue.
06:00 - The TI data inserter again fails and TI data processing is automatically set back to an earlier time.
10:00 - Processing of the old data is discontinued to allow the current data flow lag to catch up to actual time. Engineering decides to regroup on how to better process the interrupted data flow. Plans are made to restart the effort on Monday, after a weekend regroup to validate the stability of the TI services on the US4 cluster.
04/21/2023 - 04/24/2023
10:00 - 12:00 - The current TI data flow on US4 is monitored over the weekend, and Engineering meets to determine how best to reinsert the 12 hours of data missing from the UI as a result of alternating the old and current TI data flows.
04/24/2023
12:00 - It is discovered that the data retention flag for the data stored outside of the Auvik platform from 22:00 UTC (6:00 PM EDT) on 04/20/2023 to 10:00 UTC (6:00 AM EDT) on 04/21/2023 was not set to be extended.
12:00 - 2:00 - All efforts to recover the overwritten TI data fail. The TI data is found to be overwritten and irretrievable by Auvik.
04/24/2023 - 04/26/2023
3:00 - Auvik stakeholders meet internally and formulate a plan to notify affected customers of the TI data loss and proceed to follow up with communication.
04/26/2023
13:06 - Auvik officially closes the incident on Auvik’s status page.
Future consideration(s)