Discovered: Feb 8, 2023, 23:05 UTC
Resolved: Feb 13, 2023, 21:40 UTC
During a routine database cleanup, a row in a table was unintentionally altered in a way that triggered a reload of the entire table instead of an update to the specified row. This change placed an unexpectedly high load on the system resources provisioned for processing new devices.
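The report does not include the cleanup statement itself, but the failure mode it describes, a change intended for one row instead triggering work across the whole table, is the classic risk of an unguarded write. As a purely illustrative sketch (the table and column names below are hypothetical, and sqlite3 stands in for the production database), a cleanup script can constrain the write and verify the affected row count before committing:

```python
import sqlite3

# Hypothetical schema and data; the real table and change are not named
# in the report. sqlite3 is used only so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE device_config (id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO device_config (id, state) VALUES (?, ?)",
    [(1, "stale"), (2, "active"), (3, "active")],
)

def cleanup_row(conn, row_id):
    """Update exactly one row; roll back if the predicate matched anything
    other than a single row (e.g. a missing or wrong WHERE clause)."""
    cur = conn.execute(
        "UPDATE device_config SET state = 'archived' WHERE id = ?",
        (row_id,),
    )
    if cur.rowcount != 1:
        conn.rollback()
        raise RuntimeError(f"expected 1 row, matched {cur.rowcount}; aborting")
    conn.commit()

cleanup_row(conn, 1)  # touches only row 1; a table-wide write would abort
```

A guard like this would not have prevented the downstream load by itself, but it turns an accidental table-wide write into a hard, visible failure rather than a silent one.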
The excessive load slowed the processing of new devices, delaying customers' ability to map and monitor those devices. For most affected clients the delay was up to a few hours, but it extended to several days in the most extreme cases. All existing devices were unaffected and were monitored normally throughout the incident.
All times in UTC
02/08/2023
23:05 - Engineering is alerted to a possible SNMP poller issue affecting clients and begins an investigation.
02/09/2023
01:43 - An incident is declared, and additional engineering resources are called in to assist.
02:03 - Engineering begins to identify the issue and its extent and commences the root-cause investigation. A few affected resources are scaled up to improve data flow.
02:54 - Impact to the hierarchy service (parent-to-child site relationships) is first noticed and investigated.
03:02 - Engineering concludes that the most efficient and safe way to work through the additional data processing without risking data loss is to strategically increase resources so the system can process the backlog.
03:20 - All lag appears to have recovered except on four clusters: AU1, US1, US3, and US4.
05:15 - The lag in AU1 has caught up, and data is now flowing normally. The US1 and US3 lag is holding steady, while the US4 lag is recovering.
09:25 - Cluster US4 has recovered.
10:25 - Cluster US3 has recovered. US1 is recovering but slowly.
02/09/2023 - 02/13/2023
10:25 - 13:00 - Engineering continues work on the US1 data delay, making tweaks and adjustments so the backlogged data can process while new data continues to flow into the system. Most tenants on US1 show steady improvement, but several individual sites do not, so progress across all US1 tenants is not simultaneous. The work continues through the scheduled maintenance window, which does not affect the data recovery timeline.
02/13/2023
13:00 - The delay on US1 has leveled out and no longer appears to be improving. Only three of the original 24 partitions still need to catch up. The decision is made to upgrade the disks processing the data, increasing capacity so the remaining delay can recover completely.
14:00 - 17:00 - The disks are upgraded and placed into the production environment.
17:10 - Lag begins to drop again across the remaining three partitions.
21:40 - The remaining data processing lag has recovered, and systems are processing normally. Auvik's status page is updated to reflect the resolution.
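Throughout the timeline, recovery is measured as per-partition lag (three of the original 24 partitions remaining at the end), which suggests a partitioned log or queue sitting between ingestion and processing. The report does not name the technology, but as a hedged sketch of the kind of check involved, per-partition lag is simply the gap between the newest offset written and the newest offset processed:

```python
# Hypothetical per-partition offsets; the report does not name the queueing
# technology, only that 24 partitions were working through the backlog.
latest_produced = {0: 10_500, 1: 9_800, 2: 11_200}  # newest offset written
latest_consumed = {0: 10_500, 1: 7_100, 2: 11_195}  # newest offset processed

def partition_lag(produced, consumed):
    """Lag per partition: offsets written but not yet processed."""
    return {p: produced[p] - consumed.get(p, 0) for p in produced}

lag = partition_lag(latest_produced, latest_consumed)
behind = {p: n for p, n in lag.items() if n > 0}
print(behind)  # {1: 2700, 2: 5} -> partitions still catching up
```

Because partitions drain independently, a view like this explains why recovery was uneven: most partitions caught up quickly while a few stragglers held the overall delay steady.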
Future consideration(s)
● Standard code review was not followed before the change was introduced to the system. Auvik will review and better enforce in-place standards with engineering team members.
● Auvik will enforce standard waiting periods for code deployments.
● Auvik will review current alerting to identify improvements and evaluate whether circuit breakers should be employed (a sketch of the pattern follows this list).
● Auvik will review the affected systems and services to remove the bottlenecks identified in this incident and prevent the kind of prolonged recovery window experienced here.
● Auvik will thoroughly review current hardware to identify updates or upgrades that would positively affect performance.
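The report does not describe what a circuit breaker would look like in Auvik's pipeline, but the general pattern is well established: after repeated failures or sustained overload, stop sending work downstream for a cooldown period instead of letting a backlog build. A minimal, self-contained sketch of that pattern (all names and thresholds hypothetical):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after `threshold` consecutive
    failures, reject calls while open, retry after `cooldown` seconds."""

    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open; shedding load")
            self.opened_at = None  # cooldown elapsed: allow a trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit fully
        return result
```

Wrapped around a write path like the one implicated here, a breaker converts sustained overload into fast, visible failures that alerting can act on, rather than a multi-day processing backlog.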