Discovered: Feb 8, 2023, 23:05 UTC
Resolved: Feb 13, 2023, 21:40 UTC
During a routine database cleanup, a row in a table was unintentionally altered in a way that triggered a reload of the entire table instead of an update to the specified row. This change placed an unexpectedly high load on the system resources provisioned for processing new devices.
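The report does not include the cleanup statement itself, but the failure mode it describes, a change intended for one row instead triggering work across the whole table, is the classic risk of an unguarded write. As a purely illustrative sketch (the table and column names below are hypothetical, and sqlite3 stands in for the production database), a cleanup script can constrain the write and verify the affected row count before committing:

```python
import sqlite3

# Hypothetical schema and data; the real table and change are not named
# in the report. sqlite3 is used only so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE device_config (id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO device_config (id, state) VALUES (?, ?)",
    [(1, "stale"), (2, "active"), (3, "active")],
)

def cleanup_row(conn, row_id):
    """Update exactly one row; roll back if the predicate matched anything
    other than a single row (e.g. a missing or wrong WHERE clause)."""
    cur = conn.execute(
        "UPDATE device_config SET state = 'archived' WHERE id = ?",
        (row_id,),
    )
    if cur.rowcount != 1:
        conn.rollback()
        raise RuntimeError(f"expected 1 row, matched {cur.rowcount}; aborting")
    conn.commit()

cleanup_row(conn, 1)  # touches only row 1; a table-wide write would abort
```

A guard like this would not have prevented the downstream load by itself, but it turns an accidental table-wide write into a hard, visible failure rather than a silent one.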
The excessive load slowed the processing of new devices, delaying customers' ability to map and monitor those devices. For most affected clients the delay was up to a few hours, but it extended to several days in the most extreme cases. All existing devices were unaffected and were monitored normally throughout the incident.
All times in UTC
02/08/2023
23:05 - Engineering is alerted to a possible SNMP poller issue affecting clients and begins an investigation.
02/09/2023
01:43 - An incident is declared, and additional engineering resources are called in to assist.
02:03 - Engineering begins to identify the issue and its extent and commences the root-cause investigation. A few affected resources are scaled up to improve data flow.
02:54 - Impact to the hierarchy service (parent-to-child site relationships) is first noticed and investigated.
03:02 - Engineering concludes that the most efficient and safe way to work through the additional data processing without risking data loss is to strategically increase resources so the system can process the backlog.
03:20 - All lag appears to have recovered except on four clusters: AU1, US1, US3, and US4.
05:15 - The lag in AU1 has caught up, and data is now flowing normally. The US1 and US3 lag is holding steady, while the US4 lag is recovering.
09:25 - Cluster US4 has recovered.
10:25 - Cluster US3 has recovered. US1 is recovering but slowly.
02/09/2023 - 02/13/2023
10:25 - 13:00 - Engineering continues work on the US1 data delay, making tweaks and adjustments so the backlogged data can process while new data continues to flow into the system. Most tenants on US1 show steady improvement, but several individual sites do not, so progress across all US1 tenants is not simultaneous. The work continues through the scheduled maintenance window, which does not affect the data recovery timeline.
02/13/2023
13:00 - The delay on US1 has leveled out and no longer appears to be improving. Only three of the original 24 partitions still need to catch up. The decision is made to upgrade the disks processing the data, increasing capacity so the remaining delay can recover completely.
14:00 - 17:00 - The disks are upgraded and placed into the production environment.
17:10 - Lag begins to drop again across the remaining three partitions.
21:40 - The remaining data processing lag has recovered, and systems are processing normally. Auvik's status page is updated to reflect the resolution.
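Throughout the timeline, recovery is measured as per-partition lag (three of the original 24 partitions remaining at the end), which suggests a partitioned log or queue sitting between ingestion and processing. The report does not name the technology, but as a hedged sketch of the kind of check involved, per-partition lag is simply the gap between the newest offset written and the newest offset processed:

```python
# Hypothetical per-partition offsets; the report does not name the queueing
# technology, only that 24 partitions were working through the backlog.
latest_produced = {0: 10_500, 1: 9_800, 2: 11_200}  # newest offset written
latest_consumed = {0: 10_500, 1: 7_100, 2: 11_195}  # newest offset processed

def partition_lag(produced, consumed):
    """Lag per partition: offsets written but not yet processed."""
    return {p: produced[p] - consumed.get(p, 0) for p in produced}

lag = partition_lag(latest_produced, latest_consumed)
behind = {p: n for p, n in lag.items() if n > 0}
print(behind)  # {1: 2700, 2: 5} -> partitions still catching up
```

Because partitions drain independently, a view like this explains why recovery was uneven: most partitions caught up quickly while a few stragglers held the overall delay steady.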
Future consideration(s)
● Standard code review was not followed before the change was introduced to the system. Auvik will review and better enforce in-place standards with engineering team members.
● Auvik will enforce standard waiting periods for code deployments.
● Auvik will review current alerting to identify improvements and evaluate whether circuit breakers should be employed (a sketch of the pattern follows this list).
● Auvik will review the affected systems and services to remove the bottlenecks identified in this incident and prevent the kind of prolonged recovery window experienced here.
● Auvik will thoroughly review current hardware to identify updates or upgrades that would positively affect performance.
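The report does not describe what a circuit breaker would look like in Auvik's pipeline, but the general pattern is well established: after repeated failures or sustained overload, stop sending work downstream for a cooldown period instead of letting a backlog build. A minimal, self-contained sketch of that pattern (all names and thresholds hypothetical):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after `threshold` consecutive
    failures, reject calls while open, retry after `cooldown` seconds."""

    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open; shedding load")
            self.opened_at = None  # cooldown elapsed: allow a trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit fully
        return result
```

Wrapped around a write path like the one implicated here, a breaker converts sustained overload into fast, visible failures that alerting can act on, rather than a multi-day processing backlog.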