Service disruption
Incident Report for Auvik Networks Inc.
Postmortem

Data Missing from Select Tenants - US2 Cluster

Root Cause Analysis

Duration of incident

Discovered: Sep 23, 2021 16:20 UTC

Resolved: Sep 23, 2021 18:00 UTC

Cause

The Core Data Pump (CDP) stopped producing messages, which caused permission checks to fail and made data unavailable.

Effect

Some MSP partners reported that global device views could not be displayed; inherited child views were unaffected.

In affected views, the UI displayed the message “No Data Available. Please ensure you are authorized to view this type of site data.”

Action taken

09-23-2021

16:20 UTC - Support notified Auvik Engineering that clients were reporting issues with global views.

16:26 UTC - Auvik Engineering detected browser errors related to permissions that prevented resources from being viewed.

16:43 UTC - Restarted the permissions service in us2 to remediate, but this did not resolve the core problem.

17:00 UTC - Engineering identified that the issue was due to missing TenantStatus entries for certain sites in the permission cache, and investigated further why values were missing for some tenants.

17:25 UTC - Incident process started. Updated Auvik status page.

17:35 UTC - Identified that messages were missing, which was the root cause of the permission cache lacking data for the specific sites.

17:37 UTC - Engineering restarted the CDP for the TenantStatus topic to remediate the issue.

17:55 UTC - Monitored logs to ensure the CDP restarted successfully and confirmed device data in the UI for the affected sites.

18:00 UTC - Cleaned up tables from the CDP restart and updated the StatusPage to ‘Operational’.
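
The 17:00 and 17:35 findings come down to comparing the sites that should have a TenantStatus against what the permission cache actually holds. The following is a minimal sketch of such a check, not Auvik's implementation: the function name, the dictionary-shaped cache, and the "tenant_status" key are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List


def find_sites_missing_status(
    expected_site_ids: Iterable[str],
    permission_cache: Dict[str, dict],
) -> List[str]:
    """Return the site IDs that have no TenantStatus in the cache, sorted.

    Both the cache layout (site ID -> entry dict) and the "tenant_status"
    key are assumptions made for this sketch.
    """
    missing = []
    for site_id in expected_site_ids:
        entry = permission_cache.get(site_id)
        # A site counts as missing if it has no cache entry at all,
        # or if its entry carries no tenant status value.
        if entry is None or entry.get("tenant_status") is None:
            missing.append(site_id)
    return sorted(missing)
```

A check of this shape could be run per cluster to list the affected sites before deciding which component to restart.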

Future consideration(s)

  • Auvik has created a “Runbook” to identify and resolve issues like this.
  • To address this issue more quickly in the future, Auvik will create an internal alarm that fires when the CDP does not update the permission cache properly.
  • Auvik is exploring automatically scheduled, periodic restarts of the CDP to refresh its resource usage.
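
One way an alarm like the one described above could work is a staleness check: record when each CDP topic last produced a message and flag any topic that has been silent for too long. The sketch below assumes a hypothetical mapping of topic names to last-message timestamps and an arbitrary silence threshold; neither reflects Auvik's actual monitoring.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def stale_topics(
    last_message_at: Dict[str, datetime],
    now: datetime,
    max_silence: timedelta,
) -> List[str]:
    """Return the topics whose most recent message is older than max_silence.

    last_message_at is a hypothetical map of topic name -> timestamp of the
    last message seen; all datetimes are expected to be timezone-aware.
    """
    return sorted(
        topic
        for topic, seen_at in last_message_at.items()
        if now - seen_at > max_silence
    )


# Illustrative values only; the timestamps and threshold are made up.
example_now = datetime(2021, 1, 1, 12, 0, tzinfo=timezone.utc)
example_last_seen = {
    "TenantStatus": example_now - timedelta(minutes=45),
    "OtherTopic": example_now - timedelta(minutes=2),
}
example_stale = stale_topics(example_last_seen, example_now, timedelta(minutes=15))
```

The threshold would need tuning per topic, since a quiet topic is not necessarily a stalled one.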
Posted Oct 20, 2021 - 14:52 EDT

Resolved
This incident has been resolved.
Posted Sep 23, 2021 - 18:02 EDT
Identified
We’ve identified the source of the service disruption and are working as quickly as possible to restore service.
Posted Sep 23, 2021 - 17:25 EDT
This incident affected: Network Mgmt (us2.my.auvik.com).