SaaS Manager - NA - SaaS Manager is currently unavailable
Incident Report for Flexera System Status Dashboard
Postmortem

Description: SaaS Manager - NA - Managed Applications Tab was not loading

Timeframe: November 16th, 1:29 AM to November 16th, 3:10 AM PST

Incident Summary

On Thursday, November 16th, at 1:29 AM PST, reports indicated that SaaS Manager customers were having difficulty accessing the Managed Applications tab, SaaS Dashboard, and PowerBI Dashboards. The issue affected only customers in the NA region; users in the EU and APAC regions were unaffected.

Upon further investigation at 1:59 AM PST, our technical teams identified a potential issue with workload scheduling: the orchestration system had unexpectedly scheduled twice as much workload onto a single node as that node could handle.

Further analysis at 2:45 AM PST revealed that a cluster upgrade, which introduced new nodes, had coincided with the workload balancing process. As the orchestration system attempted to distribute workloads across the cluster, it unintentionally concentrated them on a single node, causing operational disruptions.

At 3:10 AM PST, all pods on the affected node were started successfully. Subsequently, comprehensive health checks were conducted to confirm the operational status of all services and functionalities.

Following this validation, additional steps were taken to ensure there were no lingering issues affecting stability. After verifying the sustained operational state, the incident was considered resolved.

Root Cause

The incident occurred during a cluster upgrade that introduced new nodes and coincided with the workload balancing process. As the orchestration system attempted to distribute workloads across the cluster, it unintentionally concentrated them on a single node, causing operational disruptions.
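
The following Python sketch illustrates, in general terms, how a balancing pass can overload a freshly added node. It is not the orchestration system's actual logic, which this report does not describe. The `Node` class, the two placement functions, and all node names, capacities, and workload sizes are hypothetical. The naive version picks the emptiest node once and sends every pending workload there; the capacity-aware version re-checks free capacity before each placement.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    capacity: int  # hypothetical: maximum units of work the node can run
    assigned: list = field(default_factory=list)  # (workload_name, units) pairs

    @property
    def load(self) -> int:
        return sum(units for _, units in self.assigned)


def place_naive(nodes, workloads):
    """Choose the emptiest node once, then send every workload to it.

    The choice is never re-evaluated, so a new, empty node absorbs
    everything regardless of its capacity.
    """
    target = min(nodes, key=lambda n: n.load)
    for workload in workloads:
        target.assigned.append(workload)


def place_capacity_aware(nodes, workloads):
    """Place each workload on the node with the most free capacity.

    Free capacity is recomputed for every workload. Workloads that fit
    nowhere are returned instead of being forced onto a full node.
    """
    unplaced = []
    for name, units in workloads:
        candidates = [n for n in nodes if n.capacity - n.load >= units]
        if not candidates:
            unplaced.append((name, units))
            continue
        target = max(candidates, key=lambda n: n.capacity - n.load)
        target.assigned.append((name, units))
    return unplaced


def demo_cluster():
    """Two busy existing nodes plus one empty node added by an upgrade."""
    old_1 = Node("old-1", capacity=10, assigned=[("existing-a", 6)])
    old_2 = Node("old-2", capacity=10, assigned=[("existing-b", 6)])
    new_1 = Node("new-1", capacity=10)
    return [old_1, old_2, new_1]


# Ten pending workloads of 2 units each (20 units in total).
PENDING = [(f"job-{i}", 2) for i in range(10)]
```

With these made-up numbers, `place_naive(demo_cluster(), PENDING)` leaves `new-1` carrying 20 units against a capacity of 10, while `place_capacity_aware` keeps every node at or below capacity and returns the one workload that does not fit.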

Remediation Actions

  1. Node Recovery and Pod Initialization: Successfully started all pods on the affected node at 3:10 AM PST, restoring normal system operation.
  2. Comprehensive Health Checks (Post-Initiation): Conducted thorough health checks to confirm the operational status of all services and functionalities.
  3. Stability Verification and Additional Measures (Post-Health Checks): Verified the sustained operational state and took additional steps to ensure there were no lingering issues affecting stability, after which the incident was considered resolved.

Future Preventative Measure

Enhancement for System-Critical Pods: Implemented an enhancement to elevate the priority of system-critical pods, ensuring services initialize in the correct order. This prevents processing nodes from getting stuck and reduces the likelihood of recurrence.
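
As a rough illustration of priority-based startup ordering, the Python sketch below starts higher-priority pods first and keeps submission order among pods of equal priority. It is a minimal model, not the enhancement that was actually deployed; the `startup_order` function, the pod names, and the priority values are all hypothetical.

```python
import heapq
from itertools import count


def startup_order(pods):
    """Return pod names in start order.

    `pods` is an iterable of (name, priority) pairs. A higher priority
    starts earlier; pods with equal priority keep their submission order.
    """
    sequence = count()
    # heapq is a min-heap, so the priority is negated to pop the highest first.
    # The sequence number breaks ties in favour of earlier submissions.
    heap = [(-priority, next(sequence), name) for name, priority in pods]
    heapq.heapify(heap)

    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order


# Hypothetical pods: two system-critical ones submitted after ordinary ones.
PODS = [
    ("report-worker", 0),
    ("node-network-agent", 1000),
    ("dashboard-api", 0),
    ("node-dns", 1000),
]
```

Here `startup_order(PODS)` places `node-network-agent` and `node-dns` ahead of the two ordinary pods, so application pods are not started before the node-level services they depend on.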

Posted Dec 29, 2023 - 09:50 PST

Resolved
This incident has been resolved.
Posted Nov 16, 2023 - 03:53 PST

Investigating
Incident Description:
The SaaS Manager "managed applications" tab is not currently loading.

Priority: 1

Restoration activity:
Technical teams have been engaged and are currently investigating.
Posted Nov 16, 2023 - 02:26 PST
This incident affected: Flexera One - IT Asset Management - North America (IT Asset Management - US SaaS Manager).