Description: SaaS Manager - NA - Managed Applications Tab was not loading
Timeframe: November 16th, 1:29 AM to November 16th, 3:10 AM PST
Incident Summary
On Thursday, November 16th, at 1:29 AM PST, SaaS Manager customers began reporting difficulties accessing the Managed Applications tab, SaaS Dashboard, and PowerBI Dashboards. The issue affected only customers in the NA region; users in the EU and APAC regions were unaffected.
At 1:59 AM PST, our technical teams identified a potential issue with workload scheduling: the orchestration system had unexpectedly scheduled twice the workload that a single node can support onto one node.
Further analysis at 2:45 AM PST revealed that the cluster upgrade, which introduced new nodes, coincided with the workload balancing process. As the orchestration system attempted to distribute workloads across the cluster, it unintentionally concentrated them on a single node, causing operational disruptions.
At 3:10 AM PST, all pods on the affected node were started successfully. Subsequently, comprehensive health checks were conducted to confirm the operational status of all services and functionalities.
Following this validation, additional steps were taken to ensure there were no lingering issues affecting stability. After verifying the sustained operational state, the incident was considered resolved.
Root Cause
The incident occurred during the cluster upgrade, which introduced new nodes and coincided with the workload balancing process. As the orchestration system attempted to distribute workloads across the cluster, it unintentionally concentrated them on a single node, causing operational disruptions.
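The overcommitted state described above can be illustrated with a small check that sums the workload scheduled on each node and compares it with that node's capacity. This is a minimal sketch, not the orchestration system's own logic: the function name, the pod and node names, and the abstract "capacity units" are all assumptions made for illustration.

```python
from collections import defaultdict


def find_overcommitted_nodes(pod_requests, node_capacity):
    """Return {node: (requested, capacity)} for nodes scheduled beyond capacity.

    pod_requests: iterable of (pod_name, node_name, requested_units)
    node_capacity: dict mapping node_name -> capacity in the same units
    All names and units here are hypothetical, chosen for illustration.
    """
    totals = defaultdict(int)
    for _pod, node, requested in pod_requests:
        totals[node] += requested

    overcommitted = {}
    for node, total in totals.items():
        # A node missing from node_capacity is treated as having no capacity.
        capacity = node_capacity.get(node, 0)
        if total > capacity:
            overcommitted[node] = (total, capacity)
    return overcommitted


# Hypothetical snapshot: node-b carries twice the workload it can support.
pods = [
    ("api-1", "node-a", 2),
    ("api-2", "node-a", 2),
    ("dash-1", "node-b", 4),
    ("dash-2", "node-b", 4),
]
capacity = {"node-a": 4, "node-b": 4}
flagged = find_overcommitted_nodes(pods, capacity)
```

Run against a scheduling snapshot taken during an upgrade, a check of this kind would flag the affected node before its pods fail to start.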
Remediation Actions
Future Preventative Measure
Enhancement for System-Critical Pods: We raised the priority of system-critical pods so that services initialize in the correct order. This prevents processing nodes from getting stuck and reduces the likelihood of a recurrence.
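The effect of the priority change can be sketched as a simple ordering rule: pods with a higher priority value are started first, and pods with equal priority keep their original order. The pod names and priority values below are hypothetical, and the sketch does not reproduce the actual configuration applied. If the orchestration system is Kubernetes (an assumption; this report does not name it), pod priority is typically set through a PriorityClass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    priority: int  # higher value starts first; values are illustrative only


def startup_order(pods):
    """Return pod names sorted by descending priority.

    sorted() is stable, so pods with equal priority keep their input order.
    """
    return [pod.name for pod in sorted(pods, key=lambda pod: -pod.priority)]


# Hypothetical workload: two system-critical pods and two application pods.
pending = [
    Pod("dashboard", 0),
    Pod("dns", 1000),
    Pod("managed-apps", 0),
    Pod("network-agent", 1000),
]
order = startup_order(pending)
```

With these example values, the system-critical pods (`dns`, `network-agent`) are placed ahead of the application pods regardless of the order in which they were submitted.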