Description: Flexera One – NA – ITAM and other Modules Inaccessible
Timeframe: September 2nd, 2:18 PM to September 2nd, 4:00 PM PDT
Incident Summary:
On September 2nd, at 2:18 PM PDT, during the scheduled maintenance activities to replace one of the old servers used for service discovery, technical staff observed several 503 errors in the internal logs. The health checks also indicated that multiple modules within Flexera One may have experienced a service disruption.
As a result, some customers may have observed an error while accessing ITAM UI views within Flexera One. During the incident, technical staff also received multiple alerts for Cloud Cost Optimization and Automation services. Staff were able to reproduce the issue using the demo account and an internal test account for ITAM; however, no customers reported the issue during the outage.
At 2:22 PM PDT, technical staff attempted to unseal the server vault to regain access and resume the stopped operations, but observed several errors indicating that the whole server cluster was in an unhealthy state.
After further investigation, technical staff found that a step had been missed during the server replacement. This caused problems with the server cluster electing a new leader and brought the whole cluster down. At 3:33 PM PDT, technical staff attempted to redeploy and restart the impacted server but encountered errors for some dependent services. At 3:48 PM PDT, the dependent services were redeployed as well.
The internal load balancer logs then indicated successful connections, and 503 errors were no longer observed. After further health checks and monitoring, the incident was declared resolved at 4:00 PM PDT.
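The recovery check described above, confirming from load balancer logs that error responses have stopped, can be sketched as follows. This is a minimal illustration only and not the tooling used during the incident. The log layout (time, path, status code as whitespace-separated fields), the sample lines, and the function names `count_statuses` and `is_recovered` are all hypothetical.

```python
from collections import Counter


def count_statuses(log_lines, status_field=2):
    """Count HTTP status codes in whitespace-separated log lines.

    Assumption: the status code is the third field of each line.
    Real load balancer logs will use a different layout.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        # Skip lines that are too short or whose status field is not numeric.
        if len(fields) > status_field and fields[status_field].isdigit():
            counts[int(fields[status_field])] += 1
    return counts


def is_recovered(log_lines, error_statuses=(500, 503)):
    """Return True if the window has traffic and none of the error statuses."""
    counts = count_statuses(log_lines)
    has_traffic = sum(counts.values()) > 0
    return has_traffic and all(counts[s] == 0 for s in error_statuses)


# Hypothetical log windows, for illustration only.
during_incident = [
    "14:18:03 /itam/views 503",
    "14:18:04 /itam/views 503",
    "14:18:05 /automation/status 200",
]
after_redeploy = [
    "15:50:01 /itam/views 200",
    "15:50:02 /automation/status 200",
]

print(is_recovered(during_incident))
print(is_recovered(after_redeploy))
```

Requiring at least one logged request before reporting recovery avoids treating an empty log window, where no traffic reached the load balancer at all, as healthy.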
Root Cause:
Technical staff found that one of the steps was missed during the server replacement, resulting in issues with the server cluster electing a new leader, bringing the whole cluster down.
Corrective Actions:
• Technical staff redeployed the impacted services, following the correct steps and procedure
• Updated the runbooks to simplify and correct the server replacement and recovery instructions