Audit Entry updates have been disabled temporarily. Also, we are seeing intermittent issues with RightLink connectivity when scheduling RightScripts. Please restart the RightLink agent to resolve.
Incident Report for Flexera System Status Dashboard
Postmortem

Description:  

Customers experienced errors and were unable to utilize the Cloud Management Platform on Shards 3 and 4.

Timeframe:  September 3rd 3:30pm to September 4th 10:52am PDT

Incident Summary

On Thursday, September 3rd, customers using the Cloud Management Platform reported API endpoint errors and failing scripts. Errors also occurred in the Dashboard, Self Service and Governance components.

Investigations confirmed that error rates were much higher than average and concluded that higher-than-average customer traffic volumes were contributing to high CPU levels on the Audit instances. The impacted instances were scaled up to larger versions, which improved response times and lowered CPU loads. However, the error rates did not significantly improve.

On September 3rd at 9:55pm PDT the decision was made to take the Cloud Management, Self Service and Governance dashboards and audit-entry updates offline in order to investigate the issues in an isolated environment. These investigations revealed that the additional traffic volumes were the result of the RightNet routers being hit with millions of audit entry requests for a customer account that was no longer active.

Despite the account having been deleted, the CMP instances were still able to authenticate using the instance API token. The instance API token OAuth call should have checked that the account was not deleted. If an account is deleted, the call should return a 204 error. RightLink then does not authenticate, so it does not send audit entries at all and does not put excessive load on the Audit Entry Instances.
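The check described above can be sketched roughly as follows. This is a minimal illustration, not the actual CMP code: the function names, the in-memory token and account stores, and the 200 and 401 status codes are all assumptions made for the example. Only the 204 value for a deleted account comes from this report.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Status returned for a deleted account, as stated in this report.
DELETED_ACCOUNT_STATUS = 204
# The two codes below are assumptions chosen for the sketch.
OK_STATUS = 200
UNKNOWN_TOKEN_STATUS = 401


@dataclass
class Account:
    """Hypothetical account record; only the deleted flag matters here."""
    account_id: int
    deleted: bool = False


def authenticate_instance_token(
    token: str,
    tokens: Dict[str, int],
    accounts: Dict[int, Account],
) -> Tuple[int, Optional[int]]:
    """Return (status, account_id) for an instance API token.

    A valid token alone is not enough: the owning account must
    still exist and must not be marked as deleted.
    """
    account_id = tokens.get(token)
    if account_id is None:
        return UNKNOWN_TOKEN_STATUS, None

    account = accounts.get(account_id)
    if account is None or account.deleted:
        # The token is still known, but its account is gone.
        return DELETED_ACCOUNT_STATUS, None

    return OK_STATUS, account_id


def should_send_audit_entries(status: int) -> bool:
    """Agent-side rule in this sketch: only send when authenticated."""
    return status == OK_STATUS
```

With a check like this in place, an agent that still holds a token for a deleted account fails authentication and never starts sending audit entries.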

The account-deletion check was put into production manually at 10:07 am PDT on September 4th, and all functionality was re-enabled.

 Services were confirmed restored on September 4th at 10:52 am PDT.

Root Cause

  • The root cause of the high error rates and subsequent outage was found to be the Audit Entry Instances and RightNet Routers being overwhelmed with traffic from a former customer account.
  • Despite the account having been deleted, the CMP instances were still able to authenticate using the instance API token. The instance API token OAuth call should have checked that the account was not deleted. If an account is deleted, the call should return a 204 error. RightLink then does not authenticate, so it does not send audit entries at all and does not put excessive load on the Audit Entry Instances.

  

Corrective Action

  • No follow-up actions are required because the root cause was found and remediated during incident restoration activities.
Posted Oct 01, 2020 - 19:31 PDT

Resolved
This incident has been resolved.
Posted Sep 04, 2020 - 10:52 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Sep 04, 2020 - 10:07 PDT
Update
We are continuing to work on a fix for this issue.
Posted Sep 03, 2020 - 23:13 PDT
Update
We are continuing to work on a fix for this issue.
Posted Sep 03, 2020 - 22:54 PDT
Identified
In order to mitigate issues we're seeing with access to the us-3 and us-4 Cloud Management, Self Service and Governance dashboards, audit-entry updates in Cloud Management have been disabled temporarily.
Posted Sep 03, 2020 - 21:55 PDT
This incident affected: Legacy Cloud Management (Cloud Management Dashboard - Shard 3).