Customers experienced errors and were unable to use the Cloud Management Platform on Shards 3 and 4.
Timeframe: September 3rd 3:30pm to September 4th 10:52am PDT
On Thursday, September 3rd, customers using the Cloud Management Platform reported API endpoint errors and failing scripts. Errors also occurred in the Dashboard, Self Service and Governance components.
Investigation confirmed that error rates were much higher than average and concluded that higher-than-average customer traffic volumes were contributing to high CPU levels on the Audit instances. The impacted instances were scaled up to larger sizes, which improved response times and lowered CPU load. However, the error rates did not significantly improve.
On September 3rd at 9:55pm PDT the decision was made to take the Cloud Management, Self Service and Governance dashboards and audit entry updates offline in order to investigate the issues in an isolated environment. These investigations revealed that the additional traffic was the result of the RightNet routers being hit with millions of audit entry requests for a customer account that was no longer active.
Despite the account having been deleted, the CMP instances were still able to authenticate it using the instance API token. The instance API token OAuth call should have checked that the account had not been deleted. If an account is deleted, the call should return a 204 error. When that happens, RightLink does not authenticate and, as a result, does not send audit entries at all, which avoids putting excessive load on the Audit Entry instances.
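The deleted-account check described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual CMP or RightLink code: the names Account, authenticate_instance_token, send_audit_entries and the in-memory token table are all hypothetical, and the mapping of a rejected token to an HTTP status is deliberately left out.

```python
from dataclasses import dataclass


class InvalidTokenError(Exception):
    """Raised when the instance API token is not known at all."""


class AccountDeletedError(Exception):
    """Raised when the token belongs to an account that has been deleted."""


@dataclass
class Account:
    # Hypothetical stand-in for an account record.
    account_id: int
    deleted: bool = False


def authenticate_instance_token(token, tokens_to_accounts):
    """Return the account for a token, rejecting deleted accounts."""
    account = tokens_to_accounts.get(token)
    if account is None:
        raise InvalidTokenError("unknown instance API token")
    if account.deleted:
        # The check the report says should have been performed.
        raise AccountDeletedError(f"account {account.account_id} is deleted")
    return account


def send_audit_entries(token, entries, tokens_to_accounts, sink):
    """Forward entries only if authentication succeeds; return the count sent."""
    try:
        account = authenticate_instance_token(token, tokens_to_accounts)
    except (InvalidTokenError, AccountDeletedError):
        # No authentication means no audit entries and no downstream load.
        return 0
    for entry in entries:
        sink.append((account.account_id, entry))
    return len(entries)
```

With a table containing one live and one deleted account, send_audit_entries forwards entries only for the live account's token and returns 0 for the deleted one, so nothing reaches the audit sink on its behalf.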
This check was deployed to production manually at 10:07am PDT on September 4th, and all functionality was re-enabled.
Services were confirmed restored on September 4th at 10:52am PDT.