Description: Customers experienced failing RightScript executions and audit entries while using the Cloud Management Platform on Shard 4.
Timeframe: May 8th 01:54 am to May 8th 06:30 am PDT
Incident Summary
On Saturday, May 8th at 01:54 am PDT, customers using the Cloud Management Platform reported that RightScript executions and audit entries were failing. Technical teams were engaged and confirmed the customer reports. While the issue was being investigated, services began operating normally again, and technical staff were unable to identify either the cause of the failures or what had resolved them.
After monitoring services for a number of hours, the incident was declared resolved at 06:30 am PDT on May 8th.
On Monday, May 10th, further investigation revealed that a number of instances had not been “discovered” and were therefore not available to manage. Technical staff ran a manual cleanup script to update the database with the missing instances.
Root Cause
• Despite a thorough investigation by Engineering and SRE staff, the root cause of these issues could not be determined.
Corrective Action
• Additional logging has been enabled in case this event recurs.
• Auditing processes have been enhanced to raise an alert whenever new instances are not updated by discovery services.
• Auditing processes will now automatically update any “undiscovered” instances on an hourly basis.
• RightNet Agent management services have been enhanced with more comprehensive alerting and monitoring, including the ability to automatically repair Agent disconnects on an hourly basis.