Description: Customers experienced degraded performance while using the Cloud Management Platform on Shard 4.
Timeframe: February 8th 11:49pm to February 9th 2:20am PST
Incident Summary
On February 8th at 11:49pm PST, customers using the Cloud Management Platform began to experience degraded performance while running RightScript executions and Instance Operations on Shard 4.
Monitoring systems automatically alerted technical teams to high error rates, and the teams responded promptly. Investigation confirmed that error rates on the Shard 4 Router service were significantly higher than normal. Additional subject matter experts were engaged to assist, and the investigation uncovered a deprecated configuration item that was causing the Shard 4 Router service to attempt to connect to a service on a recently decommissioned Shard.
The configuration item was removed, and services were confirmed restored at 2:20am PST on February 9th.
Root Cause
· The root cause of the high error rates and subsequent performance degradation was the Shard 4 Router service attempting to connect to a service in the recently decommissioned Shard 10.
· The team decommissioning Shard 10 was not aware of the configuration dependency in the Shard 4 Router service and, as a result, did not remove it as part of that activity. The dependency was not discovered in testing because it was not present in any of the other Development or Production Shards.
Corrective Action
· Decommissioning run books have been updated to include a check for inter-Shard dependencies before any future Shard is decommissioned.
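
As an illustration of the kind of inter-Shard dependency check described above, the following is a minimal sketch only. It assumes service configuration can be loaded as a mapping of service name to key/value settings. The function name, service names, configuration keys and URLs are all hypothetical and are not taken from the platform's actual configuration.

```python
import re


def find_shard_references(configs, decommissioned_shard):
    """Return (service, key, value) for every setting whose value mentions the Shard."""
    # Matches "shard10", "shard-10", "shard_10" or "Shard 10", but not "shard100".
    pattern = re.compile(
        rf"shard[\s_-]?{decommissioned_shard}(?!\d)", re.IGNORECASE
    )
    hits = []
    for service, settings in configs.items():
        for key, value in settings.items():
            if pattern.search(str(value)):
                hits.append((service, key, value))
    return hits


# Hypothetical configuration data, invented for illustration only.
example_configs = {
    "shard4-router": {
        "upstream": "https://core.shard4.example.com",
        "legacy_peer": "https://core.shard10.example.com",
    },
    "shard3-router": {
        "upstream": "https://core.shard3.example.com",
    },
}

stale = find_shard_references(example_configs, 10)
for service, key, value in stale:
    print(f"{service}: {key} still points at {value}")
```

Run against the configuration of every remaining Shard before a decommission, a check of this kind would list any setting that still refers to the Shard being retired, including one that exists on only a single Shard.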