CMP - Cloud Management RightNet Shard 4 system degraded
Incident Report for Flexera System Status Dashboard
Postmortem

Description:  Customers experienced degraded performance while using the Cloud Management Platform on Shard 4.

Timeframe:  February 8th 11:49pm to February 9th 2:20am PST

Incident Summary

On February 8th at 11:49pm, customers using the Cloud Management Platform began to experience degraded performance while running Rightscript executions and Instance Operations on Shard 4.

Monitoring systems automatically alerted technical teams to high error rates, and the teams responded promptly. Investigations confirmed that error rates on the Shard 4 Router service were significantly higher than normal. Additional subject matter experts were engaged to assist with the investigation, which led to the discovery of a deprecated configuration item that was causing the Shard 4 Router service to attempt to connect to a service on a recently decommissioned Shard.

This configuration item was removed, and services were confirmed restored at 2:20am PST on February 9th.

Root Cause

·        The root cause of the high error rates and subsequent performance degradation was the Shard 4 Router service attempting to connect to a service in the recently decommissioned Shard 10.

·        The team decommissioning Shard 10 was not aware of the configuration dependency in the Shard 4 Router service and, as a result, did not remove it as part of the decommissioning. The dependency was not discovered in testing because it was not present in any of the other Development or Production Shards.

 

Corrective Action

 ·        Documentation has been updated so that future decommissioning runbooks include a check for additional inter-Shard dependencies.
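
As a rough illustration of the kind of inter-Shard dependency check described above, the sketch below scans service configuration values for references to decommissioned Shards. It is not the actual tooling used: the function name, the configuration layout, the service names, and the hostnames are all hypothetical and chosen only for the example.

```python
import re


def find_decommissioned_refs(configs, decommissioned):
    """Return (service, key, value) tuples whose value mentions a decommissioned shard.

    `configs` maps a service name to a dict of its settings (hypothetical layout).
    `decommissioned` is an iterable of shard numbers that no longer exist.
    """
    findings = []
    for service, settings in configs.items():
        for key, value in settings.items():
            for shard in decommissioned:
                # Match "shard10", "shard-10" or "shard_10", but not "shard100".
                pattern = rf"shard[-_]?{shard}(?!\d)"
                if re.search(pattern, str(value), re.IGNORECASE):
                    findings.append((service, key, value))
                    break  # one finding per setting is enough
    return findings


# Hypothetical configuration data, for illustration only.
configs = {
    "shard4-router": {
        "upstream": "router.shard4.example.internal",
        "legacy_peer": "data.shard10.example.internal",
    },
    "shard3-router": {
        "upstream": "router.shard3.example.internal",
    },
}

for service, key, value in find_decommissioned_refs(configs, decommissioned=[10]):
    print(f"{service}: {key} -> {value}")
```

A check along these lines could be run against every remaining Shard before a decommissioning is signed off, so that a dependency present on only one Shard is still caught.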

Posted Mar 03, 2021 - 15:09 PST

Resolved
An issue was identified in the underlying RightNet Data Service infrastructure and has been fixed. Rightscript executions and instance operations have been restored.

Some customers' instances may not have recovered automatically from this issue. If you are still experiencing issues, please restart the Rightlink agent on affected instances.

https://docs.rightscale.com/rl10/reference/10.6.0/rl10_troubleshooting.html
Posted Feb 09, 2021 - 02:52 PST
Investigating
The RightNet infrastructure network in Shard 4 is currently degraded. This affects Rightscript executions and instance operations.
Posted Feb 09, 2021 - 01:32 PST
This incident affected: Legacy Cloud Management (Cloud Management Dashboard - Shard 4, Self-Service - Shard 4).