Description: Cloud Management Platform (Self-Service – Shard 4) - Delay in launching Cloud Apps and processing custom actions
Timeframe: March 18th, 2024, 2:30 PM to March 19th, 2024, 2:34 PM PDT
Incident Summary
On Monday, March 18th, 2024, at 2:30 PM PDT, our monitoring system alerted us to service degradation affecting the Cloud Management Platform (CMP) for customers hosted on CMP Self-Service - Shard 4. This degradation may have impacted customers' ability to launch Cloud Apps and process custom actions in the NAM region.
Investigation revealed that workflow service calls were hitting provider rate limits; each throttled call triggered multiple retries, which compounded the load and left the services unresponsive. To address this, the team redeployed all workflow services, temporarily restoring functionality.
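The retry amplification described above is a common failure mode: when every throttled call retries immediately, the load on an already rate-limited service multiplies. As an illustrative sketch only (the function and parameter names are hypothetical, not taken from the incident tooling), retries with capped exponential backoff and jitter avoid this by spreading repeat attempts out over time:

```python
import random
import time


class RateLimitError(Exception):
    """Raised when the downstream service returns a throttling response."""


def call_with_backoff(request_fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a rate-limited call with capped exponential backoff and full jitter.

    Immediate retries multiply load on an already-throttled service;
    backing off with jitter spreads retries out instead.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; surface the error to the caller
            # Full jitter: sleep a random amount up to the capped exponential delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

Under this scheme a sustained throttling episode produces a bounded, decorrelated retry stream rather than a synchronized burst.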
However, the high backlog of requests queued on the service, combined with incoming new requests, caused the services to become unresponsive again, triggering another alert around 5:15 PM PDT. Shortly after, around 5:45 PM PDT, customers reported issues with the CMP platform.
The influx of requests from the backlog queue and ongoing customer maintenance activities overwhelmed the system, and redeploying the services did not resolve the issue. Additional subject matter experts were engaged to aid the investigation. After extensive debugging and evaluation of several options, the team adjusted instance types and reduced the overall capacity of the affected services around 11:00 PM PDT, which restored service functionality and allowed the backlog queue to drain gradually.
Our customers were subsequently informed of the implemented changes and the potential for slowness during cloud application launches and custom action executions due to the backlog queue.
To expedite backlog processing, we asked the service provider to increase the limit on certain APIs. Despite previous challenges with such requests, the provider acknowledged this one, with an ETA of the following day, Tuesday, March 19th, for the change to take effect.
Meanwhile, one of our customers commenced their maintenance activity on a subset of planned environments around 6:00 AM PDT on Tuesday. The activity succeeded, although it may have taken longer than usual.
At 7:00 AM PDT, our service provider applied the limit increase. Its effect was evident immediately: more backlog requests were processed, stalled processes were cleared, and services were scaled back up. No further rate limit errors were observed.
The platform was monitored extensively before officially declaring the incident resolved on Tuesday, March 19th, 2024, at 2:34 PM PDT.
Root Cause
Primary Root Cause:
The primary root cause of the incident was the provider's rate limit on service calls: throttled calls and their retries accumulated into a backlog, which led to service degradation.
Contributing Factors:
Additionally, a security update the previous night, involving the rotation of an authentication key, compounded the issue. Combined with the existing queue backlog, it intensified the throttling and further degraded service performance.
Remediation Actions
Future Preventative Measure
Enhanced Quota: We collaborated with our service provider to increase the quota to twice the previous limit, ensuring ample headroom to accommodate additional calls in the future. Since implementing this change, we have not observed any recurrence of the issue.
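Alongside a larger provider quota, client-side throttling can keep outbound call volume below a known limit so that bursts never reach the provider's throttle in the first place. A minimal token-bucket sketch (the class name and the rate/capacity values are illustrative assumptions, not the actual quota):

```python
import time


class TokenBucket:
    """Client-side limiter: allow at most `rate` calls/second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start full so initial bursts are allowed
        self.last = time.monotonic()

    def try_acquire(self):
        """Consume a token and return True if the call may proceed now, else False."""
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Callers that receive False can queue or delay the request locally, smoothing demand instead of relying on the provider to reject it.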