Description: Cloud Management Platform (Self-Service – Shard 4) - Delay in launching Cloud Apps and processing custom actions
Timeframe: March 18th, 2024, 2:30 PM to March 19th, 2024, 2:34 PM PDT
Incident Summary
On Monday, March 18th, 2024, at 2:30 PM PDT, our monitoring system alerted us to service degradation affecting the Cloud Management Platform (CMP) for customers hosted on CMP Self-Service - Shard 4. This degradation may have impacted customers' ability to launch Cloud Apps and process custom actions in the NAM region.
Investigation revealed that workflow service calls were hitting provider rate limits; each throttled call triggered multiple retries, which compounded the load and left the services unresponsive. To address this, the team redeployed all workflow services, temporarily restoring functionality.
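The retry amplification described above is a common failure mode: when every throttled call retries immediately, the load on an already rate-limited service multiplies. As an illustrative sketch only (the function and parameter names are hypothetical, not taken from the incident tooling), retries with capped exponential backoff and jitter avoid this by spreading repeat attempts out over time:

```python
import random
import time


class RateLimitError(Exception):
    """Raised when the downstream service returns a throttling response."""


def call_with_backoff(request_fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a rate-limited call with capped exponential backoff and full jitter.

    Immediate retries multiply load on an already-throttled service;
    backing off with jitter spreads retries out instead.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; surface the error to the caller
            # Full jitter: sleep a random amount up to the capped exponential delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

Under this scheme a sustained throttling episode produces a bounded, decorrelated retry stream rather than a synchronized burst.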
However, the high backlog of requests queued on the service, combined with incoming new requests, caused the services to become unresponsive again, triggering another alert around 5:15 PM PDT. Shortly after, around 5:45 PM PDT, customers reported issues with the CMP platform.
The influx of requests from the backlog queue and ongoing customer maintenance activities overwhelmed the system, and redeploying the services did not resolve the issue. Additional subject matter experts were engaged to aid the investigation. After extensive debugging and evaluation of several options, the team adjusted instance types and reduced the overall capacity of the affected services around 11:00 PM PDT, which restored service functionality and allowed the backlog queue to drain gradually.
Our customers were subsequently informed of the implemented changes and the potential for slowness during cloud application launches and custom action executions due to the backlog queue.
To expedite backlog processing, we asked the service provider to increase the limit on certain APIs. Despite previous challenges with such requests, the provider acknowledged this one, with an ETA of the following day, Tuesday, March 19th, for the change to take effect.
Meanwhile, one of our customers commenced their maintenance activity on a subset of planned environments around 6:00 AM PDT on Tuesday. The activity succeeded, although it may have taken longer than usual.
At 7:00 AM PDT, our service provider applied the limit increase. Its effect was evident immediately: more backlog requests were processed, stalled processes were cleared, and services were scaled back up. No further rate limit errors were observed.
The platform was monitored extensively before officially declaring the incident resolved on Tuesday, March 19th, 2024, at 2:34 PM PDT.
Root Cause
Primary Root Cause:
The primary root cause of the incident was the provider's rate limit on service calls: throttled calls and their retries accumulated into a backlog, which led to service degradation.
Contributing Factors:
Additionally, a security update the previous night, involving the rotation of an authentication key, compounded the issue. Combined with the existing queue backlog, it intensified the throttling and further degraded service performance.
Remediation Actions
Future Preventative Measure
Enhanced Quota: We collaborated with our service provider to increase the quota to twice the previous limit, ensuring ample headroom to accommodate additional calls in the future. Since implementing this change, we have not observed any recurrence of the issue.
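Alongside a larger provider quota, client-side throttling can keep outbound call volume below a known limit so that bursts never reach the provider's throttle in the first place. A minimal token-bucket sketch (the class name and the rate/capacity values are illustrative assumptions, not the actual quota):

```python
import time


class TokenBucket:
    """Client-side limiter: allow at most `rate` calls/second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start full so initial bursts are allowed
        self.last = time.monotonic()

    def try_acquire(self):
        """Consume a token and return True if the call may proceed now, else False."""
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Callers that receive False can queue or delay the request locally, smoothing demand instead of relying on the provider to reject it.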