Description: Flexera One - Cloud Cost Optimization - NAM - Bill Processing Delayed
Timeframe: December 22nd, 10:00 AM to December 24th, 10:04 AM PST
Incident Summary
On Friday, December 22nd, 2023, at 10:00 AM PST, we encountered an issue in the NAM region where new Cloud Cost Optimization data for multiple organizations was not being processed. Although the UI remained functional, affected organizations may have experienced delays in bill processing.
Upon further investigation, we discovered that one of the data processing clusters had failed. To address the situation promptly, our technical team migrated processing to a new cluster at 12:45 PM PST and then expedited manual processing for the affected organizations.
However, during a subsequent review at 5:46 PM PST, our team found that data for only a limited number of organizations had been processed successfully before the same problem recurred. The technical teams continued manual data processing for the affected organizations while simultaneously working to identify the root cause.
On December 23rd, the team implemented additional enhancements to address the issue, including limiting the number of organizations processed concurrently (a sketch of this approach follows below). These efforts significantly reduced the backlog by the afternoon. However, when the team attempted to raise the number of concurrently processed organizations, the services crashed again.
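For illustration only, a minimal sketch of this kind of concurrency cap, assuming a Go worker that handles one organization at a time; processOrganization, the organization IDs, and the limit value are hypothetical placeholders, not Flexera code:

```go
// Illustrative sketch: cap how many organizations are processed at once
// using a buffered channel as a semaphore.
package main

import (
	"fmt"
	"sync"
)

// processOrganization stands in for the real bill-processing work.
func processOrganization(orgID string) {
	fmt.Println("processed", orgID)
}

func main() {
	orgIDs := []string{"org-1", "org-2", "org-3", "org-4", "org-5"}

	const maxConcurrent = 2 // lowered cap to stay within cluster resource limits
	sem := make(chan struct{}, maxConcurrent)
	var wg sync.WaitGroup

	for _, id := range orgIDs {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot; blocks once maxConcurrent is reached
		go func(id string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when this org is done
			processOrganization(id)
		}(id)
	}
	wg.Wait()
}
```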
Upon analyzing the error logs at 1:39 PM PST, we identified an ongoing issue tied to our service provider. Further investigation uncovered a limit on the maximum number of open file connections in specific clusters, introduced by a recent node version release from our service provider. Notably, services in the affected clusters were running a different node version than those in unaffected clusters, which contributed to the problem.
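For context, a hypothetical diagnostic sketch that prints a process's open-file limit; a node-level change that lowers this kind of per-process cap is the sort of constraint described above. The snippet is illustrative and not part of our production tooling:

```go
// Illustrative diagnostic: report the soft and hard limits on open file descriptors.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rlim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rlim); err != nil {
		fmt.Println("getrlimit failed:", err)
		return
	}
	// If the soft limit is unexpectedly low, work that opens many files or
	// connections concurrently can fail with "too many open files" errors.
	fmt.Printf("open-file limit: soft=%d hard=%d\n", rlim.Cur, rlim.Max)
}
```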
To remediate the problem, our service provider initiated a configuration rollback. At 7:56 PM PST, our team then aligned our configuration with the service provider's settings and increased concurrency to expedite processing.
Following these changes, we continued to observe stable processing. The next day, on Sunday, December 24th, at 10:04 AM PST, our technical team confirmed that the backlog had been processed successfully, and the incident was declared resolved.
Root Cause
The issue originated from recent node version upgrades performed by our service provider in specific clusters. The new version imposed a lower limit on the maximum number of open file connections, which caused the data processing services in those clusters to fail when processing organizations concurrently.
Remediation Actions
Future Preventative Measure
Continued Stability Initiatives: Leveraging the lessons learned from this incident, we are committed to ongoing stability measures. The actions taken during resolution have restored stability, and no recurrence has been observed. Going forward, we will focus on earlier detection of similar issues and continued close collaboration with our service provider, with the goal of faster recovery and proactive maintenance of system stability.