Cloud Cost Optimization - NAM - Bill Processing Delayed
Incident Report for Flexera System Status Dashboard
Postmortem

Description: Flexera One - Cloud Cost Optimization - NAM - Bill Processing Delayed

Timeframe: December 22nd, 2023, 10:00 AM PST to December 24th, 2023, 10:04 AM PST

Incident Summary

On Friday, December 22nd, 2023, at 10:00 AM PST, we encountered an issue in the NA region where new Cloud Cost Optimization data for multiple organizations was not being processed. Although the UI remained functional, organizations might have experienced delays in bill processing.

Further investigation showed that one of the data processing clusters had failed. To address the situation promptly, our technical team migrated to a new cluster at 12:45 PM PST and then expedited manual processing for the affected organizations.

However, during a subsequent review at 5:46 PM PST, our team discovered that data for only a limited number of organizations had been processed successfully before the same problem recurred. The technical teams continued manual data processing for the affected organizations while working to identify the root cause.

On December 23rd, the team implemented additional enhancements to address the issue, including limiting concurrent processing. These efforts significantly reduced the backlog by the afternoon. However, when the team attempted to increase the number of concurrently processed organizations, the services crashed again.
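
For illustration only, the general technique of limiting concurrent processing can be sketched as below. This is a minimal sketch, not the actual implementation: the names process_organization and process_backlog, the max_concurrent parameter, and the use of Python's asyncio are all assumptions made for the example.

```python
import asyncio


async def process_organization(org_id: str) -> str:
    # Hypothetical stand-in for the real bill-processing work.
    await asyncio.sleep(0.01)
    return org_id


async def process_backlog(org_ids, max_concurrent=2):
    """Process every organization, never running more than
    max_concurrent of them at the same time."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0  # organizations being processed right now
    peak = 0       # highest in_flight value observed

    async def worker(org_id):
        nonlocal in_flight, peak
        async with semaphore:  # waits here while the limit is reached
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await process_organization(org_id)
            finally:
                in_flight -= 1

    # gather returns results in the same order as org_ids.
    results = await asyncio.gather(*(worker(o) for o in org_ids))
    return results, peak
```

Raising max_concurrent drains a backlog faster but also increases the number of resources (such as open connections) held at once, which is the trade-off described above.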

Upon analyzing the error logs at 1:39 PM PST, we identified an ongoing issue tied to our service provider. Further investigation uncovered a limitation on the maximum allowed file connections in specific clusters, attributed to a recent node version release by our service provider. Notably, services in these clusters operated on a different node version than those in unaffected clusters, contributing to the problem.
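
As a hedged illustration of how such a limit constrains concurrency, the sketch below reads a process's open-file limit and estimates how many workers fit under it. It assumes that the "file connections" limit corresponds to the operating system's per-process open-file limit (RLIMIT_NOFILE on Unix-like systems), which this report does not confirm; the function names, the files_per_worker estimate, and the reserve value are hypothetical.

```python
import resource  # Unix-only standard library module


def current_open_file_limit() -> int:
    # Soft limit on open file descriptors for this process.
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def max_safe_workers(soft_limit: int, files_per_worker: int,
                     reserve: int = 64) -> int:
    """Estimate how many workers fit under soft_limit, keeping
    `reserve` descriptors free for logs, sockets, and the runtime.

    An unlimited soft limit (resource.RLIM_INFINITY) is not handled
    in this sketch.
    """
    if files_per_worker <= 0:
        raise ValueError("files_per_worker must be positive")
    usable = max(soft_limit - reserve, 0)
    return usable // files_per_worker
```

For example, with a soft limit of 1024, a reserve of 64, and an assumed 32 descriptors per worker, the estimate is (1024 - 64) // 32 = 30 workers. If a platform change lowers the limit, the same concurrency setting can start to fail.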

To remediate the problem, our service provider initiated a configuration rollback. Subsequently, at 7:56 PM PST, our team aligned our configuration with the service provider's settings. Following this adjustment, the team increased concurrency, aiming to expedite processing.

Following these changes, we continued to observe positive outcomes. The next day, on Sunday, December 24th, at 10:04 AM PST, our technical team confirmed the successful processing of the backlog. Subsequently, the incident was declared resolved.

Root Cause

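The issue originated from recent node version upgrades performed by our service provider in specific clusters. The new node version introduced a limitation on the maximum allowed file connections, which caused the data processing services in those clusters to fail.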

Remediation Actions

  1. Configuration Rollback: The service provider initiated a configuration rollback to address the issue tied to its recent node version release.
  2. Configuration Alignment: On December 23rd at 7:56 PM PST, our team aligned our configuration with the service provider's settings to mitigate the impact of the issue.
  3. Concurrency Increase: Following the configuration adjustment, the team increased concurrency to expedite overall processing.
  4. Continuous Monitoring and Health Checks: Throughout the incident, the team performed continuous monitoring and health checks to ensure that the backlog was processed successfully.

Future Preventative Measure

Continued Stability Initiatives: Leveraging the lessons learned from this incident, we are committed to implementing ongoing stability measures. The actions taken have provided stability, and no recurrence has been observed since the incident was resolved. Going forward, we are working toward earlier detection of potential issues and maintaining continuous collaboration with the service provider. These efforts aim to enable faster recovery and proactive measures that uphold system stability.

Posted Jan 30, 2024 - 09:43 PST

Resolved
The remaining backlog has been successfully processed, and this incident has been resolved.
Posted Dec 24, 2023 - 10:00 PST
Update
The recent changes have resulted in positive outcomes. Our technical team is actively monitoring the remaining backlog, and processing is progressing as anticipated.
Posted Dec 24, 2023 - 00:00 PST
Update
We have reverted the configuration and have not observed any additional failures. Additionally, we have improved the infrastructure services to facilitate faster processing. We plan to monitor the system for a while before considering any further enhancements.
Posted Dec 23, 2023 - 20:11 PST
Update
After discussions with our vendor and the implementation of remediation measures by them, we are now taking the necessary steps to revert to the previous configurations. Our efforts to address this issue are ongoing, and we will continue to provide updates as progress is achieved.
Posted Dec 23, 2023 - 17:40 PST
Update
We encountered service failures again during the backlog processing. Upon thorough investigation, it seems that the issue may be connected to our vendor. We have established contact with our vendor and are actively working to promptly resolve the issue.
Posted Dec 23, 2023 - 14:30 PST
Update
We are making steady progress, and the majority of the backlog has been successfully addressed.
Posted Dec 23, 2023 - 08:37 PST
Monitoring
We implemented a code change last night to optimize data processing and restore services, yielding positive outcomes and a significant reduction in the backlog. Despite this progress, a substantial volume of data still awaits processing. Our technical team will continue to diligently monitor and work through the remaining tasks. We will provide updates as we make further progress.
Posted Dec 23, 2023 - 05:15 PST
Update
Our technical team is actively working on the issue to restore stability. While they successfully imported some data manually, our recent measures to resolve the underlying problem have encountered failures, prompting technicians to reassess the situation and explore alternative solutions.
Posted Dec 22, 2023 - 21:15 PST
Update
We are actively addressing the issue and have made progress in isolating the potential problem. Simultaneously, we are implementing manual processing measures for certain data.
Posted Dec 22, 2023 - 18:07 PST
Investigating
We experienced initial success, but we subsequently faced a recurrence of the same problem. Technical teams are investigating further to address the underlying issue.
Posted Dec 22, 2023 - 15:36 PST
Update
The team has successfully implemented a solution and observed positive outcomes. Simultaneously, we are taking manual steps to expedite the processing. We will continue to monitor the situation and provide updates as progress is made.
Posted Dec 22, 2023 - 13:15 PST
Identified
We are actively addressing the ongoing issue. Our technical team has identified a potential issue and is working on a remediation plan. Updates will be provided as progress is made.
Posted Dec 22, 2023 - 12:19 PST
Investigating
Incident Description: We are currently experiencing an issue in the NA region where new Cloud Cost Optimization data for multiple organizations is not being processed. While the UI remains functional, organizations may be experiencing delays in bill processing.

Priority: P2

Restoration Activity: Our technical teams are actively involved and are evaluating the situation. Additionally, we are exploring potential solutions to rectify the issue as quickly as possible. We sincerely apologize for any inconvenience this may have caused.
Posted Dec 22, 2023 - 10:46 PST
This incident affected: Flexera One - Cloud Management - North America (Cloud Cost Optimization - US).