Flexera One – Cloud Cost Optimization – NA, EU & APAC – Azure Data Import Failure
Incident Report for Flexera System Status Dashboard
Postmortem

Description: Flexera One – Cloud Cost Optimization – NA, EU & APAC – Azure Data Import Failure

Timeframe: May 2nd, 12:00 AM to May 9th, 12:00 PM PDT

Incident Summary

We encountered a service disruption on May 2nd, which impacted the availability of Azure bills and policies. Consequently, customers may have faced difficulties when accessing new Azure Policy imports through the Cloud Cost Optimization Platform, leading to the unavailability of precise billing data for our customers.

Upon conducting a thorough examination, we reached out to our service provider and discovered that they had implemented changes to their API configuration on May 2nd. These alterations directly impacted the connectivity of Azure policies, resulting in the emergence of subsequent issues. The issue remained undetected initially because of insufficient monitoring measures in place. However, on May 8th, we began receiving multiple reports from the impacted customers, which alerted us to the situation.

To address the issue and minimize its impact on our external customers, we implemented a workaround on May 8th at 9 PM PDT. This involved refining our data retrieval process by adopting a segmented approach, making separate API calls in smaller segments instead of retrieving all the data at once. This adjustment significantly enhanced the efficiency of data handling and processing, leading to improved overall performance and reduced strain on the system.

After implementing the workaround, data processing operations resumed their normal functioning. However, due to the substantial volume of data, the system experienced a backlog, resulting in a delay. It took several hours to fully process and catch up on the data, with the successful backlog completion occurring on May 9th at 12:00 PM PDT.

On May 14th, our service provider successfully deployed a hotfix to mitigate the issues stemming from the API configuration. This intervention enabled the retrieval of a larger segment of data from the API without encountering any further complications.

Root Cause

Upon investigation, it was determined that the root cause of the service disruption was attributed to changes made by our service provider to their API configuration on May 2nd. These changes had unintended consequences, impacting the functionality and proper integration of Azure bills and policies.

Remediation Steps

  1. Immediate Communication: Upon receiving reports of the service disruption, we promptly initiated communication with our service provider to investigate the issue and identify the root cause.
  2. Workaround Implementation: On May 8th, at 9 PM PDT, we implemented a workaround to address the ongoing issues impacting Azure policies. The workaround involved refining our data retrieval process by adopting a segmented approach; making separate API calls in smaller segments. This adjustment improved data handling efficiency and reduced strain on the system.
  3. Backlog Processing: As a result of the workaround, the system experienced a backlog due to the substantial volume of data. We worked diligently to process and catch up on the backlog, which was fully resolved on May 9th at 12:00 PM PDT.
  4. Hotfix Deployment: On May 14th, our service provider deployed a hotfix to mitigate the API configuration issues. This intervention successfully enabled the retrieval of a larger segment of data from the API without further complications.

Future Preventative Measures

  1. Enhanced Monitoring: We acknowledge the importance of robust monitoring measures to proactively detect and address potential issues. To enhance our proactive approach, we have implemented robust monitoring measures, including specific alerts, to promptly detect and address any data unavailability from the service provider. This allows us to swiftly respond and resolve such issues in the future.
  2. Regular Service Provider Communication: We will maintain open and regular communication with our service provider to stay informed about any upcoming changes or updates that may impact our services. This proactive approach will allow us to prepare and mitigate potential disruptions effectively.
  3. Continuous Improvement: We are committed to continuously improving our systems and processes based on customer feedback and industry best practices. Regular evaluations and updates will be implemented to enhance overall service reliability and performance.
Posted Jun 01, 2023 - 11:45 PDT

Resolved
This incident has been resolved.
Posted May 09, 2023 - 16:29 PDT
Update
Our technical team has been monitoring the environment and verified that data processing is now functioning normally. Data syncing is currently in progress.

Note: This issue only impacted a specific subset of Azure customers with an EA agreement, while others were not affected.
Posted May 09, 2023 - 10:38 PDT
Monitoring
Incident Description: We are currently experiencing a service disruption that affects the availability of Azure bills and policies. As a result, customers may encounter issues when accessing new Azure Policy imports through the Cloud Cost Optimization Platform.

Priority: P2

Restoration Activity: We have identified the issue to be on the vendor's side, and they have taken necessary actions to mitigate the problem. We are closely monitoring the environment for any anomalies to ensure seamless processing.
Posted May 09, 2023 - 08:00 PDT
This incident affected: Flexera One - Cloud Management - Europe (Cloud Cost Optimization - EU) and Flexera One - Cloud Management - North America (Cloud Cost Optimization - US).