Description: Flexera One – IT Asset Management – EU & NA – Multiple UI Outages
Timeframe:
Region | Start Time (PST)       | End Time (PST)         | Duration
EU     | February 6th, 1:27 AM  | February 6th, 3:17 AM  | 1 hour, 50 minutes
NA     | February 6th, 10:51 AM | February 6th, 11:37 AM | 46 minutes
EU     | February 6th, 11:22 PM | February 7th, 12:40 AM | 1 hour, 18 minutes
NA     | February 7th, 12:15 PM | February 7th, 2:24 PM  | 2 hours, 9 minutes
EU     | February 7th, 11:55 PM | February 8th, 12:29 AM | 34 minutes
Incident Summary
We recently experienced recurring disruptions in our IT Asset Management Platform, impacting operations in both the EU and NA regions. These disruptions caused intermittent downtime and errors for users attempting to access IT Asset Management services.
The initial major outage occurred on February 6th at 1:27 AM PST, affecting the EU region. Preliminary investigations revealed a complex problem involving recurrent system overload, which caused database connectivity and user interface failures. Although services were restored at 3:17 AM PST, the root cause remained unidentified. As a temporary measure, the team increased the CPU count to support more concurrent database connections and suspended specific datasets to reduce load in both the US and EU production environments.
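For illustration, the sketch below shows how a connection ceiling might be derived from an increased CPU count. The report does not describe the platform's actual database or pool settings, so every name here (CPU_COUNT, POOL_PER_CPU, max_pool_size) is a hypothetical stand-in.

```python
# Hypothetical sketch: raising a database connection ceiling after a CPU
# scale-up. POOL_PER_CPU is an illustrative assumption, not a value taken
# from the Flexera One platform.
import os

CPU_COUNT = os.cpu_count() or 4   # vCPUs available after the scale-up
POOL_PER_CPU = 4                  # assumed safe connections per vCPU

def max_pool_size(cpus: int, per_cpu: int = POOL_PER_CPU) -> int:
    """Return a conservative upper bound for concurrent DB connections."""
    return cpus * per_cpu

print(f"Suggested connection ceiling: {max_pool_size(CPU_COUNT)}")
```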
However, the disruption resurfaced later that day, this time impacting the NA region. To mitigate it, a failover to the secondary node was initiated, restoring service at 11:37 AM PST. The root cause nonetheless remained unclear, and investigations continued.
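A failover of this kind is typically driven by a health probe. The minimal sketch below assumes an HTTP health endpoint and illustrative hostnames; neither is taken from this report.

```python
# Hypothetical failover probe: route traffic to the secondary node when the
# primary stops answering its health endpoint. The hostname and /health path
# are illustrative assumptions.
import urllib.request

PRIMARY_HEALTH = "https://primary.example.internal/health"

def primary_healthy(timeout: float = 3.0) -> bool:
    """Return True only if the primary's health endpoint answers HTTP 200."""
    try:
        with urllib.request.urlopen(PRIMARY_HEALTH, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def active_node() -> str:
    """Pick the node that should receive traffic."""
    return "primary" if primary_healthy() else "secondary"

print(f"Routing traffic to: {active_node()}")
```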
Recurrent outages continued to affect both the EU and NA regions through February 8th, prompting urgent responses from our technical teams. On February 8th at 9:17 AM PST, our investigations confirmed that the recurring disruptions in the IT Asset Management Platform were primarily caused by an influx of large Oracle Fusion Middleware data, leading to system overload and connectivity issues. The failure of a critical service responsible for writing data to storage compounded the problem, forcing the system onto an alternative write path that overwhelmed the infrastructure.
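To make that failure mode concrete: a bulk writer commits many records at once, while a record-at-a-time fallback pays per-record overhead, which multiplies under a large influx. The sketch below reproduces the pattern against an in-memory SQLite table; it is an analogy under stated assumptions, not the platform's actual write path.

```python
# Illustration of the failure mode, not Flexera's code: the bulk path commits
# once, while the fallback path commits per record, multiplying overhead
# under a large data influx.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fmw_inventory (id INTEGER, payload TEXT)")
rows = [(i, f"record-{i}") for i in range(10_000)]

def bulk_write(data):
    """Preferred path: one batched insert, one commit."""
    conn.executemany("INSERT INTO fmw_inventory VALUES (?, ?)", data)
    conn.commit()

def fallback_write(data):
    """Degraded path: one insert and one commit per record."""
    for row in data:
        conn.execute("INSERT INTO fmw_inventory VALUES (?, ?)", row)
        conn.commit()   # per-record commits are what overwhelm the backend

bulk_write(rows)  # fallback_write(rows) does the same work at far higher cost
```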
To address these issues comprehensively, our technical teams developed a hotfix designed to ensure sustained stability. This involved refining the data collection process, restoring the failed service, and removing excess data while preserving essential information. These steps, combined with temporary adjustments to data propagation, stabilized the system and restored functionality. The hotfix was successfully deployed on February 9th at 10:55 PM PST.
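The "remove excess data while preserving essential information" step can be pictured as a batched cleanup: deleting non-essential rows in small chunks so the purge does not itself add load. The sketch below uses SQLite with a hypothetical staging table and essential flag, not the platform's real schema.

```python
# Hedged sketch of a batched cleanup: purge non-essential rows a chunk at a
# time so the cleanup itself does not overload the database. The table name
# and the `essential` flag are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fmw_staging (id INTEGER, essential INTEGER)")
conn.executemany(
    "INSERT INTO fmw_staging VALUES (?, ?)",
    [(i, i % 10 == 0) for i in range(5_000)],
)
conn.commit()

BATCH = 500

def purge_excess() -> None:
    """Delete non-essential rows in batches, committing after each batch."""
    while True:
        cur = conn.execute(
            "DELETE FROM fmw_staging WHERE rowid IN "
            "(SELECT rowid FROM fmw_staging WHERE essential = 0 LIMIT ?)",
            (BATCH,),
        )
        conn.commit()
        if cur.rowcount == 0:   # nothing left to delete
            break

purge_excess()
```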
As a result of this change, a small subset of customers encountered incomplete Oracle Fusion Middleware data in their Oracle GLAS Audit Evidence file downloads during the transition period. Because of the complexity of the operation and the significant data processing involved, our technical teams expected the transition to extend over the following days. Once the hotfix was in place, a few additional days were also expected for Oracle Fusion Middleware data to be fully integrated into GLAS files as new inventory arrived.
On February 20th at 7:17 AM PST, we enabled the required services in both the EU and US regions, a key milestone in resolving the incident. Confirmation of successful implementation was received on February 21st at 6:07 AM PST, and the relevant tables in our data storage systems were verified to be populated as expected, confirming the full restoration of services and closing the incident.
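Verification of this kind usually reduces to row-count checks against the restored tables. The sketch below assumes hypothetical table names; the actual tables are not named in this report.

```python
# Hedged verification sketch: confirm the restored tables are receiving rows.
# Table names are hypothetical placeholders for the platform's real schema.
import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in for the production database
conn.execute("CREATE TABLE fmw_inventory (id INTEGER)")
conn.execute("CREATE TABLE glas_evidence (id INTEGER)")
conn.executemany("INSERT INTO fmw_inventory VALUES (?)", [(1,), (2,)])

for table in ("fmw_inventory", "glas_evidence"):
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    status = "OK" if count > 0 else "EMPTY - investigate"
    print(f"{table}: {count} rows ({status})")
```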
Root Cause
Primary Root Cause:
The recurring disruptions in our IT Asset Management Platform were primarily attributed to the significant influx of large Oracle Fusion Middleware (FMW) data, which led to system overload and connectivity issues.
Contributing Factors:
- Failure of a critical service responsible for writing data to storage, which forced the system onto an alternative write path that overwhelmed the infrastructure.
- Temporary mitigations (increased CPU count, suspended datasets, failover to the secondary node) restored service in the short term but did not address the underlying data volume, allowing the outages to recur.
Remediation Actions:
- Deployed a hotfix on February 9th at 10:55 PM PST that refined the data collection process, restored the failed storage-writing service, and removed excess data while preserving essential information.
- Temporarily adjusted data propagation to stabilize the system during the transition.
- Re-enabled the required services in both the EU and US regions on February 20th and verified on February 21st that the relevant data storage tables were populated as expected.
Future Preventative Measures