Flexera One - IT Asset Management - EU & NA - Stability Enhancements Resulting in Temporary Oracle Data Unavailability
Incident Report for Flexera System Status Dashboard
Postmortem

Description: Flexera One – IT Asset Management – EU & NA – Multiple UI Outages

Timeframe:

Region Start Time (PST) End Time (PST) Duration
EU February 6th, 1:27 AM February 6th, 3:17 AM 1 hour, 50 minutes
NA February 6th, 10:51 AM February 6th, 11:37 AM 46 minutes
EU February 6th, 11:22 PM February 7th, 12:40 AM 1 hour, 18 minutes
NA February 7th, 12:15 PM February 7th, 2:24 PM 2 hours, 9 minutes
EU February 7th, 11:55 PM February 8th, 12:29 AM 34 minutes

Incident Summary

We recently experienced recurring disruptions in our IT Asset Management Platform, impacting operations in both the EU and NA regions. These disruptions resulted in unexpected intermittent downtimes and errors for users attempting to access IT Asset Management services.

The initial major outage began on February 6th at 1:27 AM PST, affecting the EU region. Preliminary investigation revealed a complex problem involving recurrent system overload, which caused database connectivity and user interface failures. Although service was restored at 3:17 AM PST, the root cause remained unidentified. As a temporary measure, the team increased the CPU count to support more concurrent connections and temporarily suspended specific datasets to reduce load in both the US and EU production environments.

However, the disruption resurfaced later that day, this time impacting the NA region. To mitigate it, a failover to the secondary node was initiated at 11:37 AM PST. Despite these efforts, the root cause remained unclear, and investigations continued.

Recurring outages continued to affect both the EU and NA regions through February 8th, prompting urgent responses from our technical teams. On February 8th at 9:17 AM PST, our investigations confirmed that the disruptions were primarily caused by an influx of large Oracle Fusion Middleware data, which overloaded the system and caused connectivity issues. The failure of a critical service responsible for writing data to storage compounded the problem, forcing the system to fall back on an alternative method that overwhelmed the infrastructure.

To address these issues comprehensively, our technical teams devised a hotfix aimed at ensuring sustained stability. This involved refining the data collection process, restoring the failed service, and implementing measures to remove excess data while preserving essential information. These steps, combined with temporary adjustments to data propagation, helped stabilize the system and restore functionality. The solution was successfully deployed on February 9th at 10:55 PM PST.

As a result of this change, a small subset of customers encountered incomplete Oracle Fusion Middleware data in their Oracle GLAS Audit Evidence file downloads during the transition period. Because of the complexity of the operation and the volume of data to be processed, our technical teams expected the transition to take several days. Once the hotfix was deployed, a few additional days were projected for Oracle Fusion Middleware data to be fully reintegrated into GLAS files as new inventory arrived.

On February 20th at 7:17 AM PST, we enabled the required services in both EU and US regions, marking a crucial milestone in resolving the incident. Confirmation of successful implementation was received on February 21st at 6:07 AM PST, signaling the resolution of the incident. The relevant tables in our data storage systems were verified to be populated with data as expected, indicating the successful restoration of services.

Root Cause

Primary Root Cause:

The recurring disruptions in our IT Asset Management Platform were primarily attributed to the significant influx of large Oracle Fusion Middleware (FMW) data, which led to system overload and connectivity issues.

Contributing Factors:

  1. Failure of Critical Service: The failure of a critical service responsible for writing data to storage exacerbated the issue.
  2. System Overload: The large influx of FMW data led to system overload and connectivity issues.
  3. Alternative Data Writing Method: The system resorted to using an alternative method to write FMW data, overwhelming the infrastructure.
  4. Known Bug in Data Collection: A known bug in the data collection process contributed to the accumulation of excess data (see the Future Preventative Measures section for the solution).
  5. Unexpected Increase in Internal System Requests: An unexpected surge in internal system requests strained the system further, exacerbating stability issues.
  6. Impact on Subset of Customers: During the hotfix implementation, a small subset of customers experienced incomplete Oracle Fusion Middleware (FMW) data.

Remediation Actions:

  1. Increased CPU Count: The team increased the CPU count to support more concurrent connections and alleviate system strain.
  2. Temporary Suspension of Datasets: Specific datasets were temporarily suspended to improve operational efficiency in both US and EU production environments.
  3. Swift Failover Initiation: A failover to the secondary node was promptly initiated to mitigate disruptions in the NA region.
  4. Hotfix Development: A hotfix aimed at ensuring sustained stability was developed to address connectivity and operational challenges.
  5. Backup and Restore: Backups were conducted, and certain data were restored to stabilize the system.
  6. Monitoring and Termination: Disruptive jobs were monitored and terminated to minimize system strain.
  7. Enablement of Services: Necessary services were enabled in both the EU and US regions to restore functionality.

Future Preventative Measures

  1. Long-Term Solution Implementation: The proactive implementation of a long-term solution during the incident has led to sustained stability. Following its deployment, we have not encountered any recurrence of the problem, thereby ensuring the reliability of our services and infrastructure.
  2. A Known Bug Requiring Version Upgrade: Initially, a bug in the data collection process led to the accumulation of excess data. This issue primarily affected customers utilizing the data collection for Oracle Fusion Middleware (FMW). Although this bug was addressed and resolved with Agent Version 18.4 (FNMS 2022 R2), it has come to our attention that there was a subsequent bug. To mitigate potential issues, customers who have enabled FMW data collection should ensure they are using Agent Version 19.1 or higher. This updated version provides enhanced functionality and stability, thereby preventing further complications.
  3. Data Management Enhancement: To improve data management in ITAM, we've enacted changes to prevent data from being written to the database when Oracle microservices are unavailable. Robust error handling and notification protocols have been implemented to promptly address any potential issues.
  4. Enhanced Monitoring and Alerting: We aim to implement more robust monitoring systems to continuously track the health and performance of critical services, databases, and microservices. With comprehensive monitoring in place, potential issues or anomalies can be detected early, allowing for proactive intervention before they escalate into significant disruptions.
  5. Regular Software Updates and Patch Management: Instituting a robust software update and patch management strategy to ensure that all systems and software components are up to date with the latest security patches and bug fixes.
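The data-management enhancement above (item 3: refusing to write data when Oracle microservices are unavailable, backed by error handling and notifications) can be illustrated with a short sketch. This is a hypothetical Python example, not Flexera's implementation; the function names, the health-check callback, and the notification hook are all assumptions introduced for illustration.

```python
from typing import Callable


class MicroserviceUnavailable(Exception):
    """Raised when a required Oracle microservice fails its health check."""


def guarded_write(records: list[dict],
                  service_healthy: Callable[[], bool],
                  write: Callable[[list[dict]], None],
                  notify: Callable[[str], None]) -> bool:
    """Write FMW records to the database only if the Oracle microservice is up.

    Rather than falling back to an alternative write path (the behavior that
    overwhelmed the infrastructure during the incident), the batch is deferred
    and operators are notified so the import can be retried later.
    """
    if not service_healthy():
        # Notify first, then refuse the write instead of using a fallback path.
        notify("Oracle microservice unavailable; FMW write deferred")
        raise MicroserviceUnavailable(f"deferred batch of {len(records)} records")
    try:
        write(records)
        return True
    except Exception as exc:
        # Error handling + notification so failures surface promptly.
        notify(f"FMW write failed: {exc}")
        raise
```

The key design choice is that an unhealthy dependency causes the batch to be deferred loudly rather than silently rerouted, which is the failure mode that compounded this incident.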
Posted Mar 18, 2024 - 20:26 PDT

Resolved
The import has been activated in North America, and the reconciliation process completed successfully. Our technical team has confirmed that the relevant Oracle Fusion data is now accessible in GLAS files again.

Additionally, the reconciliation process has been successful in the EU, with positive trends observed. The incident has been resolved. We apologize for any inconvenience that recent ITAM outages may have caused. We are fully committed to stability and will provide a comprehensive customer report in the coming days.
Posted Feb 21, 2024 - 06:45 PST
Update
The Oracle Fusion data import was successfully enabled in the EU last night, with no reported issues, and our monitoring indicates that everything is proceeding as expected. Following the successful implementation in the EU, the team has begun enabling the import in NA as well.
Posted Feb 20, 2024 - 06:34 PST
Update
The hotfix is ready to enable Oracle Fusion data import in the US and EU regions. We are preparing to deploy it and will keep you updated.
Posted Feb 19, 2024 - 13:00 PST
Update
The deployment process is proceeding as anticipated. We still plan to enable the data for the EU and NA tomorrow. We will provide a further update tomorrow morning.
Posted Feb 18, 2024 - 09:23 PST
Update
The pending tasks for the EU are nearing completion, but the remaining tasks for NA may extend into this weekend. Technical teams are continuously monitoring, and the current plan is to re-enable the data for both the EU and NA simultaneously once all tasks are finished.
Posted Feb 16, 2024 - 09:51 PST
Update
Based on the latest update, pending tasks in the EU region are nearing completion, allowing for the potential re-enabling of the Oracle Fusion data import in the near term. However, in NA, the completion timeline may exceed our initial estimates, leading to a delay in reactivating the import process. We continue to closely monitor the deployment progress and appreciate your understanding.
Posted Feb 15, 2024 - 07:54 PST
Update
We continue to observe positive developments. We are awaiting the completion of certain processes before we can proceed with re-enabling the Oracle Fusion data import process.
Posted Feb 14, 2024 - 09:10 PST
Update
The deployment has been proceeding smoothly, and we have not encountered any additional interruptions affecting the IT Asset Management UI. However, we expect this transition to require a few more days to finalize. The UI remains fully operational, with all functionalities and datasets accessible, except for Oracle GLAS File Evidence Downloads; this limitation affects only a specific subset of customers.
Posted Feb 13, 2024 - 07:50 PST
Update
We are currently awaiting the completion of certain tasks. Following their completion, we will monitor for new inventory to arrive before reactivating the capability to make Oracle Fusion data available in GLAS files. This cautious approach aims to ensure a smooth reconciliation process without any disruptions. We appreciate your patience and will provide updates as the situation progresses.
Posted Feb 12, 2024 - 08:02 PST
Update
The change is ongoing and proceeding as anticipated. We are continuing to monitor its progress closely.
Posted Feb 10, 2024 - 22:36 PST
Update
The change is progressing as planned. We are actively monitoring the environment and will continue to do so throughout the weekend. We will keep you informed of any further developments as they unfold. Thank you for your patience.
Posted Feb 09, 2024 - 19:05 PST
Monitoring
Incident Description: We have recently experienced multiple outages that intermittently impacted our IT Asset Management platform, affecting customers' ability to access the UI in both the EU and NA regions. To address this, our technical teams developed a long-term solution to prevent future incidents. Following successful testing, these changes have been rolled out to our production environments across both regions. The deployment process is expected to take a few days, with continuous monitoring by our technical teams to ensure successful completion.

During this change, some customers may experience missing data from Oracle GLAS File Evidence Downloads, affecting a specific subset of users in both EU and NA environments. However, the user interface will remain accessible without impacting other functionalities or datasets.

Priority: P2

Current Status (In Progress): Our technical teams are actively monitoring the environment to ensure the successful implementation of the change. Updates will be provided as developments occur.
Posted Feb 09, 2024 - 12:18 PST
This incident affected: Flexera One - IT Asset Management - Europe (IT Asset Management - EU Batch Processing System) and Flexera One - IT Asset Management - North America (IT Asset Management - US Batch Processing System).