Flexera One – IT Visibility – EU – Inventory Data Processing has been Delayed
Incident Report for Flexera System Status Dashboard
Postmortem

Description: Flexera One – IT Visibility – EU – Inventory Data Processing was Delayed

Timeframe: May 12th, 12:20 PM PDT to May 19th, 5:48 AM PDT

Incident Summary

On May 12th, at 12:20 PM PDT, the IT Visibility platform in the EU region experienced an issue that caused delays in the availability of inventory data, affecting customers' ability to access the latest information promptly. The delay was caused by a backlog in processing, resulting in the displayed inventory data being a few hours outdated. It should be noted that the queued data was retained and became accessible once normal processing resumed.

As a first step, our technical staff restarted the affected services, but this did not resolve the problem. We continued to troubleshoot and apply measures to alleviate the issue, and on May 16th, after receiving reports from multiple customers, we fully understood the extent of the impact.

On May 16th, at 5:56 PM PDT, after a thorough investigation, our technical staff identified a version mismatch: the latest version of a specific component differed from the versions used by other services. Because these services depend on each other, the inconsistency caused problems across them.

Further investigation revealed that one of our node instances had been recycled, which is a regular operation to maintain system security and updates. However, the new node had to download the latest version of our component because it did not have the required information stored. This latest version differed from the version stored on the other nodes, and the unexpected difference caused compatibility problems across our system.
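The kind of mismatch described above can be caught with a simple comparison of component versions across nodes. The following is a minimal sketch only, not the tooling used during this incident; the node names, version strings, and the `find_version_drift` helper are hypothetical and chosen purely for illustration.

```python
from collections import Counter
from typing import Dict


def find_version_drift(node_versions: Dict[str, str]) -> Dict[str, str]:
    """Return the nodes whose component version differs from the majority version."""
    if not node_versions:
        return {}
    # The most common version is treated as the expected one.
    majority_version, _ = Counter(node_versions.values()).most_common(1)[0]
    return {
        node: version
        for node, version in node_versions.items()
        if version != majority_version
    }


# Hypothetical inventory: a recycled node has pulled a newer version.
versions = {"node-a": "1.4.2", "node-b": "1.4.2", "node-c": "1.5.0"}
drift = find_version_drift(versions)
print(drift)  # {'node-c': '1.5.0'}
```

Running a check like this after a node is recycled would flag the new node before interdependent services start exchanging data.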

To promptly address the issue and alleviate any inconvenience experienced by our customers, we took immediate action on May 17th at 4:47 PM PDT. As a short-term measure, we focused on resolving the bug within one of our components in the EU region. Specifically, we implemented an update that ensured PROD-EU used the version that was the latest a few weeks earlier. This step proved pivotal in resolving the problem and restoring normal operations.

Once the underlying version issue was resolved, we encountered a higher-than-expected system load following the system restart. To handle this carefully, our technical team decided to bring up services one at a time, which made the process take longer than anticipated. We continued to monitor the environment overnight. Despite some progress, on May 18th we encountered a subsequent period of instability. Our team promptly resolved the remaining issues, allowing us to resume processing the remainder of the backlog.

We continued to monitor the environment for another day to ensure successful processing. On May 19th, at 5:48 AM PDT, technical staff confirmed that the backlog had been successfully processed and that we were processing data in real time, after which the incident was considered resolved.

Root Cause

The root cause was a version mismatch: the latest version of a specific component was inconsistent with the versions used by other services. Because these services are interdependent, the inconsistency led to delays in the availability of inventory data.

This can be attributed to an unexpected upgrade of one of the systems in the EU region, which contributed to the observed discrepancy.

Remediation Steps

  1. Issue identification: A thorough investigation identified the root cause of the problem: a discrepancy between different versions of a specific component due to an unexpected system upgrade.
  2. Interim bug fix: As a short-term measure, we implemented an immediate fix for the bug in the affected component within the EU region. This temporary solution allowed us to resolve the problem promptly and mitigate its impact on our customers.
  3. Careful service restoration: The technical team brought up services one at a time to handle the higher-than-expected system load following the system restart. This approach allowed for careful monitoring and ensured stability during the restoration process.
  4. Continuous monitoring: The environment was continuously monitored following the service restoration to identify any remaining issues or periods of instability. Any problems that arose were promptly resolved.
  5. Backlog processing: The team focused on processing the backlog of data once the underlying version issue was resolved. Progress was closely monitored to ensure the successful processing of all queued data.
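
The one-at-a-time restoration in step 3 can be illustrated with a short sketch. This is an assumption-laden example rather than the procedure our team actually ran: the service names, the `start` and `is_healthy` callbacks, and the `start_services_sequentially` helper are all hypothetical.

```python
from typing import Callable, Iterable, List


def start_services_sequentially(
    services: Iterable[str],
    start: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Start services one at a time, stopping at the first failed health check."""
    started: List[str] = []
    for name in services:
        start(name)
        # Do not move on to the next service until this one is healthy.
        if not is_healthy(name):
            raise RuntimeError(
                f"{name} failed its health check; started so far: {started}"
            )
        started.append(name)
    return started


# Hypothetical stand-ins for a real orchestrator and health probe.
running = set()
order = start_services_sequentially(
    ["ingest", "normalize", "publish"],
    start=running.add,
    is_healthy=lambda name: name in running,
)
print(order)  # ['ingest', 'normalize', 'publish']
```

Starting services in sequence trades speed for control: load rises in steps, and a failing service is isolated before the next one adds to it.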

Future Preventative Measures

• Long-term solution and version control: The root cause has been identified, and our technical staff has devised a plan for a long-term solution: using a specific, fixed version of the component that will not be automatically upgraded during routine operations in the production environment. This will ensure system stability and avoid potential compatibility issues.
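
As an illustration of the idea behind this measure, the sketch below shows how a newly provisioned node could choose a fixed version instead of whatever is latest. It is a simplified assumption of how such a rule might look, not a description of the actual implementation; `ComponentSpec`, `version_to_install`, the component name, and the version numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ComponentSpec:
    """Hypothetical description of a component and its optional fixed version."""
    name: str
    pinned_version: Optional[str] = None


def version_to_install(spec: ComponentSpec, latest_available: str) -> str:
    """Prefer the fixed version; use the latest only when no version is fixed."""
    if spec.pinned_version is not None:
        return spec.pinned_version
    return latest_available


# A recycled node resolves the same version as its peers, even though
# a newer release is available upstream.
spec = ComponentSpec(name="example-component", pinned_version="1.4.2")
print(version_to_install(spec, latest_available="1.5.0"))  # 1.4.2
```

With a fixed version, upgrades become a deliberate change applied to all nodes together rather than a side effect of node recycling.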

Posted Jul 19, 2023 - 10:52 PDT

Resolved
The backlog has been successfully cleared, and we have now resumed processing data in real-time. The incident has been resolved.
Posted May 19, 2023 - 06:17 PDT
Update
We are working on the backlog and expect to complete it in a few hours. The next update will be later today.
Posted May 19, 2023 - 03:51 PDT
Monitoring
The processing of the backlog is currently underway and is projected to require a few more hours to finish. We will provide the subsequent update later today.
Posted May 18, 2023 - 11:10 PDT
Update
Partial progress was made in clearing the backlog; however, we encountered a subsequent period of instability. The system has regained stability, and we are preparing to recommence the processing of the remaining backlog.
Posted May 18, 2023 - 07:06 PDT
Update
Our technical team has resolved the underlying issue, but we encountered a higher system load during the overall system restart. As a precautionary measure, we are gradually bringing up services one at a time, resulting in a longer-than-expected process. We apologize for any inconvenience caused and assure you that we are working diligently to restore normal operations as soon as possible.
Posted May 17, 2023 - 16:59 PDT
Identified
Our technical team has discovered a potential issue with a vital component that is currently causing disruptions. Our team is diligently working to address and resolve the issue, and we will keep you informed of any developments as we make progress toward a solution.
Posted May 17, 2023 - 12:13 PDT
Investigating
Incident Description: Our IT Visibility platform is currently facing an issue that is causing delays in the visibility of inventory data. Customers may experience a delay in accessing the most up-to-date inventory information due to a processing backlog. As a result, the displayed inventory data may be a few hours behind.

Please note that the queued data has been retained and will become accessible once normal processing resumes.

Priority: P2

Restoration Activity: We apologize for the inconvenience caused. Our technical teams are actively engaged in addressing and resolving the issue.
Posted May 17, 2023 - 09:28 PDT
This incident affected: Flexera One - IT Visibility - Europe (IT Visibility EU).