Description: Flexera One – IT Visibility – EU – Inventory Data Processing was Delayed
Timeframe: May 12th, 12:20 PM PDT to May 19th, 5:48 AM PDT
Incident Summary
On May 12th, at 12:20 PM PDT, the IT Visibility platform in the EU region experienced an issue that delayed the availability of inventory data, so customers could not promptly access the latest information. The delay was caused by a processing backlog, which left the displayed inventory data a few hours out of date. The queued data was retained and became accessible once normal processing resumed.
As an initial attempt to resolve the problem, our technical staff restarted the affected services, but this did not resolve it. We continued to troubleshoot and implement mitigating measures. On May 16th, after receiving reports from multiple customers, we understood the full extent of the impact.
On May 16th, at 5:56 PM PDT, after a thorough investigation, our technical staff identified a version mismatch: the latest version of a specific component did not match the versions used by other services. Because these services depend on each other, the inconsistency caused problems.
Further investigation revealed that one of our node instances had been recycled, a regular operation that keeps the system secure and up to date. The new node did not have the required information stored, so it downloaded the latest version of our component. That version differed from the one stored on the other nodes, and this unexpected difference caused compatibility problems across our system.
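The kind of mismatch described above can be illustrated with a minimal sketch. It is not the tooling used during the incident; the node names, version strings, and the `find_version_drift` helper are hypothetical, and the sketch assumes the version most nodes report is the expected one.

```python
from collections import Counter


def find_version_drift(node_versions):
    """Return the nodes whose component version differs from the most common one.

    node_versions maps a node name to the component version it reports.
    Assumption: the version reported by most nodes is the expected one.
    """
    if not node_versions:
        return {}
    expected, _count = Counter(node_versions.values()).most_common(1)[0]
    return {
        node: version
        for node, version in node_versions.items()
        if version != expected
    }


# Hypothetical inventory: node-c was recycled and fetched a newer version.
reported = {"node-a": "1.4.2", "node-b": "1.4.2", "node-c": "1.5.0"}
print(find_version_drift(reported))
```

A check of this kind, run after a node is recycled, would flag the new node before dependent services start exchanging data with it.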
To address the issue and limit further inconvenience to our customers, we took action on May 17th at 4:47 PM PDT. As a short-term measure, we focused on resolving the bug within one of our components in the EU region. Specifically, we implemented an update that ensured PROD-EU used the version that had been the latest a few weeks earlier. This step proved pivotal in resolving the problem and restoring normal operations.
Once the underlying version issue was resolved, we encountered a higher-than-expected system load following the system restart. To handle this carefully, our technical team decided to bring up services one at a time, which made the process longer than anticipated. We continued to monitor the environment overnight. Despite some progress, we encountered a further period of instability on May 18th. Our team promptly resolved the remaining issues, which allowed us to resume processing the rest of the backlog.
We continued to monitor the environment for another day to ensure successful processing. On May 19th, at 5:48 AM PDT, technical staff confirmed that the backlog had been fully processed and that data was being processed in real time, after which the incident was considered resolved.
Root Cause
The root cause was a version mismatch in a specific component. The latest version of the component was inconsistent with the versions used by other services that depend on it, which led to delays in the availability of inventory data.
The mismatch can be attributed to an unexpected upgrade in one of the systems in the EU region.
Remediation Steps
Future Preventative Measures
• Long-term solution and version control: With the root cause identified, our technical staff has devised a plan for a long-term solution. It involves using a specific version of the component that will not be automatically upgraded during routine operations in the production environment. This is intended to ensure system stability and avoid potential compatibility issues.
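As an illustration of the version-pinning idea, the following is a minimal sketch rather than the actual implementation. The `resolve_component_version` function, the version strings, and the rule that environment names starting with "PROD" must be pinned are all assumptions made for this example.

```python
def resolve_component_version(environment, pinned_versions, latest_version):
    """Choose the component version a newly created node should install.

    pinned_versions maps an environment name to an explicitly pinned version.
    Assumption: production environments (names starting with "PROD") must
    never float to the latest version, so a missing pin is an error there.
    """
    pinned = pinned_versions.get(environment)
    if pinned is not None:
        return pinned
    if environment.startswith("PROD"):
        raise ValueError(f"{environment} has no pinned component version")
    return latest_version


# Hypothetical pins and versions, for illustration only.
pins = {"PROD-EU": "1.4.2"}
print(resolve_component_version("PROD-EU", pins, latest_version="1.5.0"))
print(resolve_component_version("STAGING", pins, latest_version="1.5.0"))
```

With an explicit pin, a recycled production node installs the same version as its peers, while non-production environments can still pick up the latest version for testing.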