FlexNet Manager Suite Cloud - EU - Batch Processing and Business Reporting are offline
Incident Report for Flexera System Status Dashboard
Postmortem

Description:

Batch Processing & Business Reporting were unavailable for 4 hours and 50 minutes, delaying Customer batch jobs and preventing Customer reports from being run.

Timeframe:

October 1st, 9:30pm CEST to October 2nd, 4:20am CEST

Incident Summary

On October 2 @ 12:30am CEST, the start of the FlexNet Manager Suite 2020 R1.2 Production Deployment was delayed because an Inventory Database process was still running. Technical teams were unable to proceed while this process was in a rollback state. As a result, the maintenance window was extended by 2.5 hours.

Once the Production Deployment was finished, technical teams performing environment health checks found that Business Reporting was still offline. Technical teams also received automated failure emails advising that the automated deployment process had missed updating a number of components. As a result, batch processing error rates exceeded normal thresholds, triggering alert notifications. The Beacon policy update also failed due to a certificate issue. Technical teams found that importing the updated private key had failed; this was corrected by reverting to a method to update the certificate.

As a result of the issues found during health checks, Batch Processing was disabled while teams investigated. Technical teams found a deployment automation issue, which was fixed; the updated components were then re-deployed successfully and Batch Processing was re-enabled.

Technical teams then discovered that around 20% of Customer batch jobs containing Citrix evidence were failing. The SQL responsible for the failures was identified. Technical teams analysed the Customer Citrix evidence triggering the batch failures and updated the SQL to address the unanticipated Customer use cases. This was implemented as a hot-fix and successfully deployed.

After health checks were completed, services were confirmed restored on Friday, October 2 @ 4:20am CEST.

Root Cause

  • Automation failures - a latent issue had previously been corrected downstream within the automation scripts. After the move to the new system, recent changes to the automation scripts no longer corrected for this pre-existing issue as the old system had done. These issues did not present themselves in UAT due to environmental differences.
  • Citrix code changes - Engineering found Batch writer failures due to a violation of a unique key constraint. The SQL that populates a table had been updated; however, the assumption made about the shape of the data proved to be incorrect. Citrix is deployed in sites (e.g. per country) and each site has a unique site name. The SQL was trying to bring in the site name; however, some site names came back as null because some Citrix server agents were being used in a manner not anticipated.
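
As an illustration only, the Citrix failure mode can be sketched as a pre-insert check that flags rows whose key is null or repeated. This is not the actual FlexNet Manager Suite SQL or schema: the function name, the field names (server, site_name) and the sample rows are hypothetical. It also assumes a database that treats null keys as equal under a unique constraint, as SQL Server does.

```python
from collections import Counter


def find_key_problems(rows, key_fields):
    """Return (rows_with_null_key, duplicated_keys) for the given key fields.

    None is treated as an ordinary value when counting duplicates, which
    mirrors databases that allow at most one NULL under a unique constraint
    (an assumption here, not something stated in the incident report).
    """
    null_rows = [r for r in rows if any(r.get(f) is None for f in key_fields)]
    counts = Counter(tuple(r.get(f) for f in key_fields) for r in rows)
    duplicates = sorted((k for k, n in counts.items() if n > 1), key=repr)
    return null_rows, duplicates


# Hypothetical evidence rows; names and values are illustrative only.
rows = [
    {"server": "ctx-01", "site_name": "DE-Site"},
    {"server": "ctx-02", "site_name": "FR-Site"},
    {"server": "ctx-03", "site_name": None},  # agent reported no site name
    {"server": "ctx-04", "site_name": None},  # second null collides with the first
]

null_rows, duplicates = find_key_problems(rows, ["site_name"])
print("rows with null key:", [r["server"] for r in null_rows])
print("duplicated keys:", duplicates)
```

Running a check of this kind against representative Customer data before a deployment is one way an incorrect assumption about the shape of the data might surface earlier.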

Corrective Action

  • No follow-up actions are required because the root causes were found and remediated during Incident restoration activities.
Posted Oct 15, 2020 - 22:35 PDT

Resolved
This incident has been resolved.
Posted Oct 01, 2020 - 19:19 PDT
Update
We are continuing to monitor for any further issues.
Posted Oct 01, 2020 - 17:35 PDT
Monitoring
Remediation activities have been completed successfully - Batch Processing and Business Reporting have now been restored to full functionality.
Posted Oct 01, 2020 - 17:34 PDT
Update
Due to the additional issues found, Batch Processing has been paused. This will result in delays to batch processing.
Posted Oct 01, 2020 - 16:46 PDT
Identified
Batch Processing remediation activities have been completed successfully - Batch Processing has now been restored to full functionality.

Technical teams are continuing to investigate issues with Business Reporting.
Posted Oct 01, 2020 - 16:20 PDT
Update
We are continuing to investigate this issue.
Posted Oct 01, 2020 - 15:45 PDT
Investigating
Incident Description:
Due to issues uncovered during the recent maintenance activity, Batch Processing and Business Reporting have been taken offline. This will result in delays to batch processing.

Priority: 2

Restoration activity:
Technical teams have been engaged and are currently investigating.

Next Update: Wednesday, March 20 @ 3am CEST
Posted Oct 01, 2020 - 15:43 PDT
This incident affected: Flexera One - IT Asset Management - Europe (IT Asset Management - EU Batch Processing System, IT Asset Management - EU Business Reporting).