Flexera System Status Dashboard Status - Flexera One - ITAM EU - Customers may have experienced Reconciliation delays

Flexera One - ITAM EU - Customers may have experienced Reconciliation delays

Incident Report for Flexera System Status Dashboard

Postmortem

Description:

EU Customers may have experienced Reconciliation delays or Failures

Timeframe:

March 4th @ 5:00pm UTC to March 9th @ 1:38am UTC

Incident Summary

As part of database optimization work during the week leading up to the 4th of March, a new database instance on new disks with higher IOPS capacity was built and tested. On March 4th, ITAM EU was migrated to the new database instance. Health checks found the database was operating normally after the change. Technical staff were alerted to high inventory levels after the change; however, this is expected behavior after a change due to Inventory queues being paused. Technical Staff also found a stuck Library update job which was corrected successfully. No other major issues were found by health checks post change.

On March 7th AM UTC, Support staff were alerted to long running reconciliation jobs by customers. Support then escalated the issue to technical staff. Investigations confirmed that reconciles where running extremely long for some customers and failing for others.

Investigations found the new database disks were under-performing and unable to keep up with database read/write requests. Technical Staff immediately tried to increase disk IOPS capacity but were unable to due to the disk type limitations.

Technical staff requested assistance from AWS Support to correct the disk limitations, however, were told but that wasn’t possible. Technical Staff investigated failing back to the original database however, this was not possible as servers as both database servers had been migrated and were unable to be rolled back.

To resolve the disk limitation issue, technical staff provisioned new high-speed disks from a snapshot of the existing database servers. It took around 8 hours to create the snapshot, and then another 15 hours to complete a database resync. Once the sync was complete, technical staff were able to fail over to the upgraded node. During this resync, Batch Processing for all EU customers was paused to accelerate restoration efforts.

Restoration efforts were completed on March 8th at 11:30pm UTC, as a result reconciliation performance returned to expected levels and customer reconciles were then able to complete successfully.

After monitoring for several hours, the MI was declared resolved @ March 9th @ 1:38am UTC.

‌

Root Cause:

Primary Root Cause:

When AWS disk volumes of the type used for the ITAM Databases are created in sizes bigger than 16TB, changes to IOPS limits aren’t possible after initial creation. As a result, the IOPS limits changes applied during the Change failed. When normal weekday loads resumed, the database disks were unable to keep up with the read/write demands and reconciliation jobs were negatively impacted.

Contributing Causes:

· When the 20TB disk volume for the first of the upgraded Database nodes was created, it was created with a minimal IOPS capability as it was believed we could upgrade the IOPS later when it was needed for production loads.

The changes were tested with a smaller version of new Disk instance type in a test environment – however it was found after the migration that disks bigger than 16TB can’t be modified after creation. This limitation was not documented by AWS – as a result the changes failed to be applied.

· The second Database server should not have been migrated until the first new instance had been proved to be working under normal load conditions – this was done as it was considered non-optimal to run different spec DB servers at the same time. Keeping the original instance would have resulted in a faster restoration process should that have been required.

· The Change should not have been scheduled for a Friday – a Monday change would have allowed us to monitor the new instance under normal loads and significantly reduced the impact duration experienced by customers.

‌

Corrective Action

· Technical teams will investigate improving Batch Job Performance alerting.

· Technical teams have committed to cease implementing major changes on Fridays.

· Technical teams have committed to stagger future multi-server migrations to reduce the risk of customers being impacted by a failed change.

Posted Mar 17, 2022 - 20:29 PDT

Resolved

This incident has been resolved.

Posted Mar 08, 2022 - 17:41 PST

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Mar 08, 2022 - 15:28 PST

Update