Possible performance issue on the EU deployment.
Incident Report for uniFLOW Online
Postmortem

User Impact

The storage resource required for providing device-specific print data showed heavily increased latency.

Scope of Impact

This incident impacted the EU deployment. The impact was mainly felt by users scanning and utilising the cloud print architecture.

Incident Start Date and Time

· March 7th, 2022 – 11:00 UTC

Incident End Date and Time

· March 8th, 2022 – 17:00 UTC

Root Cause

The issue was found to be increased access latency on the Azure storage that holds print jobs entering uniFLOW via Email, File Upload, Mobile App, Microsoft Universal Print, Chrome Extension, or uniFLOW SmartClient with uniFLOW Online configured as the Spool Storage destination. The latency was caused by an excessive number of requests, which overran the access limits defined for the storage account.
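As an illustration of how a client can stay under a per-account request limit, the following is a minimal token-bucket sketch. It is not uniFLOW Online code: the class name `TokenBucket`, its parameters, and the rates used are assumptions made for this example, and the actual limits of the storage account are not stated in this report.

```python
import time


class TokenBucket:
    """Client-side request budget (hypothetical helper, not uniFLOW code)."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = float(rate_per_sec)  # tokens refilled per second (assumed value)
        self.capacity = float(capacity)  # maximum burst size (assumed value)
        self.tokens = float(capacity)    # start with a full bucket
        self.clock = clock               # injectable clock, useful for testing
        self.last = clock()

    def try_acquire(self, cost=1.0):
        """Return True if a request may be sent now, False if it should wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller would check `try_acquire()` before each storage request and delay or queue the request when it returns `False`, so that bursts are smoothed out before they reach the storage limit.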

Incident Details

Monday morning March 7th

The latency on the storage jumped to a value that caused extensive delays to print data delivery. The affected print job types mentioned above could no longer be processed in a reasonable timeframe. In addition, scan jobs performed during this time saw increased delays but should still have been delivered.

Measures were taken to limit print processing and so avoid saturating resources. As a result, printing was possible again; however, the time between requesting a print at the device and the device starting to print increased and took some time to stabilize.
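One common way to limit processing in this manner is to cap the number of jobs that run at the same time. The sketch below shows the idea with a semaphore; the report does not describe the mechanism actually used, so `BoundedProcessor` and `max_in_flight` are hypothetical names chosen for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BoundedProcessor:
    """Caps concurrently running jobs (hypothetical helper, not uniFLOW code)."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._lock = threading.Lock()
        self.in_flight = 0  # jobs currently running
        self.peak = 0       # highest concurrency observed so far

    def run(self, job, *args):
        # Blocks while max_in_flight jobs are already running.
        with self._slots:
            with self._lock:
                self.in_flight += 1
                self.peak = max(self.peak, self.in_flight)
            try:
                return job(*args)
            finally:
                with self._lock:
                    self.in_flight -= 1
```

The trade-off matches the behaviour described above: jobs still complete, but each one may wait for a free slot, so the time until a job starts increases while the cap is in force.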

By mid-afternoon the latency dropped into normal boundaries and the system was operational with only minor delays by early evening.

Tuesday morning March 8th

Field reports and metrics showed that, despite the measures taken, the problem started to reappear the following day. A visible slowdown in printing resulted in another period where printing was delayed or unsuccessful.

Measures were taken to separate the Azure storage used for providing the device-specific print data necessary for formatting and output job settings. These changes were reviewed before being moved into production shortly after midday.

With this action the root cause of the problem was resolved, and the delivery of device-specific print data was restored to its pre-incident speed. However, emergency measures were left in place to avoid any saturation of resources.

By Tuesday evening uniFLOW Online was largely back to normal operational parameters. With the configuration changes to our storage and the mitigations in place, we closely monitored the situation.

Next Steps

We apologize for the impact to affected customers. We are continuously taking steps to improve the uniFLOW Online Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Additional metrics and alerting were put in place.
  • A review of other deployments was conducted, and the changes and lessons learnt from this incident will be deployed into other regions.
  • Data access and storage account utilization were reviewed, and improvement actions were scheduled within development.
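To illustrate the kind of latency alerting mentioned in the first item, the sketch below keeps a rolling window of storage latency samples and flags a breach when a chosen percentile exceeds a threshold. The class name `LatencyAlert`, the threshold, the window size, and the percentile are all assumptions for this example; the report does not specify which metrics or thresholds were added.

```python
import math
from collections import deque


class LatencyAlert:
    """Rolling-window percentile check (hypothetical helper, not uniFLOW code)."""

    def __init__(self, threshold_ms, window=100, quantile=0.95):
        self.threshold_ms = threshold_ms
        self.quantile = quantile
        self.samples = deque(maxlen=window)  # oldest samples drop out automatically

    def percentile(self):
        """Nearest-rank percentile of the current window, or None if empty."""
        if not self.samples:
            return None
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(self.quantile * len(ordered)))
        return ordered[rank - 1]

    def record(self, latency_ms):
        """Add a sample and return True if the alert condition is met."""
        self.samples.append(latency_ms)
        return self.percentile() > self.threshold_ms
```

Alerting on a percentile over a window, and not on single samples, avoids firing on one slow request while still reacting when latency rises across the board.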

Note: this Postmortem covers both the March 7th and March 8th incidents, as they are a continuation of the same issue.

Posted Mar 18, 2022 - 12:15 UTC

Resolved
Hello Everyone,

We have confirmation our services are back to normal. Operations will review the telemetry and logging information captured during this incident to improve our system.

Update 8/3/2022:
Post resolution, we found that issues were still being reported as the system stabilized. Mitigation controls were in place, but they had been set while the system was under 'recovery' conditions. Further delays were observed this morning, and we needed to adjust yesterday's settings now that we are under normal running conditions. We are watching the system closely while we bring uniFLOW Online back to optimal.

Kind Regards
uniFLOW Online Operations Team.
Posted Mar 07, 2022 - 13:55 UTC
Update
Hello Everyone,
We are seeing a marked improvement across the EU deployment. Operations will continue to monitor and will provide further updates when the situation is completely resolved.
Posted Mar 07, 2022 - 13:28 UTC
Monitoring
Incident details

Identified:
7 Mar 2022, 11:30 UTC

Incident Scope:
This has only been detected on the EU deployment.

Description:
Our telemetry has alerted us to a performance issue in Scan and Print processing of jobs. This has been tested and confirmed, and mitigation steps have been taken to return the system to normal. We are now monitoring the situation closely and will update here when resolved.

Next Update:
As and when the situation changes.
Posted Mar 07, 2022 - 13:02 UTC
This incident affected: EU Deployment (Printing, Email print, Scanning).