Processing delays for cloud print and scan (EU)

Incident Report for uniFLOW Online

Postmortem

User Impact

The storage resource that provides device-specific print data showed heavily increased latency.

Scope of Impact

This incident impacted the EU deployment. The impact was mainly felt by users scanning and utilising the cloud print architecture.

Incident Start Date and Time

· March 7th, 2022 – 11:00 UTC

Incident End Date and Time

· March 8th, 2022 - 17:00 UTC

Root Cause

The issue was found to be increased access latency on the Azure storage that holds print jobs submitted to uniFLOW via Email, File Upload, Mobile App, Microsoft Universal Print, Chrome Extension or uniFLOW SmartClient when uniFLOW Online is configured as the Spool Storage destination. The latency was caused by an excessive number of requests over-utilising the defined access limits of the storage account.
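To illustrate why a request rate above a storage limit shows up as steadily growing latency, here is a minimal sketch. It is a simplified queue model, not a description of the actual platform: the function name `simulate_backlog` and all rates are hypothetical, they are not Azure's real limits, and real throttling behaviour is more complex than this.

```python
def simulate_backlog(arrivals_per_sec, limit_per_sec, seconds):
    """Return the number of queued requests at the end of each second.

    Hypothetical model: requests arrive at a fixed rate and the storage
    serves at most `limit_per_sec` of them per second; the rest wait.
    """
    backlog = 0
    history = []
    for _ in range(seconds):
        backlog += arrivals_per_sec            # new requests join the queue
        served = min(backlog, limit_per_sec)   # storage serves up to its limit
        backlog -= served
        history.append(backlog)
    return history


def estimated_wait_seconds(backlog, limit_per_sec):
    """Rough time a newly arriving request waits behind the current backlog."""
    return backlog / limit_per_sec
```

In this model, a load below the limit never builds a backlog, while a load only slightly above it makes the wait grow every second until demand drops or capacity is added.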

Incident Details

Monday morning March 7th

The latency on the storage jumped to a value that caused extensive delays to print data delivery. The affected print job types mentioned above could no longer be processed in a reasonable timeframe. In addition, scan jobs performed during this time saw increased delays but should still have been delivered.

Measures were taken to limit print processing in order to avoid saturating resources. As a result, printing was possible again; however, the time between requesting a print at the device and the device starting to print was longer than usual and took some time to stabilize.
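The report does not say how print processing was limited. One common way to cap load on a shared resource is a concurrency limiter, sketched below under that assumption. `ConcurrencyLimiter`, `fetch_print_data` and the limit of 2 are hypothetical names and values chosen for illustration only.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyLimiter:
    """Allow at most `max_in_flight` calls to run at the same time."""

    def __init__(self, max_in_flight):
        self._sem = threading.BoundedSemaphore(max_in_flight)
        self._lock = threading.Lock()
        self._in_flight = 0
        self.peak = 0  # highest concurrency observed, for verification

    def run(self, fn, *args):
        with self._sem:  # blocks while the limit is reached
            with self._lock:
                self._in_flight += 1
                self.peak = max(self.peak, self._in_flight)
            try:
                return fn(*args)
            finally:
                with self._lock:
                    self._in_flight -= 1


def fetch_print_data(job_id):
    # Hypothetical stand-in for a read from the storage backend.
    time.sleep(0.01)
    return f"data-for-{job_id}"


limiter = ConcurrencyLimiter(max_in_flight=2)
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda j: limiter.run(fetch_print_data, j), range(10)))
```

The trade-off matches what users saw: every job still completes, but jobs queue behind the limit, so the time until a device starts printing increases.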

By mid-afternoon the latency dropped into normal boundaries and the system was operational with only minor delays by early evening.

Tuesday morning March 8th

Field reports and metrics showed that the problem had started to reappear despite the measures taken the previous day. A visible slowdown in printing resulted in another period during which printing was delayed or unsuccessful.

Measures were taken to separate the Azure storage used for providing the device-specific print data necessary for formatting and output job settings. These changes were reviewed before being moved into production shortly after midday.
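As a rough illustration of this kind of separation, the sketch below routes each kind of data to its own storage target so that heavy traffic for one kind does not consume the request budget of the other. It assumes that device-specific print data was split from spool storage, which the report implies but does not state outright. The account names, container names and the `target_for` helper are hypothetical and do not reflect the real configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageTarget:
    account: str
    container: str


# Hypothetical routing table: one storage target per kind of data.
ROUTES = {
    "spool": StorageTarget("spoolstorage-example", "print-jobs"),
    "device_data": StorageTarget("devicedata-example", "device-print-data"),
}


def target_for(kind):
    """Return the storage target for a data kind, rejecting unknown kinds."""
    try:
        return ROUTES[kind]
    except KeyError:
        raise ValueError(f"unknown data kind: {kind!r}") from None
```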

With this action, the root cause of the problem was resolved and the delivery of device-specific print data was restored to its pre-incident speed. However, emergency measures were left in place to avoid any saturation of resources.

By Tuesday evening, uniFLOW Online was largely back to normal operational parameters. With the configuration changes to our storage and the mitigations in place, we continued to monitor the situation closely.

Next Steps

We apologize for the impact to affected customers. We are continuously taking steps to improve the uniFLOW Online Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

· Additional metrics and alerting were put in place.
· A review of other deployments was conducted, and the changes and lessons learnt from this incident will be deployed into other regions.
· Data access and storage account utilization were reviewed, and improvement actions were scheduled within development.
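The additional metrics and alerting listed above are not described in detail. As one hedged example of what latency alerting can look like, the sketch below fires when the 95th percentile of recent storage latencies exceeds a threshold. The class name `LatencyAlert`, the window size and the threshold are hypothetical and are not taken from the actual monitoring setup.

```python
from collections import deque
from statistics import quantiles


class LatencyAlert:
    """Fire when the p95 latency over a rolling window exceeds a threshold."""

    def __init__(self, window, threshold_ms):
        self._samples = deque(maxlen=window)  # keeps only the newest samples
        self.threshold_ms = threshold_ms

    def record(self, latency_ms):
        self._samples.append(latency_ms)

    def p95(self):
        if len(self._samples) < 2:
            return None  # not enough data to compute a percentile
        # n=20 yields 19 cut points; the last one is the 95th percentile.
        return quantiles(self._samples, n=20)[-1]

    def firing(self):
        p = self.p95()
        return p is not None and p > self.threshold_ms
```

Alerting on a high percentile rather than the average means a slow tail of requests is noticed even while most requests are still fast.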


Note: this postmortem covers both the March 7th and March 8th incidents, as they are a continuation of the same issue.

Posted Mar 18, 2022 - 12:16 UTC

Resolved

Hello Everyone,

We are moving this incident to resolved.
After a full investigation, we will prepare a post mortem for this incident, which will be published within 5 days.

uniFLOW Operations Team

Comment Update: 16-3-2022
The post mortem is still being worked on and will be released soon.

Posted Mar 08, 2022 - 14:01 UTC

Monitoring

Hello Everyone,

We have successfully applied mitigation controls, restoring the EU deployment to normal operations. We are no longer recording delays to Print and Scan processes and will monitor for a further 30 minutes before we close this case.

Posted Mar 08, 2022 - 13:25 UTC

Identified

Hello,

We have identified the cause of this incident and are working to rectify it. This requires some configuration changes, which we are applying carefully while closely monitoring the recovery. There has already been a marked improvement in processing times, which will continue to improve as the performance improvements are rolled out.

Another update on our progress will be posted within the next hour.

Posted Mar 08, 2022 - 12:11 UTC

Update

We are continuing to investigate this issue.

Posted Mar 08, 2022 - 11:00 UTC

Update

We are continuing to investigate this issue.

Posted Mar 08, 2022 - 10:59 UTC

Investigating

Identified:
8 Mar 2022, 9:00 UTC

Incident Scope:
This has only been detected on the EU deployment.

Description:
We are seeing a continuation of processing delays from yesterday. This was largely mitigated towards the end of the incident on the 7th.

Unfortunately, this has resurfaced and the Operations team is investigating this incident with high priority.

Our telemetry and reports have confirmed that the release of cloud-based print jobs is delayed. To a slightly lesser extent, this can also impact scan processing, delaying jobs on their way to the destination service.

Next Update:
As and when the situation changes.

Posted Mar 08, 2022 - 10:58 UTC

This incident affected: EU Deployment (Printing, Email print, Scanning).