AU: uniFLOW Online SmartClients lose connection to uniFLOW Online (Resurfaced)
Incident Report for uniFLOW Online
Postmortem

User Impact

During the incident, users experienced the uniFLOW SmartClient disconnecting from the cloud and being forced into Emergency Mode. Printing remained possible, but with reduced usability and functionality.

Scope of Impact

This issue was isolated to our AU (Australian) Deployment.

Incident Start Date and Time

Jun 20, 2022, 1:30 AM UTC

Incident End Date and Time

Jul 07, 2022, 8:30 AM UTC (Mitigation controls in place and no further issues detected. The status page remained in a monitoring state until the 12th of July).

Root Cause

This issue originally presented as a resource issue with the Azure IoT Hub. However, the number of connections and the resources consumed did not match expected usage patterns. We validated this by comparing against several of our other, larger Azure deployments.

Suspecting a possible issue in the IoT stack, we redeployed the IoT Hub into a new Azure Data Centre. The issue quickly reappeared, so we raised a support case with Microsoft Support Engineers to look deeper into the IoT metrics.

The investigation found that the SmartClient was suddenly making multiple connections to the IoT Hub. This multiplying effect produced a huge rise in connections relative to the number of SmartClients and the expected usage pattern.
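The signal described above, connections rising out of proportion to the number of SmartClients, can be expressed as a simple ratio check. The following is a minimal sketch only: the function names, the threshold of 1.5 and the sample figures are assumptions made for illustration and are not taken from the uniFLOW Online monitoring setup.

```python
def connection_ratio(active_connections: int, registered_clients: int) -> float:
    """Return the number of hub connections per registered client."""
    if registered_clients <= 0:
        raise ValueError("registered_clients must be positive")
    return active_connections / registered_clients


def exceeds_expected_usage(
    active_connections: int,
    registered_clients: int,
    max_ratio: float = 1.5,  # hypothetical threshold, chosen for illustration
) -> bool:
    """Flag a deployment whose connection count outgrows its client count."""
    return connection_ratio(active_connections, registered_clients) > max_ratio


# Illustrative figures only: roughly one connection per client is treated as normal.
print(exceeds_expected_usage(1000, 1000))
print(exceeds_expected_usage(5000, 1000))
```

A check of this kind compares a deployment against its own client count rather than against an absolute limit, which is what distinguished this incident from an ordinary capacity shortfall.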

Deeper analysis traced the root cause to the uniFLOW SmartClient being used in environments with an unusual network configuration in which more than one proxy may be used. When the SmartClient connects through a second proxy, it establishes a second connection, which causes the first connection to be disconnected. The first connection then retries, after which the two connections continuously disconnect and reconnect, resulting in resource exhaustion.
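The disconnect and reconnect loop described above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the SmartClient or Azure IoT Hub implementation: the `Hub` class, the `simulate` function, the path names and the rule that a new connection for the same client identity displaces the previous one are all modelled here purely for illustration.

```python
from collections import deque


class Hub:
    """Toy hub that keeps at most one active connection per client identity."""

    def __init__(self):
        self.active = {}          # client_id -> path currently connected
        self.connect_events = 0   # total connections the hub had to accept

    def connect(self, client_id, path):
        """Register a connection and return the path it displaced, if any."""
        displaced = self.active.get(client_id)
        self.active[client_id] = path
        self.connect_events += 1
        return displaced if displaced != path else None


def simulate(paths, steps):
    """Count connection events for one client over a fixed number of steps.

    Each path connects once; a path that gets displaced retries afterwards.
    """
    hub = Hub()
    pending = deque(paths)        # paths waiting to connect or reconnect
    for _ in range(steps):
        if not pending:
            break                 # steady state: nothing left to reconnect
        path = pending.popleft()
        displaced = hub.connect("client-1", path)
        if displaced is not None:
            pending.append(displaced)   # the dropped path retries
    return hub.connect_events


print(simulate(["proxy-a"], steps=100))
print(simulate(["proxy-a", "proxy-b"], steps=100))
```

In this model a client with a single path connects once and settles, while a client with two paths generates a new connection on every step for as long as the simulation runs. That unbounded churn is the behaviour that exhausts hub resources.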

IoT Hub resource allocation has been optimized based on this finding to ensure a stable uniFLOW Online environment. Our development team has already improved the uniFLOW SmartClient to better handle proxied environments, and the fix is currently going through our Quality Control process. Once this fix has been tested and verified, it will be included in the 2022.3 release of uniFLOW Online.

Next Steps

We apologize for the impact to affected customers. We are continuously taking steps to improve the uniFLOW Online Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • We will make the updated uniFLOW SmartClient available so that tenant admins can package and redeploy it to all users. This is planned for 2022.3, which is scheduled for release in September.
  • We are further optimizing our product to reduce unnecessary resource utilization and network usage. These optimizations will also be available with 2022.3.
  • Additional metrics, monitoring and alerting have been implemented. This is key to managing growth as the AU deployment expands and as we roll out the updated uniFLOW SmartClient.

Customer Actions

To resolve these issues completely, it is important that the uniFLOW SmartClient is updated after the 2022.3 release scheduled for September. We will communicate this, along with instructions, via the uniFLOW Online Status Page and the Admin Notification Widget.

Posted Jul 13, 2022 - 17:08 UTC

Resolved
Hello Everyone,

After an exhaustive investigation and superb support from the Microsoft engineering team, we have identified the root cause and put mitigation controls in place. At this time, we are not seeing any issues, either reported by customers or in our telemetry.

A detailed post-mortem will be published shortly. In the meantime, we can share that the root cause was a specific environmental configuration that the SmartClient mishandled. If companies are switching between proxies or sitting behind multiple proxies, the SmartClient will create multiple connections and exhaust resources faster than expected.

We are working on updates to the SmartClient to handle this condition in our 2022.3 September release. The post-mortem will include further details and the actions required to ensure you are on the latest version.

Thanks for your patience during this investigation.

Kind Regards

uniFLOW Online Operations Team.
Posted Jul 12, 2022 - 11:30 UTC
Update
Hello Everyone,

We are seeing great results with the mitigation controls currently in place. We will continue to monitor throughout this week and provide a further update shortly.

Kind Regards
uniFLOW Online Operations Team.
Posted Jul 07, 2022 - 08:03 UTC
Monitoring
Dear all,
Together with Microsoft, we have made some headway today, and steps have been taken to mitigate the issue. We are moving this incident into a 'monitoring' state and will continue to work with Microsoft to implement a permanent solution.
Posted Jun 30, 2022 - 20:40 UTC
Update
We are continuing our investigation with Microsoft. At this point, all available logs and telemetry data point to an issue specific to the Australian (AU) Deployment (Infrastructure/Datacenter). We do not expect this issue to appear in other regions but remain vigilant.
Posted Jun 30, 2022 - 11:02 UTC
Update
Investigation with Microsoft is continuing but SmartClients are already reconnecting.
We will keep you updated about the ongoing investigation and further details.
Posted Jun 29, 2022 - 09:32 UTC
Update
Hello Everyone,

We still have this issue under investigation and are working to identify the cause.

Regards
uniFLOW Online Operations Team
Posted Jun 29, 2022 - 06:21 UTC
Update
We are continuing to investigate this issue.
Posted Jun 29, 2022 - 02:33 UTC
Investigating
Hello Everyone,

Our monitoring and telemetry indicate that SmartClients are disconnecting from our uniFLOW Online Cloud Service. This issue was initially experienced earlier in the week but was resolved through improvements to resource scaling. We are currently experiencing a similar issue, but this time it is not caused by any resource limits.

Our operations team are working with Microsoft engineers to check the underlying Azure service infrastructure.

Important: Printing will continue to work. In the event of the SmartClient disconnecting from the cloud service it will switch to 'Emergency Mode' automatically. With this you can select a nearby printer to release your print jobs.

Please note: You can change your email notification subscription to only receive notifications affecting deployments you’d like to watch. To do this, click ‘Manage your subscription’ in the email footer of the status page email notification.

Kind Regards
uniFLOW Online Operations Team
Posted Jun 29, 2022 - 02:27 UTC
This incident affected: AU Deployment (Printing).