UiPath Services are failing intermittently in all the regions where our services are deployed

Incident Report for UiPath

Postmortem

Customer Impact

On Jan 25th 2023, starting from 7:05 AM UTC till 9:00 AM UTC, our Enterprise and Community services were inaccessible for some users and some other users would have experienced increased latency.

Root Cause

Our cloud services run on Azure. During this time, connections to our services from internet had increased latency or failed altogether.

Here is the excerpt from Azure status page on why this happened:

Azure made a change made to the Microsoft Wide Area Network (WAN) impacted connectivity between clients on the internet to Azure, connectivity across regions, as well as cross-premises connectivity via ExpressRoute. As part of a planned change to update the IP address on a WAN router, a command given to the router caused it to send messages to all other routers in the WAN, which resulted in all of them recomputing their adjacency and forwarding tables. During this re-computation process, the routers were unable to correctly forward packets traversing them. The command that caused the issue has different behaviors on different network devices.

Microsoft has committed to providing a full RCA next week. We will post again once we have more details.

Detection & Response

Our automated alerting system notified on call engineers about the problem within minutes of the outage impacting our applications. We investigated all possible scenarios and zeroed on Azure as potential cause. As soon as we identified the issue with Azure, we opened high severity case with Azure and they made us aware about the network/DNS problem.

All our services are configured for high availability with service components deployed in two different Azure regions in a geo. Since this issue was global, we couldn't employ quick mitigation and had to wait for Azure to fix the problem.

Posted Jan 27, 2023 - 15:52 UTC

Resolved

We are resolving this incident as all our services are working fine. We sincerely apologize for the inconvenience this incident may have caused.

Posted Jan 25, 2023 - 10:22 UTC

Monitoring

All services are back to normal as per our telemetry. But, please be noted that we could experience issues until Azure rollback WAN updates completely. More details are at https://status.azure.com/en-gb/status.

Posted Jan 25, 2023 - 09:53 UTC

Update

The backend azure services which were affected are recovering, we are checking the health of our applications before marking them operational

Posted Jan 25, 2023 - 09:41 UTC

Update

We are working with Azure. We will post an update as soon as we have new information.

Posted Jan 25, 2023 - 09:03 UTC

Identified

We suspect it is due to a DNS issue currently going on with Azure. We will provide more information as and when we have it.

Posted Jan 25, 2023 - 08:24 UTC

Update

We are continuing to investigate this issue.

Posted Jan 25, 2023 - 08:06 UTC

Update

We are continuing to investigate this issue.

Posted Jan 25, 2023 - 07:48 UTC

Update

We are continuing to investigate this issue.

Posted Jan 25, 2023 - 07:48 UTC

Investigating

We are currently investigating the issue.

Posted Jan 25, 2023 - 07:47 UTC

This incident affected: Automation Cloud, Orchestrator, Automation Hub, AI Center, Action Center, Apps, Automation Ops, Computer Vision, Cloud Robots - VM, Customer Portal, Data Service, Documentation Portal, Document Understanding, Insights, Integration Service, Marketplace, Process Mining, Task Mining, Test Manager, and Serverless Robots.