Errors in Multiple Services
Incident Report for UiPath
Postmortem

Customer impact

On September 23, 2022, from 6:08 PM until 8:22 PM UTC, customers accessing UiPath Automation Cloud, Automation Hub, and all other UiPath cloud services experienced an outage affecting some functionality across all regions. The outage prevented UiPath services from interacting with one another.

Document Understanding

Document Understanding was almost completely down. Tagging inside an already opened document session and calls to OCR or already published extractor endpoints were not impacted.

Automation Hub

Automation Hub was completely down for all FPS tenants (those who log in through Automation Cloud). Automation Hub Classic users were not impacted.

Data Service

Data Service was completely down for all tenants' requests from activities. Tenants' requests from the cloud were not impacted.

Notification Service

We observed partial failures during this time frame. Eleven accounts may have experienced undelivered notifications.

Connections Service

The Connections Service UI was completely down for any users who logged in during this time span, as well as any users redirected to maintain or create connections or triggers (e.g., from Studio or Apps). Trigger dispatch requests to Orchestrator were delayed.

Hypervisor

No impact on the ability to run jobs on existing pools or Elastic Robot Orchestration workflows. Limited impact on users’ ability to create new Automation Cloud Robot VMs.

Root cause

There were multiple overlapping issues. One of our Azure Active Directory applications is used only to run non-critical operations in deployment pipelines. During a deployment, the secret of this application expired. The combination of an in-progress deployment and an expired secret exposed a bug that wiped out the secret hashes that some of our services use to authenticate to one another.

After identifying the root cause, we restored the secret hashes by renewing the secret for the Azure AD application and redeploying. After the mitigation, we validated that all functionality was completely restored.
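
The failure mode described above, where a failed secret lookup ends up erasing stored hashes, can be guarded against by failing closed. The following is a minimal sketch only, not our actual pipeline code: the names `SecretFetchError`, `sync_secret_hashes`, and `deploy` are hypothetical, and SHA-256 is assumed purely for illustration.

```python
import hashlib
from typing import Callable, Dict


class SecretFetchError(Exception):
    """Hypothetical error: the secret store could not be read."""


def hash_secret(secret: str) -> str:
    # SHA-256 is an assumption for this sketch, not the real hashing scheme.
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()


def sync_secret_hashes(
    current_hashes: Dict[str, str],
    fetch_secret: Callable[[str], str],
) -> Dict[str, str]:
    """Build a complete new hash map, or raise before anything is changed."""
    new_hashes: Dict[str, str] = {}
    for name in current_hashes:
        secret = fetch_secret(name)  # may raise SecretFetchError
        if not secret:
            # Treat an empty result as a failure instead of hashing "nothing".
            raise SecretFetchError(f"empty secret returned for {name!r}")
        new_hashes[name] = hash_secret(secret)
    return new_hashes


def deploy(
    current_hashes: Dict[str, str],
    fetch_secret: Callable[[str], str],
) -> Dict[str, str]:
    """Return the hashes to publish; keep the last known-good set on failure."""
    try:
        return sync_secret_hashes(current_hashes, fetch_secret)
    except SecretFetchError:
        return dict(current_hashes)
```

In this sketch the new hashes are computed in full before they replace the old ones, so a secret store that becomes unreadable partway through leaves the existing hashes untouched.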

Detection

Within minutes, multiple alerts across the stack fired for failures in multiple UiPath products.

Response

Engineers across multiple services engaged quickly to investigate the issue. Because the authentication failures were caused by missing secret hashes, we had to work through the secrets involved to identify the problem and then establish a reliable process for restoring them to mitigate the incident.

Follow up

We understand how impactful and unacceptable this incident is, and we apologize deeply. We are continuously taking steps to improve the UiPath cloud platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

  • Conduct a deep postmortem of how this incident could have been prevented and how its impact could have been minimized.
  • Fix the original issue in our deployment pipeline that wiped out the secrets when key vault access issues occur.
  • Improve prerequisite validation before deployment operations.
  • Improve tracking and alerting of expiring secrets, even for non-critical scenarios.
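
The last two items above can be illustrated with a small check that runs before a deployment starts. This is a hedged sketch, not our actual tooling: `find_expiring_secrets`, `validate_prerequisites`, the 30-day warning window, and the expiry data shape are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def find_expiring_secrets(
    expiry_by_name: Dict[str, datetime],
    now: datetime,
    warn_within: timedelta = timedelta(days=30),  # assumed warning window
) -> List[str]:
    """Return names of secrets that expire at or before now + warn_within."""
    cutoff = now + warn_within
    return sorted(name for name, expiry in expiry_by_name.items() if expiry <= cutoff)


def validate_prerequisites(expiry_by_name: Dict[str, datetime], now: datetime) -> None:
    """Refuse to start a deployment if any required secret has already expired."""
    expired = sorted(name for name, expiry in expiry_by_name.items() if expiry <= now)
    if expired:
        raise RuntimeError("expired secrets block deployment: " + ", ".join(expired))
```

A check like this would cover secrets used only for non-critical pipeline steps as well, since every entry in the mapping is evaluated the same way.
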
Posted Sep 27, 2022 - 14:22 UTC

Resolved
This incident has been resolved.
Posted Sep 23, 2022 - 20:51 UTC
Monitoring
The fix has now been deployed to all services. Errors have dropped to 0. We will continue to monitor to ensure that all scenarios are working.
Posted Sep 23, 2022 - 20:27 UTC
Identified
The issue has been identified and we are working on a fix. The error rate for Document Understanding has begun to fall; we will continue to monitor it while we roll out the fix for the remaining products.
Posted Sep 23, 2022 - 20:20 UTC
Update
We are continuing to investigate this issue.
Posted Sep 23, 2022 - 20:10 UTC
Update
After further investigation, we've found more impacted services.
We are continuing to work on a mitigation.
Posted Sep 23, 2022 - 20:06 UTC
Update
We are continuing to investigate this issue.
Posted Sep 23, 2022 - 20:00 UTC
Investigating
A piece of our underlying infrastructure is failing. This is causing errors across many of our services.
Document Understanding is down globally.
Insights is partially down. Customers are unable to provision, but other scenarios are functional.
Posted Sep 23, 2022 - 19:00 UTC
This incident affected: Automation Hub, AI Center, Apps, Data Service, Document Understanding, Insights, and Task Mining.