What happened?
On 01/28/2023, customers in the Singapore region experienced login and authentication issues with Automation Cloud during two windows: 02:50 UTC to 04:05 UTC and 17:17 UTC to 18:14 UTC.
What went wrong and why?
At 02:50 UTC on 01/28/2023, a scheduled deployment was made to the regional gateway in Singapore. Due to a misconfiguration, certain requests were unable to reach our Identity service. Our alerting system caught this issue and the engineering team fixed the configuration at 04:05 UTC, which mitigated the issue.
Unfortunately, at 17:17 UTC on 01/28/2023, the same deployment was mistakenly re-run, causing a repeat outage. The same mitigation was applied at 18:14 UTC, which resolved the issue.
How are we making incidents like this less likely or less impactful?