Customer Impact
On January 25, 2023, from 7:05 AM UTC to 9:00 AM UTC, our Enterprise and Community services were inaccessible for some users, and other users experienced increased latency.
Root Cause
Our cloud services run on Azure. During this time, connections to our services from the internet had increased latency or failed altogether.
Here is an excerpt from the Azure status page explaining why this happened:
A change made to the Microsoft Wide Area Network (WAN) impacted connectivity between clients on the internet to Azure, connectivity across regions, as well as cross-premises connectivity via ExpressRoute. As part of a planned change to update the IP address on a WAN router, a command given to the router caused it to send messages to all other routers in the WAN, which resulted in all of them recomputing their adjacency and forwarding tables. During this re-computation process, the routers were unable to correctly forward packets traversing them. The command that caused the issue has different behaviors on different network devices.
Microsoft has committed to providing a full RCA next week. We will post again once we have more details.
Detection & Response
Our automated alerting system notified on-call engineers about the problem within minutes of the outage impacting our applications. We investigated all possible scenarios and zeroed in on Azure as the potential cause. As soon as we identified the issue with Azure, we opened a high-severity case with Azure, and they made us aware of the network/DNS problem.
All our services are configured for high availability, with service components deployed in two different Azure regions within a geo. Because this issue was global rather than confined to a single region, we could not employ a quick mitigation and had to wait for Azure to fix the problem.