This is our final post-incident report (PIR), which we aim to publish within 14 days of incident mitigation. It covers the AskCody Platform incident active between Jan 19, 08:18 CET and Jan 30, 11:52 CET.
Between Jan 19, 08:18 CET and Jan 25, 14:24 CET all customers in Europe experienced intermittent unavailability and high load times of the AskCody platform when using our add-ins and our Management Portal. Devices showing AskCody Displays and Dashboards were affected by this as well. This was the case for customers integrating with AskCody through Exchange Online or Exchange Server alike.
We determined that the intermittent unavailability and high load times were primarily caused by the compound side effects of several Microsoft network- and storage-level incidents coinciding with Microsoft's rollout of Basic Authentication deprecation in Exchange Online. On Jan 19th, as failed Exchange Online Basic Authentication attempts started to compound, we experienced simultaneous outages across multiple Azure service dependencies in the West Europe and North Europe regions, including Azure DNS, Azure Storage, Azure Application Insights, Azure Container Registry, Azure Database for MySQL, and Azure Virtual Machines (VMs). Other Microsoft services globally, including Microsoft 365, suffered from high network latency and timeouts throughout the incident period. Precursory network issues within the EU region, which ultimately led to the global Microsoft 365 outage on Jan 25th, added to the severity and duration of the AskCody incident, as confirmed by the Azure Support Engineering team we worked with throughout the incident. These network- and storage-level Azure outages, unknown to both Microsoft and ourselves at the time, made it harder to identify and mitigate the precise causes of the incident, while also severely hampering the automatic failover and self-healing abilities of our backend services, postponing the return of full service availability. We continue to investigate the nature of these self-healing issues to prevent future recurrence.
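To illustrate how failed authentication attempts can compound into a retry storm, the sketch below shows capped exponential backoff with full jitter, a standard way to keep a mass of 401 failures from amplifying load on an already degraded service. This is hypothetical Python; the function names and parameters are our own illustration, not part of the AskCody codebase:

```python
import random
import time

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Capped exponential backoff with full jitter.

    Retrying a failed auth attempt immediately turns every 401 into more
    traffic; randomized, growing delays spread retries out so failures
    do not compound.
    """
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_backoff(request, is_retryable, max_attempts=5, base=1.0):
    """Invoke `request` until it succeeds or attempts run out.

    `request` returns a status code; `is_retryable` decides whether a
    response (e.g. 401 during an auth outage) warrants another attempt.
    """
    response = None
    for attempt in range(max_attempts):
        response = request()
        if not is_retryable(response):
            return response
        time.sleep(backoff_delay(attempt, base=base))
    return response
```

With `base=1.0` and `cap=60.0`, the fifth retry waits a random delay of at most 16 seconds rather than hammering the endpoint immediately.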
The issue was detected by our primary operations team at the time. We immediately rolled back our application layer on all relevant services to their last known working state, while simultaneously monitoring the effect of the exponentially rising number of 401 responses from Exchange Online, the likely culprit at the time. As the simultaneous Azure outages in the West Europe region began impacting the AskCody platform, causing event queues on Azure Storage to return HTTP 500 errors and Azure Database for MySQL connections to be refused, we routed all EU traffic to our North Europe region. This, however, caused Microsoft Azure to automatically cap and block further outbound connections from our backend services, an unlikely side effect of the high number of failing Exchange Online Basic Authentication attempts.
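The kind of monitoring described above, watching the share of 401 responses from Exchange Online over a rolling window, can be sketched roughly as follows. This is an illustrative in-process monitor in Python, not our actual telemetry stack; the class name and thresholds are assumptions:

```python
from collections import deque
import time

class ErrorRateMonitor:
    """Rolling-window failure-rate monitor (illustrative sketch).

    Records response status codes and flags when the share of 401s
    over the last `window` seconds crosses `threshold`.
    """
    def __init__(self, window=300.0, threshold=0.25):
        self.window = window
        self.threshold = threshold
        self.events = deque()  # (timestamp, is_failure)

    def record(self, status, now=None):
        now = time.monotonic() if now is None else now
        self.events.append((now, status == 401))
        self._trim(now)

    def _trim(self, now):
        # Drop events that have aged out of the rolling window.
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()

    def failure_rate(self, now=None):
        now = time.monotonic() if now is None else now
        self._trim(now)
        if not self.events:
            return 0.0
        return sum(f for _, f in self.events) / len(self.events)

    def alert(self, now=None):
        return self.failure_rate(now) >= self.threshold
```

A real deployment would feed this from access logs or an APM pipeline; the point is simply that an exponentially rising 401 rate crosses such a threshold quickly and pages the on-call team.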
Our primary operations team quickly reestablished traffic to both EU regions, which restored outbound connectivity, and proceeded to patch our applications and infrastructure to mitigate each network- and storage-level Azure outage as it emerged. We did this continuously throughout the incident period, manually restoring each affected service when its ability to self-heal proved insufficient, culminating in the global Microsoft service outage at 07:05 UTC on Jan 25th.
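Manually restoring a service when self-healing proves insufficient can be pictured as a watchdog that escalates after repeated failed health checks. The sketch below is hypothetical Python; the probe and restart hooks are placeholders, not our actual tooling:

```python
def watchdog(check_health, restart, max_failures=3):
    """Return a `tick` callable that restarts a service after repeated failures.

    `check_health` is a probe returning True when the service responds;
    `restart` performs the recovery action. Both are illustrative hooks.
    """
    failures = 0

    def tick():
        nonlocal failures
        if check_health():
            failures = 0
            return "healthy"
        failures += 1
        if failures >= max_failures:
            restart()       # escalate: self-healing has not kicked in
            failures = 0
            return "restarted"
        return "degraded"

    return tick
```

In practice such a loop would run per service on a schedule, with the restart step replaced by the platform's own recovery action (redeploy, failover, or a manual runbook).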
Immediately after Microsoft fully mitigated its global outage at 12:43 UTC on Jan 25th, availability of all AskCody services was fully restored as well.
Following any incident, we thoroughly review the incident timeline, its causes, and the mitigation steps taken. Action items deemed both viable and feasible are implemented in our infrastructure, applications, and processes to prevent recurrence of similar incidents.
We are in the process of strengthening our ability to scale our infrastructure horizontally across additional Azure regions on very short notice, while also looking into additional storage redundancy options for our business-critical services. Additionally, we are currently reviewing how well our infrastructure adheres to the Azure Well-Architected Framework and will be taking steps to strengthen it where applicable.
We kindly ask customers connecting to AskCody through Microsoft Exchange Server to follow Microsoft's Exchange Server servicing recommendations. This means installing Cumulative Updates (CUs) and Security Updates (SUs) on all your Exchange servers, as well as staying on Exchange Server versions still supported by Microsoft.