This is our final post-incident report (PIR), which we aim to publish within 14 days of incident mitigation. It covers the AskCody Platform incident active between Jan 19, 08:18 CET and Jan 30, 11:52 CET.
Between Jan 19, 08:18 CET and Jan 25, 14:24 CET all customers in Europe experienced intermittent unavailability and high load times of the AskCody platform when using our add-ins and our Management Portal. Devices showing AskCody Displays and Dashboards were affected by this as well. This was the case for customers integrating with AskCody through Exchange Online or Exchange Server alike.
We determined that the intermittent unavailability and high load times were primarily caused by the compound side effects of several Microsoft network- and storage-level incidents coinciding with Microsoft's rollout of Basic Authentication deprecation in Exchange Online. On Jan 19th, as failed Exchange Online Basic Authentication attempts started to compound, we experienced simultaneous outages across multiple Azure service dependencies in the West Europe and North Europe regions, including Azure DNS, Azure Storage, Azure Application Insights, Azure Container Registry, Azure Database for MySQL, and Azure Virtual Machines (VMs). Other Microsoft services globally, including Microsoft 365, suffered from high network latency and timeouts throughout the incident period. Precursory network issues within the EU region, which ultimately led to the global Microsoft 365 outage on Jan 25th, added to the severity and duration of the AskCody incident, as confirmed by the Azure Support Engineering team we worked with throughout the incident. These network- and storage-level Azure outages, unknown to both Microsoft and ourselves at the time, made it harder to identify and mitigate the precise causes of the incident, while also severely hampering the automatic failover and self-healing abilities of our backend services, postponing the return of full service availability. We continue to investigate the nature of these self-healing issues to prevent future recurrence.
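To illustrate how failed authentication attempts can compound into a retry storm, the sketch below shows capped exponential backoff with full jitter, a standard way to keep a mass of 401 failures from amplifying load on an already degraded service. This is hypothetical Python; the function names and parameters are our own illustration, not part of the AskCody codebase:

```python
import random
import time

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Capped exponential backoff with full jitter.

    Retrying a failed auth attempt immediately turns every 401 into more
    traffic; randomized, growing delays spread retries out so failures
    do not compound.
    """
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_backoff(request, is_retryable, max_attempts=5, base=1.0):
    """Invoke `request` until it succeeds or attempts run out.

    `request` returns a status code; `is_retryable` decides whether a
    response (e.g. 401 during an auth outage) warrants another attempt.
    """
    response = None
    for attempt in range(max_attempts):
        response = request()
        if not is_retryable(response):
            return response
        time.sleep(backoff_delay(attempt, base=base))
    return response
```

With `base=1.0` and `cap=60.0`, the fifth retry waits a random delay of at most 16 seconds rather than hammering the endpoint immediately.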
The issue was detected by our primary operations team at the time. We immediately rolled back our application layer on all relevant services to their last known working state, while simultaneously monitoring the effect of the exponentially rising number of 401 responses from Exchange Online, the likely culprit at the time. As the simultaneous Azure outages in the West Europe region began impacting the AskCody platform, causing event queues on Azure Storage to return HTTP 500 errors and Azure Database for MySQL connections to be refused, we routed all EU traffic to our North Europe region. This, however, caused Microsoft Azure to automatically cap and block further outbound connections from our backend services, an unlikely side effect of the high number of failing Exchange Online Basic Authentication attempts.
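The kind of monitoring described above, watching the share of 401 responses from Exchange Online over a rolling window, can be sketched roughly as follows. This is an illustrative in-process monitor in Python, not our actual telemetry stack; the class name and thresholds are assumptions:

```python
from collections import deque
import time

class ErrorRateMonitor:
    """Rolling-window failure-rate monitor (illustrative sketch).

    Records response status codes and flags when the share of 401s
    over the last `window` seconds crosses `threshold`.
    """
    def __init__(self, window=300.0, threshold=0.25):
        self.window = window
        self.threshold = threshold
        self.events = deque()  # (timestamp, is_failure)

    def record(self, status, now=None):
        now = time.monotonic() if now is None else now
        self.events.append((now, status == 401))
        self._trim(now)

    def _trim(self, now):
        # Drop events that have aged out of the rolling window.
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()

    def failure_rate(self, now=None):
        now = time.monotonic() if now is None else now
        self._trim(now)
        if not self.events:
            return 0.0
        return sum(f for _, f in self.events) / len(self.events)

    def alert(self, now=None):
        return self.failure_rate(now) >= self.threshold
```

A real deployment would feed this from access logs or an APM pipeline; the point is simply that an exponentially rising 401 rate crosses such a threshold quickly and pages the on-call team.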
Our primary operations team quickly reestablished traffic to both EU regions, which restored outbound connectivity, and proceeded to patch our applications and infrastructure to mitigate each network- and storage-level Azure outage as it emerged. We did this continuously throughout the incident period, manually restoring each affected service when its ability to self-heal proved insufficient, culminating in the global Microsoft service outage at 07:05 UTC on Jan 25th.
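Manually restoring a service when self-healing proves insufficient can be pictured as a watchdog that escalates after repeated failed health checks. The sketch below is hypothetical Python; the probe and restart hooks are placeholders, not our actual tooling:

```python
def watchdog(check_health, restart, max_failures=3):
    """Return a `tick` callable that restarts a service after repeated failures.

    `check_health` is a probe returning True when the service responds;
    `restart` performs the recovery action. Both are illustrative hooks.
    """
    failures = 0

    def tick():
        nonlocal failures
        if check_health():
            failures = 0
            return "healthy"
        failures += 1
        if failures >= max_failures:
            restart()       # escalate: self-healing has not kicked in
            failures = 0
            return "restarted"
        return "degraded"

    return tick
```

In practice such a loop would run per service on a schedule, with the restart step replaced by the platform's own recovery action (redeploy, failover, or a manual runbook).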
Immediately after Microsoft fully mitigated its global outage at 12:43 UTC on Jan 25th, availability of all AskCody services was fully restored as well.
Following any incident, we thoroughly review the incident timeline, its causes, and the mitigation steps taken. Action items deemed both viable and feasible are implemented in our infrastructure, applications, and processes to prevent recurrence of similar incidents.
We are in the process of strengthening our ability to scale our infrastructure horizontally across additional Azure regions on very short notice, while also looking into additional storage redundancy options for our business-critical services. Additionally, we are currently reviewing how well our infrastructure adheres to the Azure Well-Architected Framework and will be taking steps to strengthen it where applicable.
We kindly ask customers connecting to AskCody through Microsoft Exchange Server to follow Microsoft's Exchange Server servicing recommendations. This means installing Cumulative Updates (CUs) and Security Updates (SUs) on all your Exchange servers, as well as staying on Exchange Server versions still supported by Microsoft.