All Outlook.com calendars experienced a major loss of functionality for at least 40 hours. During this period we were operating purely from our cache of their schedules.
As Microsoft’s product naming can be confusing, to be clear: this only affected Outlook.com, Microsoft’s more consumer-orientated offering, known over the years as Hotmail and Live.com. Microsoft 365 and on-premises Exchange were unaffected.
On Tuesday 1st March at around 23:00 UTC it appears that Microsoft made a change to their infrastructure which meant all our requests to interact with Outlook.com calendars began to fail.
By Thursday 3rd March at around 15:00 UTC, approximately 40 hours later, we managed to restore service for roughly 90% of Outlook.com calendars.
Having had no success in our efforts to communicate with Microsoft, on Friday 4th March at around 09:30 UTC we decided to take more drastic action to give the remaining 10% of Outlook.com calendar users a route to restoring their service: implementing a new mechanism for authorizing Cronofy’s access to their calendars. This was made available at around 16:30 UTC the same day.
The remaining 10% of Outlook.com calendars received a notification that they needed to reauthorize Cronofy’s access to their calendar by Friday 4th March 23:00 UTC.
By far the most disappointing part of this incident was how long it took us to notice there was an issue. With hindsight, we had received informational severity alerts shortly after 23:00 UTC on Tuesday, when the issue started, but these were missed by the team.
For background, at Cronofy we have three levels of alert:
Informational alerts are delivered to a Slack channel and can cover things not needing any direct attention: areas we are interested in keeping a further eye on, or early signs of a potential issue. The next level is "review soon"; these go into PagerDuty as a low severity alert assigned to an on-call engineer, generally for review the next working day. The highest level is "look now", where an on-call engineer is paged to investigate, regardless of the time of day. Often the idea of informational and "review soon" alerts is to provide more color around the impact of a "look now" alert, which may be triggered by a single metric.
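As an illustration, the three-tier routing described above can be sketched as a simple dispatch function. This is a hypothetical sketch based on the levels described here; the names and destination strings are illustrative, not Cronofy's actual tooling.

```python
# Hypothetical sketch of three-tier alert routing. Severity names are
# taken from the description above; destinations are illustrative only.

def route_alert(severity: str) -> str:
    """Return where an alert of the given severity should be delivered."""
    if severity == "informational":
        # Posted to Slack; no direct attention required.
        return "slack:#alerts"
    if severity == "review_soon":
        # Low severity PagerDuty alert, assigned to on-call for review
        # the next working day.
        return "pagerduty:low"
    if severity == "look_now":
        # Pages the on-call engineer immediately, any time of day.
        return "pagerduty:page"
    raise ValueError(f"unknown severity: {severity}")
```

The value of the lower tiers is context: when a "look now" fires on a single metric, the surrounding informational alerts help describe the blast radius.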
It took a couple of support tickets relating to Outlook.com calendars, received by our support team on Thursday morning (UK time) and flagged to our engineering team, for us to realise the extent of the problem. This was roughly 36 hours after the start of the issue.
Once the extent of the issue had been recognized, this public-facing incident was opened.
We quickly identified that we were consistently receiving 503 Service Unavailable responses from Microsoft. This response code is usually indicative of a temporary issue on the service provider's side which we just have to wait out. However, as we had been seeing it for over 36 hours at this point, we worked on the assumption there was something under our control that could resolve the issue. We therefore started running experiments with alterations to our integration that might help, whilst attempting to reach someone at Microsoft who might be able to resolve the underlying issue.
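For context, treating a 503 as a transient error usually means retrying with exponential backoff before giving up. The sketch below shows that general pattern, under the assumption of a generic HTTP client; it is not Cronofy's integration code.

```python
import time

def request_with_backoff(send_request, max_attempts=5, base_delay=1.0):
    """Retry a request on 503 responses with exponential backoff.

    `send_request` is any callable returning an object with a
    `status_code` attribute (e.g. a wrapper around an HTTP client call).
    """
    for attempt in range(max_attempts):
        response = send_request()
        if response.status_code != 503:
            return response
        # A 503 usually signals a temporary upstream problem, so wait
        # progressively longer before each retry.
        time.sleep(base_delay * (2 ** attempt))
    return response  # still failing after all attempts
```

The limitation this incident exposed is that backoff only helps with genuinely transient failures; a 503 caused by an upstream infrastructure change will never clear on its own, however long you wait.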
Various sets of changes were attempted without success until, while reviewing Microsoft's API documentation, we found mention of an optional header we could add to our requests: x-AnchorMailbox. This seemed promising because 503 statuses are often returned by load balancers or firewalls responsible for routing requests to the correct place, and headers like x-AnchorMailbox often help them route requests to the correct location more easily.
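Concretely, adding the header to a calendar API request might look something like the sketch below. Only the x-AnchorMailbox header name itself comes from Microsoft's documentation; the function name and token handling are illustrative assumptions.

```python
def build_request_headers(access_token: str, anchor_mailbox: str) -> dict:
    """Build headers for an Outlook.com calendar API request.

    `anchor_mailbox` identifies the target mailbox (e.g. the account's
    email address) so that routing infrastructure can direct the request
    to the correct backend without having to resolve it itself.
    """
    return {
        "Authorization": f"Bearer {access_token}",
        # Optional routing hint: the header that restored synchronization
        # for roughly 90% of Outlook.com calendars in this incident.
        "x-AnchorMailbox": anchor_mailbox,
    }

# Usage (illustrative values):
headers = build_request_headers("<token>", "user@example.com")
```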
The addition of this header, using the account's email address as its value, sprang the sync of a large number of Outlook.com calendars back to life at around 15:00 UTC on Thursday. We were premature in announcing this had resolved the issue for all Outlook.com calendars; in reality it was closer to 90% of them.
Further efforts were made to resolve the problem for the remaining 10% of Outlook.com calendars, but none bore fruit. We were able to identify that the large majority of the calendars still experiencing issues were using a custom domain for their account, but not all. Our theory was that, due to the presence of custom domains, we needed to provide the ID of the mailbox in the x-AnchorMailbox header, but this ID was not available through any of the endpoints at our disposal with the authentication tokens we held for these users.
At this point we were into the evening for the team and chose to pause our experimentation and regroup in the morning. We were at a crossroads, facing the need for some drastic intervention, and we did not want to take that decision lightly. Therefore, we chose to continue trying to get a resolution from Microsoft overnight before making the call. Our integration for Outlook.com calendars had been unchanged for a long period prior to Tuesday, and so we were optimistic something could be reverted on their side to fix the remaining 10% without the need for drastic action on our part.
Come Friday morning, 09:00 UTC, we had not had a resolution from Microsoft and the remaining 10% of Outlook.com calendars were still unable to synchronize their schedules. Therefore we defined and began to execute a contingency plan to replace our authorization mechanism for Outlook.com calendars.
This was ready to go around 15:30 UTC, at which point we made the call to move forward with the switch. The change was deployed around 16:00 UTC and enabled at 16:15 UTC. Around 15 minutes later we deployed a further change that would start sending the remaining 10% of Outlook.com calendars still experiencing issues through our relinking process. This would give us a fresh set of credentials via the new mechanism, providing us with the ID of the mailbox rather than just the email address, and we expected this would resolve the issue for the remaining Outlook.com calendars.
Roughly 15 minutes later we saw someone from that cohort reconnect their Outlook.com calendar and the synchronization with their calendar become healthy again, validating our theory.
We continued to monitor and saw further successes, building our confidence that everyone with an Outlook.com calendar now had a route to a successful synchronization link, albeit requiring their intervention in some cases.
The following morning, after a review of the current status, we closed the incident.
By far the most significant problem within this incident was the missing high severity alerting around Outlook.com calendars. This alerting has now been put in place; it was already in place for all the other calendar services we support, but Outlook.com had unfortunately been missed.
A contributing factor to the length of time it took us to identify the incident was the timing of the informational alerts we did receive. Our engineering team is based in the UK and Europe, so by 23:00 UTC no one is actively working, and alerts posted overnight are skimmed the following morning. This timing and process led to no one spotting that the Outlook.com informational alerts did not have a corresponding closure message.
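One mechanical way to surface the gap described above is to pair open and closure messages in the alert stream and flag any alert left open. This is a hypothetical sketch of that idea, not our actual tooling:

```python
def unclosed_alerts(events):
    """Given (alert_id, event_type) pairs in time order, return the ids
    of alerts that were opened but never received a closure message."""
    open_ids = set()
    for alert_id, event_type in events:
        if event_type == "open":
            open_ids.add(alert_id)
        elif event_type == "close":
            open_ids.discard(alert_id)
    return open_ids
```

Run over a morning's worth of overnight alerts, anything still in the returned set is an alert a human should look at rather than skim past.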
To this end, we are also looking more holistically at our alerting to avoid such things slipping through the cracks in future.
Specifically we are looking at:
Both of these aim to reduce the possibility of similar alerts being missed by reducing the noise around them and increasing their signal over time. This will mean that unless alerting is entirely absent, which should never be the case, it is much less likely it will go unnoticed for anywhere near as long.
We are comfortable that the time from identification to resolution of this incident was reasonable given the nature of the issue. Roughly 90% of Outlook.com calendars were successfully synchronizing within 4 hours of our investigation starting, with the remaining 10% of Outlook.com calendars being given a path to successful synchronization after we quickly turned around a major change the following day.
Our deployment pipeline and tooling enabled us to investigate and experiment safely and rapidly towards the eventual solution to this issue.
Whilst we communicated clearly during the incident, we did not meet our internal guidance on how frequently we provide status updates. For example, we should have provided an update by 10:00 UTC on the Friday to make it clear we were still working on the incident, but did not post an update until after 13:00 UTC, nearly 20 hours after the previous one. We will be updating our internal guidance around communication, with a focus on multi-day incidents.
If you have any further questions, please contact us at support@cronofy.com.