All Outlook.com calendars experienced a major loss of functionality for at least 40 hours. During this period we were operating purely from our cache of their schedules.
As Microsoft’s product naming can be confusing, to be clear: this only affected Outlook.com, Microsoft’s more consumer-orientated offering, known over the years as Hotmail and Live.com. Microsoft 365 and on-premises Exchange were unaffected.
On Tuesday 1st March at around 23:00 UTC it appears that Microsoft made a change to their infrastructure which meant all our requests to interact with Outlook.com calendars began to fail.
By Thursday 3rd March at around 15:00 UTC, approximately 40 hours later, we managed to restore service for roughly 90% of Outlook.com calendars.
Having had no success in our efforts to communicate with Microsoft, on Friday 4th March at around 09:30 UTC we decided to take more drastic action to give the remaining 10% of Outlook.com calendar users a route to restoring their service: implementing a new mechanism for authorizing Cronofy’s access to their calendars. This was made available at around 16:30 UTC the same day.
The remaining 10% of Outlook.com calendars received a notification that they needed to reauthorize Cronofy’s access to their calendar by Friday 4th March 23:00 UTC.
By far the most disappointing part of this incident was how long it took us to notice there was an issue. With hindsight, we had received informational severity alerts shortly after 23:00 UTC on Tuesday, when the issue started, but these were missed by the team.
For background, at Cronofy we have three levels of alert:
Informational alerts are delivered to a Slack channel and can cover things not needing any direct attention: areas we are interested in keeping a further eye on, or early signs of a potential issue. The next level is "review soon"; these go into PagerDuty as a low severity alert assigned to an on-call engineer, generally for review the next working day. The highest level is "look now", where an on-call engineer is paged to investigate, regardless of the time of day. Often the idea of informational and "review soon" alerts is to provide more color around the impact of a "look now" alert, which may be triggered by a single metric.
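As an illustration, the three-tier routing described above can be sketched as a simple dispatch function. This is a hypothetical sketch based on the levels described here; the names and destination strings are illustrative, not Cronofy's actual tooling.

```python
# Hypothetical sketch of three-tier alert routing. Severity names are
# taken from the description above; destinations are illustrative only.

def route_alert(severity: str) -> str:
    """Return where an alert of the given severity should be delivered."""
    if severity == "informational":
        # Posted to Slack; no direct attention required.
        return "slack:#alerts"
    if severity == "review_soon":
        # Low severity PagerDuty alert, assigned to on-call for review
        # the next working day.
        return "pagerduty:low"
    if severity == "look_now":
        # Pages the on-call engineer immediately, any time of day.
        return "pagerduty:page"
    raise ValueError(f"unknown severity: {severity}")
```

The value of the lower tiers is context: when a "look now" fires on a single metric, the surrounding informational alerts help describe the blast radius.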
It took a couple of support tickets relating to Outlook.com calendars, received by our support team on Thursday morning (UK time) and flagged to our engineering team, for us to realise the extent of the problem. This was roughly 36 hours after the start of the issue.
Once the extent of the issue had been recognized, this public-facing incident was opened.
We quickly identified that we were consistently receiving 503 Service Unavailable responses from Microsoft. This response code is usually indicative of a temporary issue on the service provider's side which we just have to wait out. However, as we had been seeing it for over 36 hours at this point, we worked on the assumption there was something under our control that could resolve the issue. We therefore started running experiments with alterations to our integration that might help, whilst attempting to reach someone at Microsoft who might be able to resolve the underlying issue.
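For context, treating a 503 as a transient error usually means retrying with exponential backoff before giving up. The sketch below shows that general pattern, under the assumption of a generic HTTP client; it is not Cronofy's integration code.

```python
import time

def request_with_backoff(send_request, max_attempts=5, base_delay=1.0):
    """Retry a request on 503 responses with exponential backoff.

    `send_request` is any callable returning an object with a
    `status_code` attribute (e.g. a wrapper around an HTTP client call).
    """
    for attempt in range(max_attempts):
        response = send_request()
        if response.status_code != 503:
            return response
        # A 503 usually signals a temporary upstream problem, so wait
        # progressively longer before each retry.
        time.sleep(base_delay * (2 ** attempt))
    return response  # still failing after all attempts
```

The limitation this incident exposed is that backoff only helps with genuinely transient failures; a 503 caused by an upstream infrastructure change will never clear on its own, however long you wait.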
Various sets of changes were attempted without success until, while reviewing Microsoft's API documentation, we found mention of an optional header we could add to our requests: x-AnchorMailbox. This seemed promising because 503 statuses are often returned by load balancers or firewalls responsible for routing requests to the correct place, and headers like x-AnchorMailbox often help them route requests to the correct location more easily.
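Concretely, adding the header to a calendar API request might look something like the sketch below. Only the x-AnchorMailbox header name itself comes from Microsoft's documentation; the function name and token handling are illustrative assumptions.

```python
def build_request_headers(access_token: str, anchor_mailbox: str) -> dict:
    """Build headers for an Outlook.com calendar API request.

    `anchor_mailbox` identifies the target mailbox (e.g. the account's
    email address) so that routing infrastructure can direct the request
    to the correct backend without having to resolve it itself.
    """
    return {
        "Authorization": f"Bearer {access_token}",
        # Optional routing hint: the header that restored synchronization
        # for roughly 90% of Outlook.com calendars in this incident.
        "x-AnchorMailbox": anchor_mailbox,
    }

# Usage (illustrative values):
headers = build_request_headers("<token>", "user@example.com")
```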
The addition of this header, using the account's email address as its value, sprang the sync of a large number of Outlook.com calendars back to life at around 15:00 UTC on Thursday. We were premature in announcing this had resolved the issue for all Outlook.com calendars; in reality it was closer to 90% of them.
Further efforts were made to resolve the problem for the remaining 10% of Outlook.com calendars, but none bore fruit. We were able to identify that the large majority of the calendars still experiencing issues were using a custom domain for their account, but not all. Our theory was that, due to the presence of custom domains, we needed to provide the ID of the mailbox in the x-AnchorMailbox header, but this ID was not available through any of the endpoints at our disposal with the authentication tokens we held for these users.
At this point we were into the evening for the team and chose to pause our experimentation and regroup in the morning. We were at a crossroads, facing the need for some drastic intervention, and we did not want to take that decision lightly. Therefore, we chose to continue trying to get a resolution from Microsoft overnight before making the call. Our integration for Outlook.com calendars had been unchanged for a long period prior to Tuesday, and so we were optimistic something could be reverted on their side to fix the remaining 10% without the need for drastic action on our part.
Come Friday morning, 09:00 UTC, we had not had a resolution from Microsoft and the remaining 10% of Outlook.com calendars were still unable to synchronize their schedules. Therefore we defined and began to execute a contingency plan to replace our authorization mechanism for Outlook.com calendars.
This was ready to go around 15:30 UTC, at which point we made the call to move forward with the switch. The change was deployed around 16:00 UTC and enabled at 16:15 UTC. Around 15 minutes later we deployed a further change that would start sending the remaining 10% of Outlook.com calendars still experiencing issues through our relinking process. This would give us a fresh set of credentials via the new mechanism, providing us with the ID of the mailbox rather than just the email address, and we expected this would resolve the issue for the remaining Outlook.com calendars.
Roughly 15 minutes later we saw someone from that cohort reconnect their Outlook.com calendar and the synchronization with their calendar become healthy again, validating our theory.
We continued to monitor and saw further successes, building our confidence that everyone with an Outlook.com calendar now had a route to a successful synchronization link, albeit requiring their intervention in some cases.
The following morning, after a review of the current status, we closed the incident.
By far the most significant problem within this incident was the missing high severity alerting around Outlook.com calendars. This alerting has now been put in place; it was already in place for all the other calendar services we support, but Outlook.com had unfortunately been missed.
A contributing factor to the length of time it took us to identify the incident was the timing of the informational alerts we did receive. Our engineering team is based in the UK and Europe, so by 23:00 UTC no one is actively working, and alerts posted overnight are skimmed the following morning. This timing and process led to no one spotting that the Outlook.com informational alerts did not have a corresponding closure message.
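One mechanical way to surface the gap described above is to pair open and closure messages in the alert stream and flag any alert left open. This is a hypothetical sketch of that idea, not our actual tooling:

```python
def unclosed_alerts(events):
    """Given (alert_id, event_type) pairs in time order, return the ids
    of alerts that were opened but never received a closure message."""
    open_ids = set()
    for alert_id, event_type in events:
        if event_type == "open":
            open_ids.add(alert_id)
        elif event_type == "close":
            open_ids.discard(alert_id)
    return open_ids
```

Run over a morning's worth of overnight alerts, anything still in the returned set is an alert a human should look at rather than skim past.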
To this end, we are also looking more holistically at our alerting to avoid such things slipping through the cracks in future.
Specifically we are looking at:
Both of these aim to reduce the possibility of similar alerts being missed by reducing the noise around them and increasing their signal over time. This will mean that unless alerting is entirely absent, which should never be the case, it is much less likely it will go unnoticed for anywhere near as long.
We are comfortable that the time from identification to resolution of this incident was reasonable given the nature of the issue. Roughly 90% of Outlook.com calendars were successfully synchronizing within 4 hours of our investigation starting, with the remaining 10% of Outlook.com calendars being given a path to successful synchronization after we quickly turned around a major change the following day.
Our deployment pipeline and tooling enabled us to investigate and experiment safely and rapidly towards the eventual solution to this issue.
Whilst we communicated clearly during the incident, we did not meet our internal guidance on how frequently we provide status updates. For example, we should have provided an update by 10:00 UTC on the Friday to make it clear we were still working on the incident, but did not post an update until after 13:00 UTC, nearly 20 hours after the previous one. We will be updating our internal guidance around communication, with a focus on multi-day incidents.
If you have any further questions, please contact us at support@cronofy.com.