Log-in Service Outages
Incident Report for Datto
Postmortem

Issue: On April 7, 2021, the systems engineering team promoted a kernel patch to core systems. The patch had been tested for an extended period against staged servers with no observed impact. However, 2.5 hours into the patching process, monitoring alerts began to come in for several of our core services. The patch caused a kernel panic that left the affected servers in a locked-up state, which required manual intervention to reboot them and roll back their patch levels.

Timeline/Impact Analysis: The initial patching of the staging environment occurred in the final week of March and was complete and in testing as of March 31, 2021. The environment was observed for one week before internal approval was given to begin patching our production environment. On April 7, 2021, the patch-level promotion was pushed to our fleet of servers. Alerting began 2.5 hours into the patching process, and the team immediately acted to determine the impact and identify the issue. Once the patching event was determined to be the cause, the promoted patch level was rolled back for the fleet. The rollback occurred 18 minutes after alerting began; however, several core services were already impacted, and bringing them back online caused the extended downtime. Specifically, manual intervention was required to reboot the affected VMs and servers that had received the patch-level promotion and experienced a kernel panic.

Corrective Action/Prevention Plan: The process for rolling out patch-level promotions has been reevaluated, as has the order of critical core services in the promotion process. Specifically, a growing list of servers will be exposed to each patch-level promotion, with several more testing periods, and all critical core services will be patched in a phased approach at the tail end of the patching process.
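As an illustration only, the phased approach described above could be sketched roughly as follows. This is not Datto's actual tooling: the function names (`plan_waves`, `roll_out`), the doubling wave size, the server names, and the `apply_patch` / `is_healthy` callbacks are all assumptions made for the example. The testing period between waves is reduced here to a single health check.

```python
from typing import Callable, List, Sequence, Set, Tuple


def plan_waves(servers: Sequence[str], critical: Set[str],
               first_wave: int = 1, growth: int = 2) -> List[List[str]]:
    """Split servers into rollout waves (hypothetical helper).

    Non-critical servers go first, in waves whose size is multiplied by
    `growth` each time. Critical servers are placed at the tail end,
    one server per wave.
    """
    if first_wave < 1 or growth < 1:
        raise ValueError("first_wave and growth must be at least 1")

    non_critical = [s for s in servers if s not in critical]
    tail = [s for s in servers if s in critical]

    waves: List[List[str]] = []
    size, start = first_wave, 0
    while start < len(non_critical):
        waves.append(non_critical[start:start + size])
        start += size
        size *= growth

    # Critical core services are patched last, one at a time.
    waves.extend([s] for s in tail)
    return waves


def roll_out(waves: List[List[str]],
             apply_patch: Callable[[str], None],
             is_healthy: Callable[[str], bool]) -> Tuple[List[str], bool]:
    """Patch wave by wave, halting after the first wave that is unhealthy.

    Returns the servers that were patched and whether the rollout completed.
    """
    patched: List[str] = []
    for wave in waves:
        for server in wave:
            apply_patch(server)
            patched.append(server)
        # Stand-in for the testing period between waves.
        if not all(is_healthy(s) for s in wave):
            return patched, False
    return patched, True


if __name__ == "__main__":
    # Placeholder server names, not real hosts.
    fleet = ["app1", "app2", "app3", "auth1", "app4", "dns1"]
    plan = plan_waves(fleet, critical={"auth1", "dns1"})
    print(plan)
    print(roll_out(plan, apply_patch=lambda s: None,
                   is_healthy=lambda s: s != "app3"))
```

In this sketch a failure in an early wave halts the rollout before any critical service is touched, which limits how many servers would need manual recovery.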

Posted May 05, 2021 - 19:12 UTC

Resolved
This incident has been resolved.
Posted Apr 09, 2021 - 18:42 UTC
Update
DNA checkin and web access services are now fully operational.

All Datto Services are now fully operational.

We will continue to monitor the status over the course of the day to confirm there are no additional problems.
Posted Apr 09, 2021 - 16:04 UTC
Monitoring
All services have been restored at this time.

Our DNA checkin and web access services are accessible; however, some bandwidth data and metrics will continue to be unavailable. Our Engineering team will be focusing their efforts on fully restoring this service.
Posted Apr 08, 2021 - 13:02 UTC
Update
We have restored SaaS services.

We will continue to provide updates as additional services are restored.
Posted Apr 08, 2021 - 03:21 UTC
Update
We have restored Phone and Contact Center services.

We will continue to provide updates as additional services are restored.
Posted Apr 08, 2021 - 02:03 UTC
Update
We have restored BCDR and Cloud Continuity Recovery services.

We will continue to provide updates as additional services are restored.
Posted Apr 08, 2021 - 01:10 UTC
Update
We have received reports that our internal Contact Center and Phone system have also been impacted by this outage and are in a degraded state.

Our Engineering team will investigate these reports as part of this outage and we will provide status updates as they arise.
Posted Apr 08, 2021 - 00:55 UTC
Update
We have restored Device, Off-site Synchronization, Remote Web, Network Manager, and Datto Commerce Services.

We will continue to provide updates as additional services are restored.
Posted Apr 08, 2021 - 00:46 UTC
Update
We have restored services to the Partner Portal and our Authentication services.

We will continue to provide updates as additional services are restored.
Posted Apr 07, 2021 - 23:57 UTC
Identified
We are currently aware of an outage to our Log-in Access Services such as our Partner Portal, Recovery Launchpad, Datto Store, help.datto.com, Networking Manager, as well as other services.

Our Engineering team has identified the root cause of the problem and is actively working to restore these services.

Currently there is no ETA on when these services will be restored.

We will continue to update the status of this outage as new information arises.
Posted Apr 07, 2021 - 23:26 UTC
Update
Additional SaaS Services have been identified as impacted and are being updated to reflect this.
Posted Apr 07, 2021 - 21:45 UTC
Update
We are continuing to investigate this issue.
Posted Apr 07, 2021 - 20:59 UTC
Update
Additional Networking Services have been identified as impacted and are being updated to reflect this.
Posted Apr 07, 2021 - 20:57 UTC
Update
Additional BCDR Services have been identified as impacted and are being updated to reflect this.
Posted Apr 07, 2021 - 20:56 UTC
Investigating
We are currently aware of an outage to our Log-in Access Services such as our Partner Portal, Recovery Launchpad, Datto Store, help.datto.com, Networking Manager, and some devices.

Our Engineering team is currently investigating the root cause of the problem.

Currently there is no ETA on when these services will be restored.

We will update the status of this outage as new information arises.
Posted Apr 07, 2021 - 20:32 UTC
This incident affected: Datto Networking (Network Manager), Datto Login Services, Partner Portal, Datto Phones (Phone Portal (Contact Center), Support Phones), Datto SaaS Protection (SaaS Protection Console Login, SaaS Protection Seat Management, SaaS Protection Client Onboarding), Datto BCDR (Device Checkin, Off-Site Synchronization, Recovery, Remote Web), and Cloud Continuity (Recovery).