Summary
From 15:20 UTC to 16:42 UTC on December 15, 2021, customers on US-hosted Duo deployments experienced issues accessing the Duo service. During this window, the majority of Duo services were intermittently inaccessible to affected customers, impacting user authentications and Admin Panel access.
This was the result of networking issues at one of our cloud infrastructure providers (AWS) that affected the us-west-1, us-west-2, and us-gov-west-1 AWS regions, which Duo leverages.
Confirmed affected Duo deployments:
DUO1, DUO2, DUO4, DUO5, DUO6, DUO7, DUO9, DUO10, DUO12, DUO13, DUO14, DUO15, DUO16, DUO17, DUO18, DUO19, DUO20, DUO21, DUO22, DUO23, DUO24, DUO26, DUO28, DUO31, DUO32, DUO33, DUO34, DUO35, DUO36, DUO37, DUO39, DUO40, DUO41, DUO42, DUO43, DUO44, DUO45, DUO46, DUO49, DUO50, DUO51, DUO52, DUO55, DUO56, DUO58, DUO59, DUO60, DUO61, DUO62, DUO63, DUO64, DUO65
Timeline 2021-12-15:
15:20 UTC - Duo Engineering receives availability alerts from production deployments and begins investigating.
15:22 UTC - Duo Engineering identifies that multiple US-based Duo deployments are intermittently unreachable and experiences difficulty in accessing other AWS-hosted cloud services.
15:30 UTC - Duo Engineering confirms that multiple AWS regions are affected in their entirety and that the impact is not limited to Duo, and begins incident response.
15:32 UTC - Duo Engineering continues assessing impact while experiencing difficulty accessing some of the systems we rely upon for understanding service health.
15:48 UTC - Duo Engineering confirms that all US-based Duo deployments are affected.
15:51 UTC - Duo Engineering continues monitoring for signs of recovery.
15:57 UTC - Duo Engineering observes that authentication traffic is starting to normalize.
16:11 UTC - Duo posts initial Status Page update for incident.
16:15 UTC - Duo Engineering confirms that the majority of Duo services are showing signs of recovery.
16:27 UTC - Duo updates the Status Page incident to Monitoring.
16:30 UTC - Duo Engineering identifies ongoing issues with Azure Conditional Access, on-premises Active Directory Sync, and Duo Single Sign-On and continues investigating.
16:42 UTC - Duo Engineering confirms Azure Conditional Access services are fully restored.
22:42 UTC - Duo updates incident status on status.duo.com to Resolved with additional information on verifying Duo Authentication Proxy service connectivity.
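As a quick check on the intervals implied by the timeline above, the following sketch computes the elapsed time from the first availability alert to three later milestones. The timestamps are copied from the timeline; the variable and function names are illustrative only.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all UTC, 2021-12-15).
FMT = "%Y-%m-%d %H:%M"
first_alert = datetime.strptime("2021-12-15 15:20", FMT)
status_page_posted = datetime.strptime("2021-12-15 16:11", FMT)
azure_ca_restored = datetime.strptime("2021-12-15 16:42", FMT)
status_resolved = datetime.strptime("2021-12-15 22:42", FMT)


def minutes_between(start, end):
    """Return the whole minutes elapsed between two datetimes."""
    return int((end - start).total_seconds() // 60)


alert_to_status_page = minutes_between(first_alert, status_page_posted)
alert_to_azure_restore = minutes_between(first_alert, azure_ca_restored)
alert_to_resolved = minutes_between(first_alert, status_resolved)

print(f"First alert to initial Status Page update: {alert_to_status_page} min")
print(f"First alert to Azure Conditional Access restored: {alert_to_azure_restore} min")
print(f"First alert to Status Page marked Resolved: {alert_to_resolved} min")
```

By this arithmetic, the initial Status Page update followed the first alert by 51 minutes, Azure Conditional Access was restored after 82 minutes, and the incident was marked Resolved after 442 minutes (7 hours 22 minutes).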
Details
Duo utilizes many premier cloud partners as part of our SaaS platform, including Amazon AWS. Per Amazon’s public status page ([https://status.aws.amazon.com/](https://status.aws.amazon.com)), AWS experienced network issues specific to the us-west-1, us-west-2, and us-gov-west-1 AWS regions on 2021-12-15. This issue affected connectivity to infrastructure hosted within the affected regions. Below is AWS’s summary of the incident:
"Between 7:14 AM PST (15:14 PM UTC) and 7:59 AM PST (15:59 PM UTC), customers experienced elevated network packet loss that impacted connectivity to a subset of Internet destinations. Traffic within AWS Regions, between AWS Regions, and to other destinations on the Internet was not impacted. The issue was caused by network congestion between parts of the AWS Backbone and a subset of Internet Service Providers, which was triggered by AWS traffic engineering, executed in response to congestion outside of our network. This traffic engineering incorrectly moved more traffic than expected to parts of the AWS Backbone that affected connectivity to a subset of Internet destinations. The issue has been resolved, and we do not expect a recurrence."
Duo’s platform spans multiple AWS regions and availability zones for redundancy. Within each region, we have redundancy across multiple availability zones (AZs). All infrastructure is configured in an Active / Active or Active / Passive topology with automatic recovery capabilities to ensure no single points of failure exist.
In addition to redundancy across multiple AZs within each region, we also leverage cross-region replication where possible. For example, data stores in the us-west-1 and us-west-2 AWS regions are replicated in real time to us-east-1 to enable recovery efforts.
Because this was a multi-region outage impacting both us-west-1 and us-west-2, recovery options were more limited than in a single-region failure scenario. After determining the root cause, we estimated that executing Disaster Recovery procedures to restore services in us-east-1 would be more disruptive to customers than waiting for AWS to resolve the networking issues.
Duo Azure Conditional Access services were down during the same time period but took longer to come back online, with functionality fully restored at 16:42 UTC. After further investigation and collaboration with AWS, we have confirmed that this was due to additional AWS infrastructure issues that were related to, but distinct from, the overarching network connectivity problems.
A software defect in the Duo Authentication Proxy prevented some Authentication Proxies from re-establishing connectivity with the Duo authentication service even after the AWS connectivity issues were resolved. Because Duo Single Sign-On and on-premises Active Directory Sync rely upon the Authentication Proxy, these capabilities continued to fail for impacted customers until customer administrators restarted the affected Authentication Proxies. Duo plans to email the customers we believe need to restart the Authentication Proxy service. This information was also provided on 2021-12-15 via Duo’s Status Page.
Opportunities for Improvement
Prompt incident identification and communication are primary areas of concern. We know service availability is vital to our customers and prompt communication helps our customers make informed decisions. We apologize and look to improve in the following areas:
Improve the time to identify an incident:
Status Page updates and communication to our stakeholders:
Our Duo Integrations team is already working on a fix for the software defect found in the Duo Authentication Proxy that caused some Authentication Proxies to not properly re-establish connectivity.
Failmode is an integration-level setting, available for some applications, that determines how the application behaves when Duo services cannot be reached. We have received feedback about unexpected Failmode behavior and are actively working on this. In the meantime, Duo’s Business Continuity Guide is our best resource for helping customers operate effectively in the event of an outage, and has more details about how Failmodes work for our various applications.
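As a rough illustration of the decision a fail-mode setting makes, the sketch below contrasts failing open (allowing access when the service cannot be reached) with failing closed (denying it). Every name here (`Failmode`, `ServiceUnreachable`, `authorize`) is hypothetical and chosen for illustration; none of them is a Duo API or configuration key, and actual behavior varies by application.

```python
from enum import Enum


class Failmode(Enum):
    """Hypothetical setting names; not Duo's actual configuration values."""
    FAIL_OPEN = "open"      # allow access if the service is unreachable
    FAIL_CLOSED = "closed"  # deny access if the service is unreachable


class ServiceUnreachable(Exception):
    """Raised by the (hypothetical) second-factor check when it cannot connect."""


def authorize(check_second_factor, failmode):
    """Return True if access should be granted.

    check_second_factor is any callable that returns True or False, or raises
    ServiceUnreachable when the authentication service cannot be reached.
    """
    try:
        return check_second_factor()
    except ServiceUnreachable:
        # The fail mode only matters when the service is unreachable.
        return failmode is Failmode.FAIL_OPEN


def unreachable_service():
    """Stand-in for a second-factor check during an outage."""
    raise ServiceUnreachable("authentication service not reachable")


print(authorize(unreachable_service, Failmode.FAIL_OPEN))    # True
print(authorize(unreachable_service, Failmode.FAIL_CLOSED))  # False
```

Note that in this sketch the setting has no effect while the service is reachable: an explicit denial is still a denial under either mode.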
Improving resilience is top of mind at Duo. The Duo team will use data collected during this incident to inform future infrastructure decisions regarding platform resilience. Our architectural improvements for outages aim to improve our resiliency, limit blast radius and failure domains, and keep complexity manageable. We will continue working to improve how we handle situations where an entire region is degraded, as happened in this incident, and to automate recovery.