Incident Report - 2023-02-07
On January 31, 2023, Duo received alerts of Duo Directory Synchronization failing for multiple customers. The Duo Engineering team paused the deployment of release D258 to limit customer impact while they investigated. The Engineering team’s investigation identified a D258 code change in the Duo core authentication service that caused a conflict with the Duo Admin Panel service when that service had not yet been updated to D258. Engineering deployed a code fix and resumed deployment of D258, reaching all impacted customer deployments on February 1, 2023.
2023-01-31 16:32 Duo Site Reliability Engineering (SRE) is informed via email of customers having problems running directory syncs. SRE begins triage.
2023-01-31 17:25 Status page updated to: “We have identified the cause of the issue and we are implementing a fix.”
2023-01-31 17:40 Root cause identified by the Engineering team.
2023-01-31 19:16 Fix implemented to our codebase.
2023-02-01 12:50 Fix deployed across our impacted customers.
2023-02-01 13:50 Status page updated, all our systems are operational.
On January 31, 2023 at 16:32 EST, Duo’s Site Reliability Engineering (SRE) team received monitoring alerts about multiple customers having problems running Duo Directory Synchronization with either OpenLDAP or Active Directory. Duo's Engineering Team paused the deployment of release D258 to limit customer impact while they investigated.
By January 31 at 17:40 EST, Duo Engineering traced the failed directory sync root cause to a D258 change to Duo's core authentication service. When a customer on Duo core D258 was not yet updated to Duo Admin Panel service D258, and that customer started a directory sync, the sync failed.
Engineering deployed a code fix to release D258 and resumed deployment. The fix reached all impacted customer deployments by February 1 at 12:50 EST.
Because the error could occur only while the Duo core and admin services were on two different release versions, Engineering determined that the resolution was to repair the code identified as the root cause and then allow deployment to finish. With all services on the same version, the condition would "self-heal".
The only customers at risk of impact from this incident were those who ran a Duo Directory Sync with OpenLDAP or Active Directory during the deployment of release D258 (from Thursday, January 26 until Engineering paused deployment on January 31). Six customers reported failed directory syncs during this release window. For customers whose directory sync failed, the sync was retried automatically, and the retry will have succeeded once D258 completed deployment.
Root cause analysis identified the need for stronger API version testing in Duo’s continuous integration pipeline and an opportunity to improve Directory Sync monitoring. These measures will increase the likelihood of identifying problems before they impact customers.
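As an illustration only, a mixed-version check of the kind described above might look like the following minimal sketch. It is not Duo's actual pipeline or code: the release label "pre-D258", the payload fields, and the functions sync_request, admin_accepts, and incompatible_pairs are all hypothetical, and the "api_version" bump is a stand-in for the real D258 change, which this report does not describe.

```python
from itertools import product

# Hypothetical release labels: "pre-D258" stands for any earlier release.
RELEASES = ["pre-D258", "D258"]


def sync_request(core_release):
    """Hypothetical payload the core service sends when a sync starts."""
    payload = {"action": "directory_sync", "api_version": 1}
    if core_release == "D258":
        # Stand-in for the D258 change; the real change is an assumption here.
        payload["api_version"] = 2
    return payload


def admin_accepts(admin_release, payload):
    """Hypothetical admin-side check: a release accepts the versions it knows."""
    supported = {"pre-D258": {1}, "D258": {1, 2}}
    return payload["api_version"] in supported[admin_release]


def incompatible_pairs():
    """Return every (core, admin) release pair for which a sync would fail."""
    return [
        (core, admin)
        for core, admin in product(RELEASES, repeat=2)
        if not admin_accepts(admin, sync_request(core))
    ]


if __name__ == "__main__":
    # A CI job could fail the build whenever this list is non-empty.
    print(incompatible_pairs())
```

Under these assumptions, the only failing pair is a core service on D258 talking to an admin service on an earlier release, which mirrors the mismatch described in this report. Running every release pair, rather than only matching versions, is what lets such a check catch problems that appear only mid-deployment.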