At 10:27 UTC on 15 April 2021, we observed elevated latencies and errors in our US-East point of presence, affecting all services. The initial cause was a spike in usage of the Access Manager service. The spike itself was resolved quickly, but the way the system handles errors and retries caused cascading effects that prolonged the incident as the issues moved downstream. The system recovered without intervention by 10:43 UTC, although recovery took much longer than designed because of these cascading effects.
In the short term, we are considering safeguards on errors and retries for the Access Manager service that would speed resolution if a similar issue were to occur. In coming sprints, the team will also look to re-architect how the system handles Access Manager errors, in order to prevent a bottleneck from moving downstream.
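
As an illustration only, one common form a retry safeguard can take is a bounded number of attempts combined with capped exponential backoff and jitter, so that clients do not retry in lockstep while a service is recovering. The sketch below is not a description of the Access Manager implementation or of the safeguards under consideration; `TransientError`, `backoff_delay`, `call_with_retries`, and all default values are hypothetical names and numbers chosen for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Jittered delay: a random fraction of min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(op: Callable[[], T], max_attempts: int = 4,
                      base: float = 0.1, cap: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    """Call op, retrying transient failures a bounded number of times."""
    for attempt in range(max_attempts):
        try:
            return op()
        except TransientError:
            if attempt == max_attempts - 1:
                # Out of attempts: surface the error instead of adding load.
                raise
            # Wait a randomized, growing interval before the next attempt.
            sleep(backoff_delay(attempt, base, cap, rng))
    raise ValueError("max_attempts must be at least 1")
```

The attempt limit bounds how much extra traffic each caller can generate during an outage, and the jitter spreads the retries that remain over time. Whether these particular limits suit a given service depends on its latency targets and traffic patterns.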