The web UI was inaccessible for 48 minutes (9:49 AM to 10:37 AM PT), and the streams page remained inaccessible for a further 48 minutes (until 11:25 AM PT).
09:49 AM PT: 100% of API requests for the webapp begin failing
09:52 AM PT: Database hits 100% resource utilization
10:31 AM PT: Divert all traffic away from the webapp so the database can recover
10:37 AM PT: Allow all traffic except operation/get traffic; the webapp is back up
11:25 AM PT: Allow operation/get traffic and incident is resolved
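Database errors led to transient unavailability. Aggressive retries then saturated the database, creating a self-reinforcing (positive) feedback loop: failed requests triggered retries, and the added retry load caused more requests to fail.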
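To make the retry amplification concrete, here is a minimal sketch of how immediate retries multiply the load on a struggling database. The function name, the traffic figures, and the retry limit are all hypothetical and are not taken from the incident data; the model simply assumes every failed attempt is retried immediately, up to a fixed limit.

```python
def offered_load(base_rps: float, failure_rate: float, max_retries: int) -> float:
    """Requests per second reaching the database when each failed attempt
    is retried immediately, up to max_retries times (illustrative model)."""
    # Attempt k is only made if the previous k attempts all failed,
    # so it contributes base_rps * failure_rate**k to the total load.
    return base_rps * sum(failure_rate ** k for k in range(max_retries + 1))


# Hypothetical numbers: 100 requests/s of user traffic, up to 3 retries each.
healthy = offered_load(100, 0.0, 3)   # 100.0: no failures, no extra load
degraded = offered_load(100, 0.5, 2)  # 175.0: half of all attempts fail
saturated = offered_load(100, 1.0, 3) # 400.0: every attempt fails
```

Under these assumptions the load is highest exactly when the database is least able to serve it, which is what keeps the loop going once it starts.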
We have updated our database configuration to limit the blast radius of failing or slow database calls. We have also implemented additional rate limiting to break the feedback loop in which many concurrent requests cause failures, and those failures in turn generate more requests.
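As an illustration of the kind of rate limiting described above, the following is a minimal token-bucket sketch. It is not the implementation deployed after this incident; the class name, parameters, and limits are assumptions chosen for the example.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical token-bucket limiter: allows bursts of up to `capacity`
    requests and a sustained rate of `rate` requests per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        """Return True if a request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A caller would check `allow()` before issuing a database call and reject or delay the request when it returns False, so a burst of retries is shed at the edge instead of reaching the database.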