On Saturday, April 30, 2022 at 6:23 am EST, we hit resource limits in our Prod2 MongoDB cluster that resulted in cascading failures throughout our system. After identifying MongoDB as the bottleneck, we scaled out the cluster, and the system recovered once the changes were applied.
Root Cause
During a cutover of traffic to this cluster, IOPS against the database increased drastically, exceeding the capacity we had reserved.
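The cluster's storage backend isn't named in this report, but as a minimal sketch, assuming the data volumes are provisioned-IOPS EBS volumes on AWS, an alarm like the following would page before consumption reaches the reserved ceiling rather than after. The volume ID, SNS topic, reserved-IOPS figure, and 80% threshold are all hypothetical.

```python
# Sketch of an early-warning IOPS alarm. Assumes EBS-backed storage,
# which this report does not confirm; volume ID, SNS topic ARN, and
# the reserved-IOPS figure below are hypothetical placeholders.
import boto3

PROVISIONED_IOPS = 3000   # hypothetical reserved IOPS for the volume
PERIOD_SECONDS = 300      # CloudWatch reports VolumeWriteOps as a Sum per period

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="prod2-mongodb-iops-high",
    Namespace="AWS/EBS",
    MetricName="VolumeWriteOps",
    Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],
    Statistic="Sum",
    Period=PERIOD_SECONDS,
    EvaluationPeriods=3,
    # Sum over the period divided by period length approximates average
    # IOPS; alert at 80% of the reservation so the page arrives before
    # the ceiling is hit, not after.
    Threshold=0.8 * PROVISIONED_IOPS * PERIOD_SECONDS,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-page"],
)
```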
Timeline
04/30 6:24 am - A high error rate in the Biz Rules Worker triggers a PagerDuty call.
04/30 6:25 - 6:42 am - Additional services start to alert.
04/30 6:35 am - Per the TSE team, the first customer report of system issues comes in.
04/30 6:43 am - Engineers join the war room Zoom to investigate.
04/30 7:02 am - Engineers determine we’ve reached hardware limitations on our Prod2 MongoDB cluster (see the diagnostic sketch after this timeline).
04/30 7:10 am - Resource changes are applied to the Prod2 MongoDB cluster (see the scaling sketch below).
04/30 7:20 am - Status page is updated to “Investigating degraded performance.”
04/30 7:23 am - Customers start reporting that the issue has been resolved.
04/30 8:15 am - We announce that the issue has been resolved and continue monitoring.
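As a rough illustration of the 7:02 am diagnosis, the check below reads standard MongoDB serverStatus metrics to distinguish I/O saturation from an application bug. The connection string is hypothetical; the serverStatus fields themselves are standard MongoDB metrics.

```python
# Sketch of a saturation check against the affected cluster.
# The host below is a hypothetical placeholder.
from pymongo import MongoClient

client = MongoClient("mongodb://prod2.internal:27017")
status = client.admin.command("serverStatus")

# Operations queued waiting on the storage engine: a sustained backlog
# here while opcounters keep climbing points at I/O saturation rather
# than an application-level bug.
queue = status["globalLock"]["currentQueue"]
print("queued readers:", queue["readers"], "queued writers:", queue["writers"])

# Cumulative operation counters since startup; diff two samples over a
# known interval to get ops/sec and compare against the reserved IOPS.
print("opcounters:", status["opcounters"])
```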
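And one plausible form of the 7:10 am resource change, under the same hypothetical EBS assumption as the alarm sketch above: raising provisioned IOPS in place on the cluster's data volumes. The volume IDs and new IOPS figure are made up; EBS volume modifications apply within minutes, which is consistent with the roughly ten-minute gap before customers reported recovery.

```python
# Sketch of raising provisioned IOPS on the cluster's data volumes.
# Volume IDs and the target IOPS are hypothetical; this assumes a
# volume type that supports provisioned IOPS (io1/io2/gp3).
import boto3

ec2 = boto3.client("ec2")
for volume_id in ["vol-0123456789abcdef0", "vol-0fedcba9876543210"]:
    resp = ec2.modify_volume(VolumeId=volume_id, Iops=6000)
    print(volume_id, resp["VolumeModification"]["ModificationState"])
```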
Lessons/Improvements