On Saturday, April 30, 2022 at 6:23 am EST, we hit resource limits in our Prod2 MongoDB cluster that resulted in cascading failures throughout our system. After identifying MongoDB as the bottleneck, we scaled out the cluster, and the system recovered once the changes were applied.
Root Cause
During a cutover of traffic to this cluster, IOPS against the database increased drastically, exceeding the capacity we had reserved.
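The cluster's storage backend isn't named in this report, but as a minimal sketch, assuming the data volumes are provisioned-IOPS EBS volumes on AWS, an alarm like the following would page before consumption reaches the reserved ceiling rather than after. The volume ID, SNS topic, reserved-IOPS figure, and 80% threshold are all hypothetical.

```python
# Sketch of an early-warning IOPS alarm. Assumes EBS-backed storage,
# which this report does not confirm; volume ID, SNS topic ARN, and
# the reserved-IOPS figure below are hypothetical placeholders.
import boto3

PROVISIONED_IOPS = 3000   # hypothetical reserved IOPS for the volume
PERIOD_SECONDS = 300      # CloudWatch reports VolumeWriteOps as a Sum per period

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="prod2-mongodb-iops-high",
    Namespace="AWS/EBS",
    MetricName="VolumeWriteOps",
    Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],
    Statistic="Sum",
    Period=PERIOD_SECONDS,
    EvaluationPeriods=3,
    # Sum over the period divided by period length approximates average
    # IOPS; alert at 80% of the reservation so the page arrives before
    # the ceiling is hit, not after.
    Threshold=0.8 * PROVISIONED_IOPS * PERIOD_SECONDS,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-page"],
)
```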
Timeline
04/30 6:24 am - A high error rate in the Biz Rules Worker triggers a PagerDuty call.
04/30 6:25 - 6:42 am - Additional services start to alert.
04/30 6:35 am - Per the TSE team, the first customer report of system issues comes in.
04/30 6:43 am - Engineers join the war room Zoom to investigate.
04/30 7:02 am - Engineers determine we’ve reached hardware limitations on our Prod2 MongoDB cluster (see the diagnostic sketch after this timeline).
04/30 7:10 am - Resource changes are applied to the Prod2 MongoDB cluster (see the scaling sketch below).
04/30 7:20 am - Status page is updated to “Investigating degraded performance.”
04/30 7:23 am - Customers start reporting that the issue has been resolved.
04/30 8:15 am - We announce that the issue has been resolved and continue monitoring.
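As a rough illustration of the 7:02 am diagnosis, the check below reads standard MongoDB serverStatus metrics to distinguish I/O saturation from an application bug. The connection string is hypothetical; the serverStatus fields themselves are standard MongoDB metrics.

```python
# Sketch of a saturation check against the affected cluster.
# The host below is a hypothetical placeholder.
from pymongo import MongoClient

client = MongoClient("mongodb://prod2.internal:27017")
status = client.admin.command("serverStatus")

# Operations queued waiting on the storage engine: a sustained backlog
# here while opcounters keep climbing points at I/O saturation rather
# than an application-level bug.
queue = status["globalLock"]["currentQueue"]
print("queued readers:", queue["readers"], "queued writers:", queue["writers"])

# Cumulative operation counters since startup; diff two samples over a
# known interval to get ops/sec and compare against the reserved IOPS.
print("opcounters:", status["opcounters"])
```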
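And one plausible form of the 7:10 am resource change, under the same hypothetical EBS assumption as the alarm sketch above: raising provisioned IOPS in place on the cluster's data volumes. The volume IDs and new IOPS figure are made up; EBS volume modifications apply within minutes, which is consistent with the roughly ten-minute gap before customers reported recovery.

```python
# Sketch of raising provisioned IOPS on the cluster's data volumes.
# Volume IDs and the target IOPS are hypothetical; this assumes a
# volume type that supports provisioned IOPS (io1/io2/gp3).
import boto3

ec2 = boto3.client("ec2")
for volume_id in ["vol-0123456789abcdef0", "vol-0fedcba9876543210"]:
    resp = ec2.modify_volume(VolumeId=volume_id, Iops=6000)
    print(volume_id, resp["VolumeModification"]["ModificationState"])
```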
Lessons/Improvements