Investigating degraded performance
Incident Report for Kustomer
Postmortem

Summary

On Saturday, April 30th, 2022 at 6:23 am EDT, our prod2 MongoDB cluster reached its hardware limits, which caused cascading failures throughout our system. After identifying MongoDB as the source of the problem, we scaled out the cluster, and the system recovered once the changes were applied.

Root Cause

During a cutover of traffic to this cluster, IOPS against the database rose sharply and exceeded the capacity we had reserved.
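
As a rough illustration of this failure mode, the sketch below classifies observed IOPS against a reserved limit. The function name `iops_status`, the 80% warning ratio, and the sample figures are hypothetical; they are not taken from our monitoring configuration or from the actual prod2 numbers.

```python
def iops_status(observed_iops: float, reserved_iops: float, warn_ratio: float = 0.8) -> str:
    """Classify IOPS utilization relative to a reserved limit.

    All thresholds here are illustrative assumptions, not production values.
    """
    if reserved_iops <= 0:
        raise ValueError("reserved_iops must be positive")
    utilization = observed_iops / reserved_iops
    if utilization >= 1.0:
        return "exceeded"  # demand is at or above what was reserved
    if utilization >= warn_ratio:
        return "warning"   # little headroom left before the limit
    return "ok"


# Hypothetical figures: a cutover that doubles load against a 1000 IOPS reservation.
before_cutover = iops_status(600, 1000)
after_cutover = iops_status(1200, 1000)
```

A check of this kind is only useful if the warning state fires before a planned cutover, so that capacity can be raised ahead of the traffic shift.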

Timeline

04/30 6:24 am - Biz rules Worker high error rate triggers a PagerDuty call.

04/30 6:25 - 6:42 am - Additional services start to alert.

04/30 6:35 am - Per TSE team, the first customer reports running into various system issues.

04/30 6:43 am - Engineers join the war room Zoom call to investigate.

04/30 7:02 am - Engineers determine we’ve reached hardware limitations with our Prod2 MongoDB cluster.

04/30 7:10 am - Prod2 MongoDB cluster resource changes are applied.

04/30 7:20 am - Status page updated with "Investigating degraded performance".

04/30 7:23 am - Customers start reporting that the issue has been resolved.

04/30 8:15 am - Announce that the issue has been resolved and we continue monitoring.
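
The elapsed time between the start of the incident and each milestone can be worked out from the timestamps above. The sketch below does that arithmetic, assuming the 6:23 am start time from the Summary; the event labels are shorthand introduced here for illustration.

```python
from datetime import datetime

FMT = "%H:%M"
incident_start = datetime.strptime("06:23", FMT)  # start time stated in the Summary

# Milestones copied from the timeline (all times are on 04/30, morning).
events = {
    "first page": "06:24",
    "engineers join war room": "06:43",
    "cause identified": "07:02",
    "resource changes applied": "07:10",
    "customers report recovery": "07:23",
}


def minutes_since_start(clock_time: str) -> int:
    """Return whole minutes between the incident start and the given time."""
    delta = datetime.strptime(clock_time, FMT) - incident_start
    return int(delta.total_seconds() // 60)


elapsed = {name: minutes_since_start(t) for name, t in events.items()}

for name, minutes in elapsed.items():
    print(f"{name}: +{minutes} min")
```

By this arithmetic, the first page arrived 1 minute after the incident began, the resource changes were applied at 47 minutes, and customers began reporting recovery at 60 minutes.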

Lessons/Improvements

  • Added auto-scaling on the prod2 MongoDB cluster and improved monitoring to catch these issues sooner
  • Added scheduled auto-scaling for prod2
  • Provisioned additional capacity for prod2-alb-api
  • Adjusted scaling policies for affected services
  • Enabled new Redis clusters in prod2 to handle additional capacity
Posted Jun 08, 2022 - 18:45 EDT

Resolved
This incident has been resolved.
Posted Apr 30, 2022 - 08:15 EDT
Monitoring
A fix has been implemented and the issue has been resolved.
Posted Apr 30, 2022 - 07:52 EDT
Identified
The issue has been identified and a fix is being implemented.
Posted Apr 30, 2022 - 07:48 EDT
Investigating
We identified an issue related to platform latency. While latency has subsided, we continue to monitor the situation.
Posted Apr 30, 2022 - 07:20 EDT
This incident affected: Prod2 (EU) (Channel - Chat).