Beginning at 10:00 AM EST on October 14, 2021, Kustomer engineering was alerted that one of our database shards had entered a bad state. At a high level, our third-party cloud database (MongoDB) was unable to process a high volume of write transactions on our message data during peak traffic hours. The issue lasted approximately 2 hours.
During this time, inbound and outbound messages generated within the Kustomer platform entered a scheduled state. These messages were held until the database issue was resolved, at which point they were redriven. The Kustomer system also experienced elevated latency, and some conversations and customer timelines were temporarily inaccessible.
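For illustration, a hold-and-redrive flow like this is often modeled with a status field on each message that a background worker later sweeps. The sketch below is a minimal, hypothetical version of that pattern; the collection name, status values, and `deliver` helper are assumptions for illustration, not Kustomer's actual schema or code.

```typescript
import { MongoClient } from "mongodb";

// Hypothetical redrive worker: find messages parked in a "scheduled" state
// and attempt delivery again once the database is healthy.
async function redriveScheduledMessages(client: MongoClient): Promise<void> {
  const messages = client.db("app").collection("messages");
  const cursor = messages.find({ status: "scheduled" });
  for await (const msg of cursor) {
    await deliver(msg); // placeholder for the real send path (email, chat, etc.)
    await messages.updateOne({ _id: msg._id }, { $set: { status: "sent" } });
  }
}

// Stand-in for the actual delivery logic; assumed for this sketch.
async function deliver(msg: unknown): Promise<void> {
  void msg;
}
```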
This issue stemmed from a bug in MongoDB's server-side transaction logic.
At approximately 11:30 AM EST, the unhealthy shard was restored and event redriving began. All items were redriven within 2 hours, with residual events completed by 6:18 PM EST. As a short-term fix, we worked with MongoDB to reduce the transaction timeout value so that stuck transactions would be aborted more quickly, which alleviated some of the stress on the cluster. As a medium-term solution, we updated our code to stop using transactions for message writes, so that a similar failure could not bring down message service again.
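For reference, the transaction timeout in question corresponds to MongoDB's `transactionLifetimeLimitSeconds` server parameter (default 60 seconds), which bounds how long a multi-document transaction can run before it is aborted. The sketch below shows how such a change could be applied with the Node.js driver; the 30-second value is illustrative, not the exact figure we used.

```typescript
import { MongoClient } from "mongodb";

// Lower the server-wide transaction lifetime so stuck transactions are
// aborted sooner. On a sharded cluster this parameter is set on the shard
// replica set members; this sketch shows a single setParameter call.
async function reduceTransactionTimeout(uri: string): Promise<void> {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    await client.db("admin").command({
      setParameter: 1,
      transactionLifetimeLimitSeconds: 30, // illustrative value
    });
  } finally {
    await client.close();
  }
}
```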
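To illustrate the medium-term change, the sketch below contrasts a transactional write path with one built on independent single-document writes; each single-document write is atomic on its own in MongoDB, so no transaction machinery is involved. The collection and field names here are hypothetical, not our actual schema.

```typescript
import { MongoClient, ObjectId } from "mongodb";

interface NewMessage {
  conversationId: ObjectId;
  body: string;
}

// Before: message insert and conversation update wrapped in a transaction,
// which is vulnerable to the stuck-transaction failure described above.
async function saveMessageTransactional(client: MongoClient, msg: NewMessage) {
  const session = client.startSession();
  try {
    await session.withTransaction(async () => {
      const db = client.db("app");
      await db
        .collection("messages")
        .insertOne({ ...msg, createdAt: new Date() }, { session });
      await db.collection("conversations").updateOne(
        { _id: msg.conversationId },
        { $set: { lastMessageAt: new Date() } },
        { session },
      );
    });
  } finally {
    await session.endSession();
  }
}

// After: two independent writes. Each is atomic at the document level, at
// the cost of a brief window where the conversation is not yet updated.
async function saveMessage(client: MongoClient, msg: NewMessage) {
  const db = client.db("app");
  await db.collection("messages").insertOne({ ...msg, createdAt: new Date() });
  await db.collection("conversations").updateOne(
    { _id: msg.conversationId },
    { $set: { lastMessageAt: new Date() } },
  );
}
```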