Beginning at 10:00 AM EST on October 14, 2021, Kustomer engineering was alerted that one of our database shards had entered a bad state. At a high level, our third-party cloud database (MongoDB) was unable to process a high volume of write transactions on our message data during peak traffic hours. The issue lasted approximately 2 hours.
During this time, inbound and outbound messages generated within the Kustomer platform entered a scheduled state. These messages were held until the database issue was resolved, at which point they were redriven. The Kustomer system also experienced elevated latency, and some conversations and customer timelines were temporarily inaccessible.
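For illustration, a hold-and-redrive flow like this is often modeled with a status field on each message that a background worker later sweeps. The sketch below is a minimal, hypothetical version of that pattern; the collection name, status values, and `deliver` helper are assumptions for illustration, not Kustomer's actual schema or code.

```typescript
import { MongoClient } from "mongodb";

// Hypothetical redrive worker: find messages parked in a "scheduled" state
// and attempt delivery again once the database is healthy.
async function redriveScheduledMessages(client: MongoClient): Promise<void> {
  const messages = client.db("app").collection("messages");
  const cursor = messages.find({ status: "scheduled" });
  for await (const msg of cursor) {
    await deliver(msg); // placeholder for the real send path (email, chat, etc.)
    await messages.updateOne({ _id: msg._id }, { $set: { status: "sent" } });
  }
}

// Stand-in for the actual delivery logic; assumed for this sketch.
async function deliver(msg: unknown): Promise<void> {
  void msg;
}
```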
This issue stemmed from a bug in MongoDB's server-side transaction logic.
At approximately 11:30 AM EST, the unhealthy shard was restored and event redriving began. All items were redriven within 2 hours, with residual events completed by 6:18 PM EST. As a short-term fix, we worked with MongoDB to reduce the transaction timeout value so that stuck transactions would be aborted more quickly, which alleviated some of the stress on the cluster. As a medium-term solution, we updated our code to stop using transactions for message writes, so that a similar failure could not bring down message service again.
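For reference, the transaction timeout in question corresponds to MongoDB's `transactionLifetimeLimitSeconds` server parameter (default 60 seconds), which bounds how long a multi-document transaction can run before it is aborted. The sketch below shows how such a change could be applied with the Node.js driver; the 30-second value is illustrative, not the exact figure we used.

```typescript
import { MongoClient } from "mongodb";

// Lower the server-wide transaction lifetime so stuck transactions are
// aborted sooner. On a sharded cluster this parameter is set on the shard
// replica set members; this sketch shows a single setParameter call.
async function reduceTransactionTimeout(uri: string): Promise<void> {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    await client.db("admin").command({
      setParameter: 1,
      transactionLifetimeLimitSeconds: 30, // illustrative value
    });
  } finally {
    await client.close();
  }
}
```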
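To illustrate the medium-term change, the sketch below contrasts a transactional write path with one built on independent single-document writes; each single-document write is atomic on its own in MongoDB, so no transaction machinery is involved. The collection and field names here are hypothetical, not our actual schema.

```typescript
import { MongoClient, ObjectId } from "mongodb";

interface NewMessage {
  conversationId: ObjectId;
  body: string;
}

// Before: message insert and conversation update wrapped in a transaction,
// which is vulnerable to the stuck-transaction failure described above.
async function saveMessageTransactional(client: MongoClient, msg: NewMessage) {
  const session = client.startSession();
  try {
    await session.withTransaction(async () => {
      const db = client.db("app");
      await db
        .collection("messages")
        .insertOne({ ...msg, createdAt: new Date() }, { session });
      await db.collection("conversations").updateOne(
        { _id: msg.conversationId },
        { $set: { lastMessageAt: new Date() } },
        { session },
      );
    });
  } finally {
    await session.endSession();
  }
}

// After: two independent writes. Each is atomic at the document level, at
// the cost of a brief window where the conversation is not yet updated.
async function saveMessage(client: MongoClient, msg: NewMessage) {
  const db = client.db("app");
  await db.collection("messages").insertOne({ ...msg, createdAt: new Date() });
  await db.collection("conversations").updateOne(
    { _id: msg.conversationId },
    { $set: { lastMessageAt: new Date() } },
  );
}
```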