We would like to share more details about the events that occurred with Memsource between 03:05 PM CET and 05:30 PM CET on November 29th, 2021, which led to a gradual outage of the Project Management service, and about what Memsource engineers are doing to prevent these issues from happening again.
03:05 PM CET: Automated monitoring triggers an alert indicating a large number of asynchronous requests in the processing queue. Memsource engineers start investigating the problem.
The initial investigation suggests that there is an unusually large number of incoming asynchronous requests sent by various customers at the same time; requests are being processed correctly. Engineers monitor the situation.
03:22 PM CET: The number of received asynchronous requests keeps increasing, which slows down request processing for Memsource users.
03:36 PM CET: A customer’s integration is identified as the source of the large number of requests. Memsource engineers cut off the integration by disabling the API endpoints for the user. Memsource support agents contact the customer.
The number of asynchronous requests is stabilized. The system is operational and queued requests are being processed. The File Processing service is reported as being slower than usual by some customers.
04:00 PM CET: Memsource engineers identify that the queued requests created by the cut-off customer integration are duplicates, and that their slow processing causes a high database load. Memsource engineers manually terminate the problematic requests to unblock the processing of other users’ requests and speed up the File Processing service.
04:59 PM CET: The high database load slows down the processing of user requests, which leads to the exhaustion of the database connection pool. The Memsource service becomes unavailable for a short period of time.
05:03 PM CET: The database connections are freed and all services are slowly returning to normal.
05:06 PM CET: The Memsource service is fully operational and responsive.
Root Cause
A large number of asynchronous requests sent in a short period of time by a customer’s broken integration significantly degraded the performance of the File Processing service. Memsource engineers manually terminated the duplicate asynchronous requests sent by the broken integration. Finalizing this large number of terminated requests caused a high database load and exhausted the connection pool, which led to a short outage of the Memsource service.
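The mechanism described above can be illustrated with a toy model. The sketch below is not Memsource's actual system: the function name, the pool size, the tick-based timing, and the idea of capping concurrent finalizations are all assumptions made purely for illustration. It only shows how a burst of long-running finalizations can occupy every connection in a fixed-size pool and leave none for regular user requests.

```python
def simulate(pool_size, finalizations, hold_ticks,
             max_concurrent_finalizations, user_requests_per_tick, ticks):
    """Toy discrete-time model of a fixed-size database connection pool.

    All parameters are hypothetical. Returns a tuple:
    (user requests that found no free connection, finalizations never started).
    """
    in_use = []              # remaining ticks for each connection held by a finalization
    pending = finalizations  # terminated requests still waiting to be finalized
    rejected = 0

    for _ in range(ticks):
        # Release connections whose finalization completed on the previous tick.
        in_use = [t - 1 for t in in_use if t > 1]

        # Start as many finalizations as the concurrency cap and the pool allow.
        can_start = min(pending,
                        max_concurrent_finalizations - len(in_use),
                        pool_size - len(in_use))
        can_start = max(can_start, 0)
        in_use.extend([hold_ticks] * can_start)
        pending -= can_start

        # Each user request needs one free connection for the current tick.
        free = pool_size - len(in_use)
        rejected += max(0, user_requests_per_tick - free)

    return rejected, pending


# Finalizations may take the whole pool: user requests are starved.
uncapped = simulate(pool_size=10, finalizations=100, hold_ticks=5,
                    max_concurrent_finalizations=10,
                    user_requests_per_tick=3, ticks=200)

# Finalizations limited to 4 connections: user requests keep being served.
capped = simulate(pool_size=10, finalizations=100, hold_ticks=5,
                  max_concurrent_finalizations=4,
                  user_requests_per_tick=3, ticks=200)

print(uncapped)  # (150, 0)
print(capped)    # (0, 0)
```

In this model the capped run takes longer to work through the backlog, but regular requests always find a free connection. Whether a similar trade-off applies to a real service depends on details that this report does not cover.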
Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and determine how to make changes that improve our services and processes.