14 October 2021
Introduction
We would like to share more details about the events that occurred at Memsource between 12:58 CEST and 14:42 CEST on October 14th, 2021, which led to a gradual outage of the Project Management component, and about what Memsource engineers are doing to prevent these issues from happening again.
Timeline
12:58 CEST: Automated monitoring triggers an alert indicating slow response times of the Project Management component. Memsource engineers start investigating the problem.
13:02 CEST: Slow Project Management affects the responsiveness of other Memsource components.
13:18 CEST: High database load is identified as the cause of slow responsiveness; some servers are reconfigured to disable unnecessary database requests to decrease the database load.
13:50 CEST: Memsource components are returning to normal. Memsource engineers disable some servers to speed up the recovery of the component. Memsource is operational but may be slower for some users.
14:05 CEST: Memsource engineers commence a controlled restart of some servers to speed up their recovery.
14:36 CEST: All servers are recovered; responsiveness of all Memsource components returns to normal.
Root Cause
The database server was close to its configured global query capacity when incoming user requests triggered many demanding database queries. This caused the database server to run out of available capacity, which slowed the processing of requests and exhausted the connection pool for a short period of time. Requests that accumulated during this time were processed as database capacity was gradually freed, which slowed the recovery of the system. The system recovered once all pending requests had been processed.
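The slow recovery described above can be illustrated with a toy queueing model. This is only a sketch under stated assumptions, not a description of the actual Memsource system: the capacity, load figures, and function names below are hypothetical and chosen purely for illustration. It shows that when a database runs close to capacity, a short burst of expensive queries leaves a backlog that drains only as fast as the remaining headroom allows.

```python
import math


def simulate_backlog(capacity, arrivals):
    """Return the pending work left after each tick.

    capacity: work units the database can process per tick (hypothetical).
    arrivals: work units arriving in each tick (hypothetical).
    """
    pending = 0
    history = []
    for arriving in arrivals:
        # Work that cannot be processed in this tick carries over to the next.
        pending = max(0, pending + arriving - capacity)
        history.append(pending)
    return history


def ticks_to_drain(capacity, baseline, backlog):
    """Ticks needed to clear a backlog while baseline load keeps arriving.

    Returns None when there is no spare capacity, so the backlog never drains.
    """
    headroom = capacity - baseline
    if headroom <= 0:
        return None
    return math.ceil(backlog / headroom)


# Illustrative numbers only: baseline load of 9 units per tick against a
# capacity of 10, with a three-tick burst of 30 units per tick.
arrivals = [9, 9, 30, 30, 30] + [9] * 10
history = simulate_backlog(capacity=10, arrivals=arrivals)

print(history[:5])                 # backlog building up during the burst
print(ticks_to_drain(10, 9, 60))   # recovery time with little headroom
print(ticks_to_drain(10, 5, 60))   # recovery time after reducing baseline load
```

In this model, lowering the baseline load shortens the recovery time considerably, which is consistent with the step in the timeline where unnecessary database requests were disabled to reduce database load.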
Conclusion
Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine which changes will improve our services and processes.