We would like to share more details about the events that occurred with Memsource over several days, which led to a gradual outage of the Project Management service, and what Memsource engineers are doing to prevent these issues from happening again.
10:55 AM CET: Automated monitoring triggers an alert indicating slow response times of the Project Management service. Memsource engineers start investigating the problem.
It is immediately recognized as a high-load database problem impacting all Memsource services.
11:07 AM CET: Affected servers are restarted to free up used memory and reclaim DB connections.
11:15 AM CET: Memsource services are returning to normal. Memsource engineers disable some servers to speed up the service recovery. Memsource is operational but may be slower for some users.
11:27 AM CET: A runaway script from one customer is identified as the source of problems. The problematic API endpoint is cut off for the specific user and the customer is contacted.
11:57 AM CET: All servers are recovered and Memsource service responsiveness returns to normal.
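The mitigation above relies on cutting off a single API endpoint for a single user while everyone else keeps working. One common way to do this is a per-user block list consulted before each request is dispatched. The following is a minimal sketch under that assumption; the user IDs, endpoint paths, and function names are hypothetical, not Memsource's actual implementation:

```python
# Hypothetical per-user endpoint block list (not Memsource's actual code).
# Maps user_id -> set of endpoint paths that are cut off for that user.
BLOCKED = {}

def block_endpoint(user_id, endpoint):
    """Cut off one API endpoint for one user."""
    BLOCKED.setdefault(user_id, set()).add(endpoint)

def is_allowed(user_id, endpoint):
    """Checked before dispatching each request."""
    return endpoint not in BLOCKED.get(user_id, set())

# The runaway client is blocked on the problematic endpoint only;
# other endpoints and other users are unaffected.
block_endpoint("user-123", "/api/v1/projects/search")
assert not is_allowed("user-123", "/api/v1/projects/search")
assert is_allowed("user-123", "/api/v1/jobs")
assert is_allowed("user-456", "/api/v1/projects/search")
```

The point of the design is that the block is surgical: service for the offending customer degrades on one endpoint, rather than the whole platform degrading for everyone.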
06:55 AM CET: Automated monitoring triggers alerts indicating high DB usage and slow response times.
06:58 AM CET: Memsource on-duty engineers quickly recognize a similar pattern and investigate the queries and endpoints overloading the DB.
07:15 AM CET: The API user overloading the system is identified and the API is manually disabled for that user. Affected servers are restarted to reclaim memory and DB connections. Some users are still impacted by slow response times and elevated error rates.
07:25 AM CET: Response times and error rates are back to normal levels.
03:38 AM CET: Automated monitoring triggers alerts indicating similar problems: slow response times, an elevated response error rate, and an overloaded database.
03:46 AM CET: Memsource on-duty engineers start investigating the issue.
03:53 AM CET: Running queries causing the database overload are cancelled, the respective API calls are blocked for the related user, and the customer is contacted. Database load drops immediately; response times and error rates fall back to acceptable levels for most users.
06:27 AM CET: Automated monitoring raises alerts for an elevated response error rate and slow response times again.
06:32 AM CET: Memsource on-duty engineers identify the problematic queries and API endpoint, and the user is blocked. Most requests now succeed, but a small number of users still experience slow response times. Servers are removed from load balancing and gradually restarted.
06:56 AM CET: All Memsource services are running with usual response times.
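Several of the steps above involve spotting and cancelling queries that have been running far too long. On PostgreSQL, for example, an operator would typically inspect `pg_stat_activity` and call `pg_cancel_backend(pid)`; the sketch below shows only the generic watchdog logic, with a hypothetical time budget and in-memory query list rather than any real database integration:

```python
# Hypothetical watchdog sketch: find queries that have exceeded a time
# budget so an operator (or an automated job) can cancel them.
QUERY_BUDGET_SECS = 30.0

def overdue_queries(active, now):
    """active: list of (query_id, started_at) pairs, times in seconds.

    Returns the IDs of queries running at least QUERY_BUDGET_SECS.
    """
    return [qid for qid, started in active
            if now - started >= QUERY_BUDGET_SECS]

# q1 has been running 31 s (over budget), q2 only 6 s.
active = [("q1", 100.0), ("q2", 125.0)]
print(overdue_queries(active, now=131.0))  # -> ['q1']
```

In a real deployment the `active` list would come from the database's own statistics views, and cancelling would be a follow-up call per returned ID.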
Root Cause
In all cases the root cause was similar: project-related queries overloaded the capacity of the database server. The specific queries, users, and API endpoints differed between incidents.
Shortly after the number of problematic requests crossed a critical threshold, the database server reached maximum capacity and started blocking other queries, leading to a cascade of timeouts and errors and eventually to degraded performance across all Memsource components, since the majority of operations depend on the project database.
The queries overwhelmed the database server for a combination of reasons: an unexpected number of parallel queries, suboptimal queries, or a dramatic change in the size of data in specific tables that left the database server unable to plan query execution effectively.
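One standard defence against a single client saturating a shared database with parallel queries is a per-user concurrency cap: once a user holds a fixed number of connections, further requests from that user are rejected instead of queuing against the shared pool. A minimal sketch, with a hypothetical limit and class name (the source does not say Memsource uses this exact mechanism):

```python
import threading

# Hypothetical per-user concurrency cap. One client may hold at most
# MAX_PARALLEL in-flight queries; extra requests are rejected so the
# shared connection pool stays available to other users.
MAX_PARALLEL = 4

class PerUserLimiter:
    def __init__(self, limit=MAX_PARALLEL):
        self._limit = limit
        self._inflight = {}          # user_id -> current in-flight count
        self._lock = threading.Lock()

    def acquire(self, user_id):
        """Return True and count the query, or False if at the cap."""
        with self._lock:
            if self._inflight.get(user_id, 0) >= self._limit:
                return False
            self._inflight[user_id] = self._inflight.get(user_id, 0) + 1
            return True

    def release(self, user_id):
        """Call when the query finishes, freeing one slot."""
        with self._lock:
            self._inflight[user_id] -= 1

limiter = PerUserLimiter()
grants = [limiter.acquire("bulk-client") for _ in range(6)]
print(grants)  # -> [True, True, True, True, False, False]
```

With a cap like this in place, a runaway script fails fast at the application layer instead of driving the database to the point where every customer's queries time out.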
Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly in the coming days and weeks to improve their understanding of the incident and to determine how to make changes that improve our services and processes.