We would like to share more details about the events that occurred at Memsource between 8:25 and 9:45 CEST on October 7th, 2021, which led to a partial performance degradation of all Memsource components, and about what Memsource engineers are doing to prevent these issues from happening again.
Wed 6th October 15:00 CEST: A new version of the database cleaner is deployed to production servers.
Wed 6th October 16:00 CEST: The database cleaner periodically runs new complex queries which start consuming available burst IO capacity on one of the database volumes.
Thu 7th October 7:41 CEST: Deployment of a new version of the Memsource service starts, invalidating the local cache and subsequently increasing the number of database IO operations.
Thu 7th October 8:20 CEST: Available burst IO capacity is completely consumed on one of the database volumes; only the baseline IO capacity is now available for processing incoming database queries.
Thu 7th October 8:25 CEST: The database load is too high, increasing the response time of user requests and slowing down some parts of the Memsource service. The service becomes unavailable for some users.
Thu 7th October 8:26 CEST: Automated monitoring starts reporting the slow response of the system. Memsource engineers start looking for the cause of the problem.
Thu 7th October 8:48 CEST: Some servers are reconfigured to disable unnecessary database requests to decrease the database load; Memsource engineers continue looking for the root cause.
Thu 7th October 9:16 CEST: The database cleaner and consumed burst IO capacity are identified as the root cause.
Thu 7th October 9:18 CEST: The database cleaner is paused and IO capacity on the disk volume is increased; the database load starts decreasing.
Thu 7th October 9:28 CEST: Database load and system response time return to normal.
Thu 7th October 9:40 CEST: The incident is resolved.
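The distinction between baseline and burst IO capacity that runs through this timeline can be illustrated with a toy token-bucket model. All numbers below are hypothetical; the real limits depend on the cloud provider and volume size. Credits accrue at the baseline rate, sustained load above the baseline drains the bucket, and once it is empty only the baseline rate is served:

```python
# A minimal token-bucket model of burstable volume IO.
# BASELINE_IOPS and MAX_CREDITS are illustrative values only.
BASELINE_IOPS = 300      # credits earned per second (baseline rate)
MAX_CREDITS = 1_000_000  # size of the burst bucket

def step(credits, requested_iops):
    """Advance the model by one second; return (new_credits, served_iops)."""
    credits = min(MAX_CREDITS, credits + BASELINE_IOPS)
    if credits >= requested_iops:
        return credits - requested_iops, requested_iops
    # Bucket empty: only the baseline rate is served.
    return 0, BASELINE_IOPS

credits = MAX_CREDITS
for second in range(2000):
    # Sustained load well above the baseline, as the cleaner produced.
    credits, served = step(credits, 1000)

# After the bucket drains, throughput drops to the baseline
# and any excess requests queue up.
print(credits, served)  # → 0 300
```

In this model the bucket drains at 700 credits per second (1000 requested minus 300 earned), so it survives the burst for a while and then abruptly caps out, which matches the pattern of a cleaner running overnight before the degradation became visible the next morning.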
A new version of the database cleaner ran complex queries that created large temporary tables on a database volume dedicated to such tables. These queries became increasingly IO-intensive, which gradually consumed all of the volume's available burst IO capacity. No automated alert was set up for an incident of this type. The exhaustion of burst IO capacity was accelerated when the local cache was cleared during deployment of a new version of the system. The result of this chain of events was a backlog of queued database queries that could not be processed fast enough, degrading the service response time.
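The missing alert is the kind of check that is simple once the right metric is watched. As a minimal sketch, assuming a metric that reports the remaining burst balance as a percentage (the metric name, sampling, and threshold here are all hypothetical, not Memsource's actual monitoring setup):

```python
def burst_balance_alerts(samples, threshold_pct=20.0):
    """Return the timestamps at which the remaining burst balance
    first dropped below the threshold.

    samples: list of (timestamp, remaining_burst_pct) pairs,
    e.g. periodic readings of a volume's burst-balance metric.
    """
    alerts = []
    below = False
    for ts, pct in samples:
        if pct < threshold_pct and not below:
            alerts.append(ts)   # fire once per crossing
            below = True
        elif pct >= threshold_pct:
            below = False       # re-arm after recovery
    return alerts

# Hypothetical readings over the night of the incident.
samples = [("16:00", 95.0), ("22:00", 60.0), ("04:00", 30.0), ("08:20", 0.0)]
print(burst_balance_alerts(samples))                     # → ['08:20']
print(burst_balance_alerts(samples, threshold_pct=40.0))  # → ['04:00']
```

With a higher threshold, the alert fires hours before the capacity runs out, turning an outage into a routine overnight page.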
Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to deepen their understanding of the incident and determine how to improve our services and processes.