We would like to share more details about the events that occurred with Phrase TMS between 11:40 PM CET and 01:47 AM CET on March 23rd, 2023, which led to a gradual outage of all Phrase TMS (EU) components, and about what Phrase engineers are doing to prevent these issues from happening again.
11:40 PM CET: The first alerts reach the engineers on duty and are followed by an immediate outage of TMS. Engineers start investigating the issues and assessing the situation.
11:59 PM CET: TMS services recover. Engineers continue to assess the situation and analyze the presumed cause.
12:51 AM CET: A second outage of TMS occurs, caused by the database not yet having fully recovered.
01:50 AM CET: TMS recovers and preventive measures are applied.
02:25 AM CET: The incident is officially declared resolved and all TMS services are confirmed to be operating normally.
The database was overloaded by a large number of complex queries within a very short timeframe, which led to the outage of connected services. The overloaded database server also impacted the networking stack of its operating system. Shortly after the first outage, this network malfunction led to the second outage.
Finally, we want to apologize. We know how critical our services are to your business. Phrase as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Phrase engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine which changes will improve our services and processes.