Disruption of All Phrase TMS (EU) Components Between 0:52 AM CET and 1:47 AM CET
Incident Report for Phrase
Postmortem

Introduction

We would like to share more details about the events that affected Phrase TMS between 11:40 PM CET on March 22nd and 01:47 AM CET on March 23rd, 2023, which led to a gradual outage of all Phrase TMS (EU) components, and about what Phrase engineers are doing to prevent these issues from happening again.

Timeline

11:40 PM CET: First alerts reach the engineers on duty and are followed by an immediate outage of TMS. Engineers start investigating the issues and assessing the situation.

11:59 PM CET: TMS services recovered and engineers continue to assess the situation and analyze the presumed cause.

00:51 AM CET: Second outage of TMS, caused by the database not having fully recovered.

01:50 AM CET: TMS recovered and preventive measures are applied.

02:25 AM CET: Incident is officially declared resolved and all TMS services are confirmed to be operating normally.

Root Cause

The database was overloaded by a large number of complex queries within a very short timeframe, leading to the outage of connected services. The overload on the database server also impaired the networking stack of its operating system. Shortly after the first outage, this network malfunction led to the second outage.

Actions to Prevent Recurrence

  • The maximum number of database connections will be adjusted to prevent an overload.
  • Existing rate limiting mechanisms will be reviewed and updated.
  • Monitoring will be extended to reveal problems within the networking stack.
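
For readers who want a concrete picture of the first two measures, the following is a minimal sketch of how a connection cap and a rate limit can be combined in front of a database. It is an illustration only and does not describe Phrase's actual implementation: the names TokenBucket and GuardedDatabase, the run_query callable, and all limits are hypothetical, and the token-bucket algorithm is simply one common way to rate-limit requests.

```python
import threading
import time


class TokenBucket:
    """Allow about `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()
        self._lock = threading.Lock()

    def try_acquire(self):
        """Take one token if available; never blocks."""
        with self._lock:
            now = self._clock()
            # Refill in proportion to elapsed time, capped at capacity.
            elapsed = now - self._last
            self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
            self._last = now
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return True
            return False


class GuardedDatabase:
    """Hypothetical wrapper: rejects work instead of overloading the database."""

    def __init__(self, run_query, max_connections, bucket):
        self._run_query = run_query  # assumed callable that executes a query
        self._slots = threading.BoundedSemaphore(max_connections)
        self._bucket = bucket

    def execute(self, query):
        # Rate limit first, so a burst of queries is shed early.
        if not self._bucket.try_acquire():
            raise RuntimeError("rate limit exceeded")
        # Then cap the number of queries running at the same time.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("connection limit reached")
        try:
            return self._run_query(query)
        finally:
            self._slots.release()
```

A caller would wrap its query function once, for example GuardedDatabase(run_query, max_connections=50, bucket=TokenBucket(rate=200, capacity=400)), and treat the raised errors as a signal to retry later. Failing fast in this way trades a few rejected requests for keeping the database responsive under a sudden burst of complex queries.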

Conclusion

Finally, we want to apologize. We know how critical our services are to your business. Phrase as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Phrase engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine which changes will improve our services and processes.

Posted Mar 24, 2023 - 10:29 CET

Resolved
This incident has been resolved.
Posted Mar 23, 2023 - 02:23 CET
Update
We are continuing to investigate this issue.
Posted Mar 23, 2023 - 01:47 CET
Investigating
We are currently investigating this issue related to the unavailability of all Phrase TMS (EU) components.
Posted Mar 23, 2023 - 01:15 CET
This incident affected: Phrase TMS (EU) (API, CAT web editor, File processing, Machine translation, Project management, Term base, Translation memory).