Performance Disruption of Phrase TMS (EU) Machine Translation Component between 11:40 and 12:30 CET
Incident Report for Phrase
Postmortem

Introduction

We would like to share more details about the events that occurred with Phrase between 11:40 AM CET and 12:30 PM CET on March 7th, 2023, which led to a partial outage of the Machine Translation component, and about what Phrase engineers are doing to prevent these issues from happening again.

Timeline

11:43 AM CET: The automatic alert arrives, notifying the engineer on duty of an increased error rate in the Machine Translation component. The engineer on duty identifies the working threads of the component as being blocked.

11:56 AM CET: The service is restarted, but it returns to the same blocked state. The responsible engineers continue looking for the root cause.

12:20 PM CET: The previous version of the service is deployed, but it brings only a temporary improvement.

12:27 PM CET: The message queue in RabbitMQ used by the Machine Translation component is purged because it contains a large number of messages. This allows the Machine Translation component to publish messages successfully and proceed with the waiting requests.

Root Cause

The RabbitMQ connector was unexpectedly re-creating a message queue that is no longer being read from. This degraded the performance of RabbitMQ and prevented the Machine Translation component from publishing new messages, thus exhausting its resources.
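
The failure mode described above, a queue that keeps receiving messages while nothing reads from it, can be illustrated with a small check. This is a minimal sketch and not the monitoring Phrase actually uses: the function name, the threshold, and the queue names are hypothetical, and the `messages` and `consumers` fields are assumed to resemble the per-queue statistics a broker's management interface typically reports.

```python
from typing import Dict, List


def find_unconsumed_queues(queues: List[Dict], backlog_threshold: int = 1000) -> List[str]:
    """Return names of queues that hold a backlog but have no consumers.

    Each entry is assumed (hypothetically) to carry a 'name', a 'messages'
    count, and a 'consumers' count.
    """
    flagged = []
    for q in queues:
        has_no_reader = q.get("consumers", 0) == 0
        has_backlog = q.get("messages", 0) >= backlog_threshold
        if has_no_reader and has_backlog:
            flagged.append(q["name"])
    return flagged


# Hypothetical snapshot; queue names and counts are invented for illustration.
snapshot = [
    {"name": "mt.requests", "messages": 12, "consumers": 4},
    {"name": "mt.obsolete", "messages": 250_000, "consumers": 0},
    {"name": "mt.idle", "messages": 0, "consumers": 0},
]

print(find_unconsumed_queues(snapshot))  # prints ['mt.obsolete']
```

A check of this kind flags only queues that are both unread and growing, so an empty queue without consumers (such as `mt.idle` in the sample) is not reported.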

Actions to Prevent Recurrence

  • The RabbitMQ connector was restarted, which applied a new configuration so that the obsolete message queue is no longer re-created.
  • RabbitMQ will soon be removed from the whole process in favor of another technology.

Conclusion

Finally, we want to apologize. We know how critical our services are to your business. Phrase as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Phrase engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and determine how to make changes that improve our services and processes.

Posted Mar 10, 2023 - 13:59 CET

Resolved
This incident has been resolved. We are sorry for any inconvenience caused.
Posted Mar 07, 2023 - 13:43 CET
Update
We continue to investigate this issue affecting third-party machine translation engines. We are sorry for any inconvenience caused.
Posted Mar 07, 2023 - 13:00 CET
Investigating
We are currently investigating this issue.
Posted Mar 07, 2023 - 12:34 CET
This incident affected: Phrase TMS (EU) (Machine translation).