We would like to share more details about the events that occurred at Phrase between 11:40 AM CET and 12:30 PM CET on March 7th, 2023, which led to a partial outage of the Machine Translation component, and about what Phrase engineers are doing to prevent these issues from happening again.
11:43 AM CET: An automatic alert arrives, notifying the engineer on duty of an increased error rate in the Machine Translation component. The engineer on duty identifies that the component's worker threads are blocked.
11:56 AM CET: The service is restarted, but it returns to the same blocked state. The responsible engineers continue looking for the root cause.
12:20 PM CET: The previous version of the service is deployed, but this brings only a temporary improvement.
12:27 PM CET: The RabbitMQ message queue used by the Machine Translation component is purged, as it contains a large number of messages. This allows the Machine Translation component to publish messages successfully and proceed with the waiting requests.
Root Cause
The RabbitMQ connector was unexpectedly re-creating a message queue that is no longer being read from. Messages accumulated in this queue, which degraded the performance of RabbitMQ and prevented the Machine Translation component from publishing new messages, thus exhausting its resources.
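As an illustration only, and not a description of Phrase's actual tooling, the condition described above (a queue that keeps receiving messages but has no consumers) could be detected with a check along the following lines. The sketch assumes queue statistics shaped like the entries returned by the RabbitMQ management API's queue listing, using only the `name`, `messages` and `consumers` fields. The function name, the queue names and the backlog threshold are hypothetical.

```python
from typing import Iterable, List, Mapping


def find_unconsumed_queues(
    queues: Iterable[Mapping[str, object]],
    max_backlog: int = 10_000,  # hypothetical threshold, tune per workload
) -> List[str]:
    """Return the names of queues that hold a backlog but have no consumers."""
    flagged = []
    for queue in queues:
        messages = int(queue.get("messages", 0))
        consumers = int(queue.get("consumers", 0))
        # A queue nobody reads from that keeps growing is the pattern
        # described in the root cause above.
        if consumers == 0 and messages >= max_backlog:
            flagged.append(str(queue["name"]))
    return flagged


if __name__ == "__main__":
    # Hypothetical sample data; a real check would fetch these statistics
    # from the broker's management interface.
    sample = [
        {"name": "mt.legacy", "messages": 250_000, "consumers": 0},
        {"name": "mt.requests", "messages": 40_000, "consumers": 4},
        {"name": "mt.idle", "messages": 3, "consumers": 0},
    ]
    print(find_unconsumed_queues(sample))
```

A check of this kind could feed an alert, so that an unread queue is noticed before its backlog affects publishers.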
Finally, we want to apologize. We know how critical our services are to your business. Phrase as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Phrase engineers will be working tirelessly over the coming days and weeks on improving their understanding of the incident and determining how to make changes that improve our services and processes.