Degraded Performance of API and Editor for Web Components

Incident Report for Phrase (formerly Memsource)

Postmortem

Introduction

We would like to share more details about the events that occurred at Memsource between 2:51 and 4:16 PM CET on the 18th of March, 2021. These events disrupted the identification and scoring of non-translatable segments and degraded the performance of pre-translation and analyses. We also describe what Memsource engineers are doing to prevent these sorts of issues from happening again.

Timeline

3:02 PM CET: Monitoring of the frontend component reports slower responses.

3:04 PM CET: Engineers determine that the slowness of one API endpoint originates in the AI component providing non-translatable segment scores.

3:08 PM CET: Engineers from all relevant teams investigate the problem and its root cause.

3:24 PM CET: Problem is identified; very long segments are being sent to the AI component.

3:38 PM CET: In order to lower the load on the AI component, the API endpoint providing non-translatable segment scoring is disabled.

3:50 PM CET: There is no significant improvement in the response time of the AI component. The team is evaluating other solutions.

4:00 PM CET: The problem is traced to the automatic pre-translation of a specific document. As there are no similar documents in the queue and the problematic one is close to being finished, the team decides to let the system finish the pre-translation.

4:14 PM CET: The AI component is operating normally.

4:16 PM CET: The API endpoint providing non-translatable segment scoring is re-enabled.

Root Cause

The responses of the AI component were slowed down by the processing of some very long segments.

Actions to Prevent Recurrence

As a reaction to the problems:

Very long segments will automatically be routed to a rule-based processor instead of a neural network.
The AI model will be optimized for the handling of a higher rate of requests.
An automated fail-safe mechanism will be implemented to prevent the AI component from causing system-wide issues in cases of overload.
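
To illustrate the first and third actions above, here is a minimal sketch of length-based routing combined with a simple fail-safe (a circuit breaker). It is an illustration under stated assumptions, not the actual Memsource implementation: the character threshold, the rule-based heuristic, the class and function names, and the failure limits are all hypothetical.

```python
import time
from typing import Callable, Optional

# Hypothetical threshold: segments longer than this skip the neural model.
MAX_NEURAL_CHARS = 2000


def rule_based_score(segment: str) -> float:
    """Hypothetical cheap heuristic: share of characters that are not letters."""
    if not segment:
        return 0.0
    non_letters = sum(1 for ch in segment if not ch.isalpha())
    return non_letters / len(segment)


class CircuitBreaker:
    """Stops calling the AI component after repeated failures (assumed design)."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_after:
            # Cool-down elapsed: close the breaker and try the model again.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def score_segment(segment: str,
                  neural_score: Callable[[str], float],
                  breaker: CircuitBreaker,
                  max_chars: int = MAX_NEURAL_CHARS) -> float:
    """Route very long segments, or any segment while the breaker is open,
    to the rule-based scorer; otherwise call the neural scorer."""
    if len(segment) > max_chars or not breaker.allow():
        return rule_based_score(segment)
    try:
        result = neural_score(segment)
    except Exception:
        # Treat any error (for example a timeout) as a failure and fall back.
        breaker.record_failure()
        return rule_based_score(segment)
    breaker.record_success()
    return result
```

In this sketch, a very long segment never reaches the neural scorer, and repeated failures of the AI component cause all scoring to fall back to the rule-based path until the cool-down period has passed, so callers keep receiving answers instead of waiting on an overloaded component.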

Conclusion

Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and determine what changes will improve our services and processes.

Posted Mar 22, 2021 - 09:29 CET

Resolved

This incident has been resolved.

Posted Mar 18, 2021 - 16:36 CET

Update

We are continuing to monitor for any further issues.

Posted Mar 18, 2021 - 16:19 CET

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Mar 18, 2021 - 16:18 CET

Investigating

Our engineering team is investigating the degraded performance of the Pre-translation and Non-translatable components.

Posted Mar 18, 2021 - 15:58 CET

This incident affected: Memsource TMS (EU) (API, Editor for Web).