We would like to share more details about the events that occurred with Memsource between 2:51 and 4:16 PM CET on the 18th of March, 2021. The incident disrupted the identification and scoring of non-translatable segments and degraded the performance of pre-translation and analyses. We also want to explain what Memsource engineers are doing to prevent these sorts of issues from happening again.
3:02 PM CET: Monitoring of the frontend component reports slower responses.
3:04 PM CET: Engineers determine that the slowness of one API endpoint originates in the AI component that provides non-translatable segment scores.
3:08 PM CET: Engineers from all relevant teams are investigating the problem and its root cause.
3:24 PM CET: Problem is identified; very long segments are being sent to the AI component.
3:38 PM CET: In order to lower the load on the AI component, the API endpoint providing non-translatable segment scoring is disabled.
3:50 PM CET: There is no significant improvement in the response time of the AI component. The team is evaluating other solutions.
4:00 PM CET: The problem is traced to the automatic pre-translation of a specific document. As there are no similar documents in the queue and the problematic one is close to being finished, the team decides to let the system finish the pre-translation.
4:14 PM CET: The AI component is operating normally.
4:16 PM CET: The API endpoint providing non-translatable segment scoring is re-enabled.
The responses of the AI component were slowed down by the processing of very long segments, which came from the automatic pre-translation of a specific document.
As a reaction to the problems:
Finally, we want to apologize. We know how critical our services are to your business. Memsource as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Memsource engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine which changes will improve our services and processes.