Performance Disruption of Phrase TMS (EU) Term Base and CAT Web Editor Components between 1:24 PM CEST and 4:10 PM CEST

Incident Report for Phrase

Postmortem

Introduction

We would like to share more details about the events that occurred with Phrase between 11:24 AM CEST and 04:10 PM CEST on October 10th, 2022, which led to the gradual unavailability of the Phrase TMS (EU) Term Base, and about what Phrase engineers are doing to prevent these issues from happening again.

Timeline

11:00 AM CEST: Maintenance of the Elasticsearch cluster started in order to apply a performance configuration intended to prevent a recurrence of a similar incident from the previous week.

11:24 AM CEST: Engineers noticed that the memory consumption of the Elasticsearch cluster was unusually high and was gradually degrading cluster performance.

11:35 AM CEST: Cluster master node was restarted; the cluster elected a new master node and rebalanced. Cluster was working as expected.

01:00 PM CEST: Cluster was extended with additional nodes and the fielddata cache was cleared to reduce heap usage across the cluster. Engineers proceeded with a rolling restart of the cluster.

02:00 PM CEST: Engineers again noticed that the memory consumption of the Elasticsearch cluster was unusually high and was gradually degrading cluster performance.

02:20 PM CEST: Engineers extended the cluster again, with updated Elasticsearch memory settings on part of the nodes. Cluster was operational again.

03:00 PM CEST: Cluster scaling is completed. Cluster is healthy and engineers are monitoring the situation. Performance metrics are returning to normal.

04:10 PM CEST: Incident is closed, all metrics are healthy now.

10:00 PM CEST: Rolling restart of the cluster with increased Elasticsearch memory and fielddata cache is successful, without any impact on performance.
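
For readers who want to see what the heap checks and the fielddata cache clearing in the timeline might look like in practice, here is a minimal Python sketch. It is not Phrase's tooling. It assumes the standard Elasticsearch REST endpoints for clearing the fielddata cache and listing per-node heap usage; the node names, the sample values, and the 85% threshold are hypothetical and chosen only for illustration.

```python
import json

# Elasticsearch REST calls relevant to the timeline (listed only, not executed here).
CLEAR_FIELDDATA = ("POST", "/_cache/clear?fielddata=true")
NODE_HEAP = ("GET", "/_cat/nodes?h=name,heap.percent&format=json")


def nodes_over_threshold(cat_nodes_json, threshold=85):
    """Return the names of nodes whose JVM heap usage is at or above `threshold` percent.

    `threshold` is a hypothetical alerting value, not one taken from the report.
    """
    rows = json.loads(cat_nodes_json)
    # With format=json the cat API returns column values as strings.
    return sorted(r["name"] for r in rows if int(r["heap.percent"]) >= threshold)


# Hypothetical response body for the NODE_HEAP request.
sample = json.dumps([
    {"name": "es-node-1", "heap.percent": "91"},
    {"name": "es-node-2", "heap.percent": "62"},
    {"name": "es-node-3", "heap.percent": "88"},
])

hot_nodes = nodes_over_threshold(sample)
print(hot_nodes)  # ['es-node-1', 'es-node-3']
```

A check of this kind, run before and after clearing the cache, would show whether the clear actually relieved heap pressure on the affected nodes.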

Root Cause

A rolling restart of the Elasticsearch cluster caused severe memory and CPU pressure as the cluster was under-scaled for such an operation. While nodes were being restarted, cache eviction and rebuild caused search queries to be delayed, propagating the issue to dependent components.
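
To illustrate why an under-scaled cluster struggles during a rolling restart, the following sketch estimates heap usage on the remaining nodes while one node is down. It rests on a deliberately crude assumption: all nodes are identical, and the heap in use on the restarting node shifts in equal shares onto the others. Real clusters behave less uniformly, and the 75% limit and function names are hypothetical.

```python
def projected_heap_percent(heap_percents, nodes_down=1):
    """Estimate average heap usage (percent) on the nodes that stay up.

    Simplifying assumption: identical nodes, and the heap in use on the
    restarting node(s) moves onto the remaining ones in equal shares.
    """
    remaining = len(heap_percents) - nodes_down
    if remaining <= 0:
        raise ValueError("at least one node must stay up")
    return sum(heap_percents) / remaining


def safe_to_restart(heap_percents, limit=75.0, nodes_down=1):
    """Return True if the projected heap usage stays at or below `limit` percent."""
    return projected_heap_percent(heap_percents, nodes_down) <= limit


# Three nodes at 60% heap: taking one down projects 90% on the other two.
print(projected_heap_percent([60, 60, 60]))  # 90.0
print(safe_to_restart([60, 60, 60]))         # False

# Six nodes at 60% heap: taking one down projects 72% on the other five.
print(projected_heap_percent([60] * 6))      # 72.0
print(safe_to_restart([60] * 6))             # True
```

Under this rough model, adding nodes lowers the projected load per remaining node, which is consistent with the cluster scaling described in this report.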

Actions to Prevent Recurrence

Cluster was scaled to improve stability in rolling restart situations.
Elasticsearch memory and cache settings were updated to lower memory pressure in similar situations.
Engineers will investigate how to further optimize Elasticsearch cluster maintenance and scaling processes to prevent such situations in the future.
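
As an illustration of what a scripted maintenance process could look like, the sketch below generates the per-node sequence of a rolling restart, following the generally documented Elasticsearch procedure: restrict shard allocation, flush, restart the node, re-enable allocation, and wait for the cluster to report green. This is an assumption about a typical procedure, not a description of Phrase's internal process. The node names, the "MANUAL" step marker, and the 30-minute timeout are hypothetical.

```python
def rolling_restart_plan(node_names, health_timeout="30m"):
    """Build an ordered list of (method, target, body) steps for a rolling restart.

    The plan is only generated, not executed. "MANUAL" marks the step where
    an operator or orchestration tool restarts the node process.
    """
    steps = []
    for node in node_names:
        steps += [
            # Keep replica shards from being reallocated while the node is down.
            ("PUT", "/_cluster/settings",
             {"persistent": {"cluster.routing.allocation.enable": "primaries"}}),
            # Flush so that shard recovery after the restart is faster.
            ("POST", "/_flush", None),
            ("MANUAL", f"restart {node}", None),
            # Setting the value to null restores the default allocation behaviour.
            ("PUT", "/_cluster/settings",
             {"persistent": {"cluster.routing.allocation.enable": None}}),
            # Block until the cluster is green again before touching the next node.
            ("GET", f"/_cluster/health?wait_for_status=green&timeout={health_timeout}", None),
        ]
    return steps


plan = rolling_restart_plan(["es-node-1", "es-node-2"])
for method, target, body in plan:
    print(method, target, body)
```

Waiting for green status between nodes is the step that keeps only one node's worth of capacity missing at any time, which matters most when the cluster has little headroom.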

Conclusion

Finally, we want to apologize. We know how critical our services are to your business. Phrase as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Phrase engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine which changes will improve our services and processes.

Posted Oct 19, 2022 - 09:06 CEST

Resolved

This incident has been resolved.

Posted Oct 10, 2022 - 16:10 CEST

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Oct 10, 2022 - 15:08 CEST

Identified

We investigated the issue affecting the Term Base and CAT Web Editor components and identified the root cause. We are currently working on a fix to solve the issue.

Posted Oct 10, 2022 - 13:24 CEST

This incident affected: Phrase TMS (EU) (CAT web editor, Term base).