We would like to share more details about the events that occurred with Phrase between 11:24 AM CEST and 04:10 PM CEST on October 10th, 2022, which led to the gradual unavailability of the Phrase TMS (EU) Term Base, and what Phrase engineers are doing to prevent these issues from happening again.
11:00 AM CEST: Maintenance of the Elasticsearch cluster started in order to apply a performance configuration intended to prevent a repeat of the previous week's incident.
11:24 AM CEST: Engineers noticed that the memory consumption of the Elasticsearch cluster was unusually high and was gradually degrading cluster performance.
11:35 AM CEST: The cluster master node was restarted; the cluster elected a new master node and rebalanced. The cluster was working as expected.
01:00 PM CEST: The cluster was extended with additional nodes and the fielddata cache was cleared to reduce heap usage across the cluster. Engineers proceeded with a rolling restart of the cluster.
02:00 PM CEST: Engineers noticed that the memory consumption of the Elasticsearch cluster was unusually high and was gradually degrading cluster performance.
02:20 PM CEST: Engineers extended the cluster again, applying updated Elasticsearch memory settings on part of the nodes. The cluster was operational.
03:00 PM CEST: Cluster scaling was completed. The cluster was healthy and engineers were monitoring the situation. Performance metrics were returning to normal.
04:10 PM CEST: The incident was closed; all metrics were healthy.
10:00 PM CEST: A rolling restart of the cluster with increased Elasticsearch memory and fielddata cache was successful, without any impact on performance.
Root Cause
A rolling restart of the Elasticsearch cluster caused severe memory and CPU pressure as the cluster was under-scaled for such an operation. While nodes were being restarted, cache eviction and rebuild caused search queries to be delayed, propagating the issue to dependent components.
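As an illustration of how this kind of memory pressure could be detected before restarting the next node, the sketch below reads per-node heap usage and fielddata cache size from a response shaped like the Elasticsearch node stats API. It is a minimal sketch and not Phrase's actual tooling: the function names, the 75% threshold, and the sample data are assumptions made for the example.

```python
from typing import Any, Dict, List

# Hypothetical sample shaped like an Elasticsearch node stats response
# (nodes.<id>.name, jvm.mem.heap_used_percent,
# indices.fielddata.memory_size_in_bytes). The values are made up.
SAMPLE_STATS: Dict[str, Any] = {
    "nodes": {
        "a1": {
            "name": "es-1",
            "jvm": {"mem": {"heap_used_percent": 62}},
            "indices": {"fielddata": {"memory_size_in_bytes": 1_048_576}},
        },
        "b2": {
            "name": "es-2",
            "jvm": {"mem": {"heap_used_percent": 91}},
            "indices": {"fielddata": {"memory_size_in_bytes": 524_288_000}},
        },
    }
}


def heap_report(nodes_stats: Dict[str, Any]) -> Dict[str, Dict[str, int]]:
    """Collect heap usage and fielddata cache size, keyed by node name."""
    report: Dict[str, Dict[str, int]] = {}
    for node in nodes_stats.get("nodes", {}).values():
        report[node["name"]] = {
            "heap_used_percent": node["jvm"]["mem"]["heap_used_percent"],
            "fielddata_bytes": node["indices"]["fielddata"]["memory_size_in_bytes"],
        }
    return report


def nodes_over_threshold(
    report: Dict[str, Dict[str, int]], max_heap_percent: int = 75
) -> List[str]:
    """Return the sorted names of nodes whose heap usage exceeds the threshold."""
    return sorted(
        name
        for name, stats in report.items()
        if stats["heap_used_percent"] > max_heap_percent
    )


def safe_to_restart_next_node(
    report: Dict[str, Dict[str, int]], max_heap_percent: int = 75
) -> bool:
    """Allow the next restart only if no node is above the heap threshold."""
    return not nodes_over_threshold(report, max_heap_percent)


if __name__ == "__main__":
    report = heap_report(SAMPLE_STATS)
    print("Nodes over threshold:", nodes_over_threshold(report))
    print("Safe to restart next node:", safe_to_restart_next_node(report))
```

A check like this would run between the steps of a rolling restart, pausing the operation while any node is above the chosen threshold. The appropriate threshold depends on the cluster's heap size and workload.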
Finally, we want to apologize. We know how critical our services are to your business. Phrase as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Phrase engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and to determine which changes will improve our services and processes.