Search & Navigation Incident - EMEA Region (EMEA17)
Incident Report for Optimizely Service
Postmortem

SUMMARY 

Optimizely Search & Navigation (formerly Find) is a cloud-based enterprise search solution that delivers both enhanced relevance and powerful search functionality to websites. During April, we experienced intermittent episodes that impacted service functionality in the EU Digital Experience Cloud region. The following report describes the April events in additional detail.

DETAILS 

On April 7, 2022, at 05:18 UTC the Engineering team received an alert related to the Search & Navigation EMEA17 cluster within the EU Digital Experience Cloud region. Troubleshooting started immediately, and an initial mitigation effort included restarting the proxy node. Further investigation revealed consistently high Java heap memory consumption, which resulted in long-running garbage collections on some nodes and blocked requests to those nodes. The affected nodes were gracefully restarted to relieve the memory pressure, and the service was fully operational by 06:15 UTC.
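
For illustration only, the kind of signal described above (sustained high heap usage and long old-generation garbage-collection pauses) can be read from Elasticsearch's nodes-stats API. The sketch below assumes the elasticsearch-py 7.x client; the endpoint URL and the 85% threshold are placeholders rather than our actual monitoring configuration.

    from elasticsearch import Elasticsearch

    # Placeholder endpoint, not the real EMEA17 cluster address.
    es = Elasticsearch("https://search.example.internal:9200")

    # Pull JVM metrics for every node in the cluster.
    stats = es.nodes.stats(metric="jvm")

    for node_id, node in stats["nodes"].items():
        heap_pct = node["jvm"]["mem"]["heap_used_percent"]
        old_gc = node["jvm"]["gc"]["collectors"]["old"]
        print(f"{node['name']}: heap {heap_pct}%, "
              f"old-gen GC {old_gc['collection_count']} runs / "
              f"{old_gc['collection_time_in_millis']} ms total")
        if heap_pct > 85:  # illustrative threshold, not the production alert level
            print(f"  -> {node['name']} is under heap pressure")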

On April 11, 2022, at 09:40 UTC and April 13, 2022, at 09:21 UTC, a similar issue presented itself and recovered after the problematic node was automatically restarted.

On April 25, 2022, at 11:35 UTC the service experienced an outage caused by a data node being evicted from the cluster. The connection issue was identified, and the affected node was immediately restarted, making the system fully operational by 11:59 UTC. The Engineering team has been working diligently to identify the root cause of the degradation, and long-term mitigation is being applied to reduce the impact. Because the Search & Navigation service comprises a multitude of components, including but not limited to API calls towards the cluster, mitigation required a complex and exhaustive investigation and therefore additional time to stabilize the service and identify the resolution.
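
For context, a data node dropping out of the cluster, as on April 25, shows up in the cluster-health and cat-nodes APIs as a reduced data-node count and unassigned shards. A minimal check along those lines might look like the sketch below (the expected node count is a made-up figure and the endpoint is a placeholder).

    from elasticsearch import Elasticsearch

    es = Elasticsearch("https://search.example.internal:9200")  # placeholder URL

    EXPECTED_DATA_NODES = 9  # illustrative figure, not the real EMEA17 topology

    health = es.cluster.health()
    print(f"status={health['status']} "
          f"data_nodes={health['number_of_data_nodes']} "
          f"unassigned_shards={health['unassigned_shards']}")

    if health["number_of_data_nodes"] < EXPECTED_DATA_NODES:
        # A node has dropped out; list the remaining members to find the gap.
        for node in es.cat.nodes(format="json", h="name,node.role,heap.percent"):
            print(node["name"], node["node.role"], node["heap.percent"])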

TIMELINE 

April 7, 2022

05:18 UTC – First alert triggered, acknowledged, and troubleshooting initiated. 

05:29 UTC – Mitigation step was initiated.

05:29 UTC – Unhealthy nodes were identified and restarted. 

05:53 UTC – Service started recovering and was able to handle requests.

05:55 UTC – STATUSPAGE updated. 

06:15 UTC – Service was fully operational. 

 

April 11, 2022

09:40 UTC – First alert triggered.

09:41 UTC – Alert was acknowledged. Initial investigation began.

10:12 UTC – Node restart was performed automatically.

10:43 UTC – Service was fully recovered.

 

April 13, 2022

09:21 UTC – First alert triggered and acknowledged immediately.

09:22 UTC – Node restart was performed automatically.

09:34 UTC – STATUSPAGE updated. 

09:49 UTC – Service was fully recovered.

 

April 25, 2022

11:35 UTC – First alert triggered.

11:38 UTC – Alert was acknowledged.

11:43 UTC – STATUSPAGE updated. 

11:45 UTC – Mitigation steps were applied. 

11:59 UTC – Service was fully operational.

ANALYSIS

Initial investigation determined that the issue was partially caused by long garbage-collection pauses on specific nodes, which blocked all requests to those nodes and resulted in a very large number of connections being created and then failing once the queue became full. The recurring high load observed on the data nodes points toward the cluster being overloaded by several heavy indices with ever-growing shard sizes. Migrating these heavy users off the cluster will reduce the stress on it, thereby mitigating the potential for additional cluster failures.
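
The oversized indices referred to above can be identified with Elasticsearch's cat-shards API, which reports per-shard store sizes. The sketch below is a hypothetical report script along those lines; the 50 GB ceiling is the commonly cited shard-size guideline, not a value measured on this cluster, and the endpoint is a placeholder.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("https://search.example.internal:9200")  # placeholder URL

    SHARD_SIZE_LIMIT_GB = 50.0  # widely cited guideline, used here only for illustration

    # Per-shard store sizes in gigabytes, primaries only, largest first.
    shards = es.cat.shards(format="json", bytes="gb", h="index,shard,prirep,store,node")
    primaries = [s for s in shards if s["prirep"] == "p" and s["store"]]
    primaries.sort(key=lambda s: float(s["store"]), reverse=True)

    for s in primaries:
        size_gb = float(s["store"])
        flag = "  <-- candidate for migration" if size_gb > SHARD_SIZE_LIMIT_GB else ""
        print(f"{s['index']} shard {s['shard']}: {size_gb:.1f} GB on {s['node']}{flag}")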

Further examination indicated that the issue may have been caused by a tenant of the shared resource sending bulk requests, causing memory congestion on a few nodes on a regular basis. This led to failed queries on the node with high heap usage. When that node is restarted, the Elasticsearch primary shard being indexed to moves to another node, and the scenario repeats itself until the cluster's internal shards have been rebalanced and the service is fully operational.
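
As a hedged illustration of how bulk-indexing pressure can be bounded from the client side, the sketch below uses the streaming_bulk helper from the elasticsearch-py client to send small, size-limited batches instead of very large bulk requests. The index name, document generator, chunk sizes, and endpoint are all illustrative placeholders; this is not the tenant's actual integration.

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import streaming_bulk

    es = Elasticsearch("https://search.example.internal:9200")  # placeholder URL

    def generate_docs():
        # Hypothetical document source; a real integration would read from the CMS.
        for i in range(100_000):
            yield {"_index": "tenant-content", "_id": i, "title": f"page {i}"}

    # A modest chunk_size and max_chunk_bytes keep each bulk request small enough
    # that no single data node has to buffer an outsized payload on its heap.
    for ok, item in streaming_bulk(es, generate_docs(),
                                   chunk_size=500,
                                   max_chunk_bytes=5 * 1024 * 1024,
                                   raise_on_error=False):
        if not ok:
            print("indexing failed:", item)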

IMPACT 

During this series of events, a subset of requests to the EMEA17 Search & Navigation cluster may have experienced network timeouts (5xx errors) or slower-than-normal response times while attempting to connect.

CORRECTIVE MEASURES 

Short-term mitigation

  • Restarting of unhealthy node(s) and reallocation of shards.
  • Correction of the cluster allocation setting (see the sketch following this list).
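
As a hedged sketch of the second item above: shard allocation in Elasticsearch is governed by cluster-wide settings exposed through the cluster-settings API. The exact setting that was corrected is not specified in this report; the example below simply re-enables allocation for all shard types, which is one common corrective step (the endpoint is a placeholder).

    from elasticsearch import Elasticsearch

    es = Elasticsearch("https://search.example.internal:9200")  # placeholder URL

    # Re-enable allocation for all shard types, a common corrective step when
    # allocation has previously been restricted. The actual setting adjusted on
    # the EMEA17 cluster is not detailed in this report.
    es.cluster.put_settings(body={
        "transient": {"cluster.routing.allocation.enable": "all"}
    })

    # Confirm that shards are being (re)assigned.
    health = es.cluster.health()
    print(health["status"], "unassigned shards:", health["unassigned_shards"])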

Long-term mitigation

  • To avoid future customer impact, customers with indices exceeding, or close to exceeding, the recommended sizes will be migrated to an alternate cluster (see the sketch after this list).
  • Additional clusters have been provisioned to reduce the customer load on each cluster.
  • To manage growing accounts, a new tool that enables faster internal migrations within a cluster is under development.
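
One mechanism Elasticsearch offers for moving an index between clusters is reindex-from-remote; whether this is the mechanism used for the migrations above is not stated in this report. The sketch below shows the general shape of such a migration with the elasticsearch-py 7.x client; cluster URLs and the index name are hypothetical.

    from elasticsearch import Elasticsearch

    # Client pointed at the *destination* cluster (placeholder URL).
    dest = Elasticsearch("https://search-new.example.internal:9200")

    # Pull a heavy index across from the source cluster. Reindex-from-remote
    # requires the source host to be whitelisted via reindex.remote.whitelist
    # on the destination nodes. Index and host names here are hypothetical.
    dest.reindex(body={
        "source": {
            "remote": {"host": "https://search.example.internal:9200"},
            "index": "tenant-content",
        },
        "dest": {"index": "tenant-content"},
    }, wait_for_completion=False)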

FINAL WORDS

We are sorry! We recognize the negative impact on affected customers and regret the disruption you sustained. We have a strong commitment to our customers and to delivering highly available services, including Search & Navigation. We will continue to prioritize our efforts to overcome these recent difficulties and diligently apply lessons learned to avoid future events of a similar nature, ensuring that we continue to develop Raving Fans!

Posted Apr 29, 2022 - 14:17 UTC

Resolved
This incident has been resolved.

A Post Mortem will be published as soon as it becomes available.
Posted Apr 25, 2022 - 15:38 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 25, 2022 - 12:10 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Apr 25, 2022 - 11:59 UTC
Investigating
We are currently investigating an event that is impacting the availability of the Search & Navigation (FIND) service in the EMEA region, EMEA17.

A subset of clients will be experiencing high latency or 5xx-errors.
Posted Apr 25, 2022 - 11:43 UTC