Search & Navigation Incident - EMEA Region (EMEA01)
Incident Report for Optimizely Service
Postmortem

SUMMARY

Optimizely Search & Navigation (formerly Find) is a cloud-based enterprise search solution that delivers both enhanced relevance and powerful search functionality to websites. During mid-January we experienced intermittent episodes that impacted service functionality in the EU Digital Experience Cloud region. The following report provides additional detail on the January events.

DETAILS

On January 13, 2022, at 15:25 UTC, the Engineering team received an alert related to the Search & Navigation EMEA01 cluster within the EU Digital Experience Cloud region. Troubleshooting began immediately upon receiving the alert, and the team identified a condition in which JVM Heap Memory usage on several nodes remained high and did not release on its own. The affected nodes were gracefully restarted to release the retained memory, and service was fully operational by 16:22 UTC.
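
For context, the following is a minimal sketch of the kind of heap check that can surface this condition, assuming an Elasticsearch-style node-stats API (the report's references to JVM heap, shards, and red cluster status suggest such a backend). The endpoint URL and the 85% threshold are illustrative assumptions, not values taken from the incident:

# Minimal heap-usage check, assuming an Elasticsearch-style node-stats API.
# The cluster URL and alert threshold below are hypothetical placeholders.
import requests

ES_URL = "https://emea01.example.internal:9200"   # hypothetical cluster endpoint
HEAP_ALERT_PERCENT = 85                           # assumed alerting threshold

def nodes_with_high_heap():
    """Return (node_name, heap_used_percent) for nodes above the threshold."""
    stats = requests.get(f"{ES_URL}/_nodes/stats/jvm", timeout=10).json()
    flagged = []
    for node in stats["nodes"].values():
        heap_pct = node["jvm"]["mem"]["heap_used_percent"]
        if heap_pct >= HEAP_ALERT_PERCENT:
            flagged.append((node["name"], heap_pct))
    return flagged

if __name__ == "__main__":
    for name, pct in nodes_with_high_heap():
        print(f"{name}: heap at {pct}% - candidate for a graceful restart")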

On January 20, 2022, at 14:04 UTC, a similar issue presented itself and was determined to be a recurrence of the same problem. The recurrence was observed during a routine automated deployment, which also encountered an unexpected surge in volume. This surge caused the deployment to fail, presumably due to an increase in JVM Heap Memory leakage. Additional corrective measures were implemented immediately, and service was restored by 19:10 UTC.

On January 24, 2022, at 08:26 UTC, a deployment that began outside of EMEA business hours was unable to complete in time due to its size. The additional load on the cluster caused several data nodes to exhaust their available Heap Memory, which in turn caused those nodes to disconnect ungracefully from the cluster, leaving the cluster in a red status. This unhealthy state prevented the automation subroutines from restarting the nodes. The deployment was halted and manual mitigation measures were immediately implemented. The service was restored by 09:50 UTC.
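
The following is a simplified sketch of the kind of health gate described above, again assuming an Elasticsearch-style cluster health API; the endpoint and the restart_node() helper are hypothetical stand-ins for the internal automation:

# Simplified health gate: automated restarts are skipped while the cluster is
# red, mirroring the behavior described above. URL and restart_node() are
# hypothetical placeholders, not the actual automation.
import requests

ES_URL = "https://emea01.example.internal:9200"   # hypothetical cluster endpoint

def cluster_status() -> str:
    """Return the cluster health status: 'green', 'yellow', or 'red'."""
    return requests.get(f"{ES_URL}/_cluster/health", timeout=10).json()["status"]

def restart_node(node_name: str) -> None:
    # Placeholder for the internal graceful-restart routine.
    print(f"restarting {node_name} gracefully")

def maybe_restart(node_name: str) -> None:
    if cluster_status() == "red":
        # A red cluster means at least one primary shard is unassigned;
        # the automation defers to an operator instead of restarting nodes.
        print(f"cluster is red - skipping automated restart of {node_name}")
        return
    restart_node(node_name)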

On January 25, 2022, at 09:44 UTC, the JVM Heap Memory issue presented again. The issue was immediately identified, and mitigation measures were put in place. However, an incorrect setting prevented the shard rebalancing process from triggering, which delayed the cluster’s return to its normal healthy state. The shard reallocation setting was reconfigured to prevent future errors and reduce the high heap usage, and the service was fully operational by 11:41 UTC.
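
As an illustration only, the sketch below shows how shard allocation can be re-enabled on an Elasticsearch-style cluster. The report does not name the exact setting that was misconfigured, so cluster.routing.allocation.enable is used here purely as a representative allocation setting, and the endpoint is a placeholder:

# Re-enable shard allocation via the cluster settings API. The setting shown
# is a representative example only; the incident's actual setting is not
# named in this report, and the URL is a placeholder.
import requests

ES_URL = "https://emea01.example.internal:9200"   # hypothetical cluster endpoint

def enable_shard_allocation():
    """Switch shard allocation back on so unassigned shards can be placed."""
    body = {"persistent": {"cluster.routing.allocation.enable": "all"}}
    resp = requests.put(f"{ES_URL}/_cluster/settings", json=body, timeout=10)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(enable_shard_allocation())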

TIMELINE

January 13, 2022

15:25 UTC – First alert and automation restarts were triggered. 

15:26 UTC – Alert acknowledged and troubleshooting initiated.

15:03 UTC – Unhealthy nodes were identified and restarted.

16:22 UTC – Service operation recovered.

January 20, 2022

12:40 UTC – First alert triggered. Cluster became unstable but was able to handle most requests. Initial mitigations began.

14:10 UTC – Critical alert triggered. Additional mitigation was performed.

14:30 UTC – STATUSPAGE updated 

15:30 UTC – Additional resources were engaged to help with the mitigation. Cluster started slowly recovering.

17:30 UTC – Additional mitigation was applied to speed the recovery rate.

19:10 UTC – All services restored, with limited functionality.

19:30 UTC – Service was fully recovered. 

January 24, 2022

08:26 UTC – First alert and automation restarts were triggered. 

08:30 UTC – Mitigation action was performed. 

09:05 UTC – Service started recovering and was able to handle requests.

09:19 UTC – STATUSPAGE updated 

09:50 UTC – Service was fully recovered. 

January 25, 2022

09:44 UTC – First alert and automation restarts were triggered. 

10:02 UTC – STATUSPAGE updated 

10:22 UTC – Mitigation action was performed. Cluster started slowly recovering.

11:37 UTC – Issue identified and mitigation actions were performed. 

11:41 UTC – Critical alert resolved and service fully operational.

14:55 UTC – Root cause identified and the Engineering team started working on long-term mitigation actions. 

ANALYSIS

According to the results of the investigation, the cluster was overwhelmed by several large customer indices that had grown beyond the recommended size. When an index exceeds the recommended size, it can cause one or more nodes to exhaust their available Heap Memory, with the potential for further cluster failure. In addition, the automatic deployment routine used on the EMEA01 cluster requires a substantially longer runtime window than currently configured and should be reconfigured accordingly.
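
A rough sketch of the kind of size audit this analysis implies is shown below, assuming an Elasticsearch-style _cat/indices API; the 50 GB ceiling is an illustrative assumption, not the actual recommended size referenced above:

# Flag indices whose on-disk size exceeds an assumed per-index ceiling.
# The URL and the 50 GB limit are illustrative placeholders.
import requests

ES_URL = "https://emea01.example.internal:9200"   # hypothetical cluster endpoint
MAX_INDEX_BYTES = 50 * 1024**3                    # assumed per-index ceiling

def oversized_indices():
    """Return (index_name, size_in_bytes) for indices above the ceiling."""
    rows = requests.get(
        f"{ES_URL}/_cat/indices?format=json&bytes=b", timeout=10
    ).json()
    return [
        (row["index"], int(row["store.size"]))
        for row in rows
        if row.get("store.size") and int(row["store.size"]) > MAX_INDEX_BYTES
    ]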

IMPACT

During this series of events, a subset of requests to the EMEA01 Search & Navigation cluster may have experienced timeouts or 5xx errors, or slower-than-normal response times while attempting to connect.

CORRECTIVE MEASURES

Short-term mitigation

  • Restarting unhealthy node(s) and reallocating shards.
  • Halting deployments during regional business hours.
  • Correcting the cluster allocation setting.

Long-term mitigation

  • To avoid future customer impact, customers with indices exceeding, or close to exceeding, the recommended sizes will be migrated to an alternate cluster (a high-level sketch of such a migration follows this list).
  • To improve the deployment experience, the deployment runtime settings will be analyzed and further adjusted.
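
As a high-level illustration of the migration mentioned above, the sketch below uses an Elasticsearch-style reindex-from-remote call; the hosts and index names are placeholders, and the actual migration will be handled by internal tooling rather than a one-off script like this:

# Copy one index from a source cluster into an alternate cluster using
# reindex-from-remote. All hosts and names are hypothetical placeholders.
import requests

SOURCE_URL = "https://emea01.example.internal:9200"   # hypothetical source cluster
TARGET_URL = "https://emea02.example.internal:9200"   # hypothetical target cluster

def migrate_index(index_name: str) -> str:
    """Start an async reindex of one index into the target cluster; return the task id."""
    body = {
        "source": {"remote": {"host": SOURCE_URL}, "index": index_name},
        "dest": {"index": index_name},
    }
    # wait_for_completion=false returns a task id so large copies run asynchronously.
    resp = requests.post(
        f"{TARGET_URL}/_reindex?wait_for_completion=false", json=body, timeout=30
    )
    resp.raise_for_status()
    return resp.json()["task"]

Note that reindex-from-remote requires the source host to be whitelisted on the destination cluster (reindex.remote.whitelist), which is one reason this is offered only as a sketch of how such a migration could look.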

FINAL WORDS

We are sorry! We recognize the negative impact on affected customers and regret the disruption that you sustained. We have a strong commitment to our customers and to delivering high-availability services, including Search & Navigation. We will continue to prioritize our efforts to overcome these recent difficulties and diligently apply lessons learned to avoid future events of a similar nature, ensuring that we continue to develop Raving Fans!

Posted Feb 04, 2022 - 16:08 UTC

Resolved
This incident has been resolved.

A Post Mortem will be published as soon as it becomes available.
Posted Jan 24, 2022 - 10:19 UTC
Monitoring
The service has been recovered. We are continuing to monitor the health of the service for any further issues.
Posted Jan 24, 2022 - 09:50 UTC
Identified
We are currently investigating an event that is impacting the availability of the Search & Navigation (FIND) service in the EMEA region, EMEA01. The cause has been identified and we are working on mitigation.
A subset of clients may be experiencing high latency or 5xx errors.
Posted Jan 24, 2022 - 09:18 UTC