Sandbox - Knowledge Graph Search Unavailable
Incident Report for Yext
Postmortem

Summary

On December 22 at 7:43pm ET, a number of nodes were removed from the Knowledge Graph Search cluster for routine maintenance. This is typically a non-event. However, the preceding configuration change that stops search requests from being routed to those nodes had been only partially applied, so every request sent to the removed nodes failed. The issue was discovered the next morning at 10:30am ET, and the configuration change to stop routing requests to those nodes was fully applied by 11am ET. That restored service, and no data was lost.

Root Cause

Removing nodes from a search cluster is performed by an automated process. In this case, the process did not function as expected because the nodes' actual state had drifted from their description in Terraform. Time to detection was lengthened because post-change verification is performed on the search cluster itself, which remained completely operational; the failing requests were the ones still being routed to nodes that were no longer in the cluster.
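
The gap described above, where the cluster is healthy but requests are still routed to nodes that have left it, is the kind of condition a check on the routing layer could catch. The following is a minimal Python sketch under stated assumptions, not the tooling actually used: the function name and the node names are hypothetical, and it assumes the list of routed nodes and the list of live cluster members can each be obtained as plain lists of names.

```python
def stale_routing_targets(routed_nodes, live_nodes):
    """Return, sorted, the routed node names that are not live cluster members.

    Both arguments are iterables of node names (hypothetical inputs; how they
    are collected depends on the routing layer and the cluster in use).
    """
    live = set(live_nodes)
    return sorted(name for name in set(routed_nodes) if name not in live)


if __name__ == "__main__":
    # Hypothetical example: search-3 was removed from the cluster, but the
    # routing configuration still lists it.
    routed = ["search-1", "search-2", "search-3"]
    live = ["search-1", "search-2"]
    stale = stale_routing_targets(routed, live)
    if stale:
        print("Requests are still routed to removed nodes:", ", ".join(stale))
```

Running a check like this as part of post-change verification would flag the mismatch even when the cluster itself reports as fully operational.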

Mitigation

We will implement periodic automated checks for drift between the actual state of infrastructure nodes and their checked-in configuration. We will also tie the relevant application dashboards more closely to the infrastructure change process, so that more classes of errors are identified and time to detection is reduced.
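
One possible shape for a periodic drift check is sketched below in Python. It is an illustration only, not the implementation that was adopted. It relies on the `-detailed-exitcode` flag of `terraform plan`, which exits with 0 when the plan is empty, 2 when the plan contains changes, and 1 on error. The function names, the working directory, and the injectable `runner` parameter are assumptions made for this example.

```python
import subprocess


def classify_plan_exit_code(code):
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "in_sync"  # plan succeeded and the diff is empty
    if code == 2:
        return "drift"    # plan succeeded and the diff is non-empty
    return "error"        # exit code 1 (or anything unexpected)


def check_drift(workdir, runner=subprocess.run):
    """Run a read-only plan in `workdir` and report whether state has drifted.

    `runner` is injectable so the check can be exercised without Terraform.
    """
    result = runner(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)


if __name__ == "__main__":
    # Hypothetical usage: run from a scheduler and alert on anything but "in_sync".
    print("terraform drift check:", check_drift("."))
```

A non-empty plan can also mean that a checked-in change has not been applied yet, so a result of "drift" is a prompt for investigation and not proof that a node was changed out of band.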

Posted Jan 08, 2021 - 19:03 EST

Resolved
This incident has been resolved.
Posted Dec 23, 2020 - 13:46 EST
Monitoring
The Sandbox search cluster is fully operational, and Knowledge Graph is working as expected. We will continue to monitor the situation.
Posted Dec 23, 2020 - 11:19 EST
Identified
We identified a configuration change that had been incompletely applied as the proximate cause. We are restoring the configuration to a working state, and we expect to have completely mitigated the problem within 30 minutes.
Posted Dec 23, 2020 - 10:53 EST
Investigating
In Sandbox, the Elasticsearch cluster providing search over the Knowledge Graph is unavailable. We are investigating the issue.
Posted Dec 23, 2020 - 10:43 EST
This incident affected: Sandbox.