Service degradation in Live API, Answers Serving, and Pages Serving
Incident Report for Yext
Postmortem

Summary

From 4:30 PM ET to 5:25 PM ET on October 29th, Live API, Answers Serving, and Pages Locators had degraded service. Many requests failed, and many others were slower or returned incomplete results.

Root Cause

A failure in the service location layer of our US East service region led to a complete failure of that region. Service location is the layer of our infrastructure that allows our services to find and connect to each other. Our monitoring promptly alerted us to the issue, and we failed all serving over to our other regions. This led to some improvement, but those regions were not initially provisioned with enough capacity to handle the increased load. The issue was ultimately resolved by increasing provisioning in the other regions and resolving the underlying issue with service location.
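
As a rough illustration of why failing over brought only partial improvement, the sketch below checks whether the surviving regions could absorb the full load if any single region failed. It is a minimal example under stated assumptions, not a description of our actual tooling: the region names, capacities, and loads are hypothetical, and `failover_shortfalls` is a name made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    capacity_rps: float  # maximum requests/second the region can serve
    load_rps: float      # requests/second the region currently serves


def failover_shortfalls(regions: list[Region]) -> dict[str, float]:
    """For each region, return the capacity shortfall (in requests/second)
    that the surviving regions would face if that region failed entirely.
    A shortfall of 0 means the survivors can absorb the full load."""
    total_load = sum(r.load_rps for r in regions)
    shortfalls = {}
    for failed in regions:
        surviving_capacity = sum(
            r.capacity_rps for r in regions if r is not failed
        )
        shortfalls[failed.name] = max(0.0, total_load - surviving_capacity)
    return shortfalls


# Hypothetical numbers, chosen only to show an underprovisioned failover.
regions = [
    Region("us-east", capacity_rps=1000, load_rps=600),
    Region("us-west", capacity_rps=500, load_rps=300),
    Region("eu", capacity_rps=400, load_rps=200),
]

for name, shortfall in failover_shortfalls(regions).items():
    status = "OK" if shortfall == 0 else f"short by {shortfall:.0f} rps"
    print(f"if {name} fails: {status}")
```

With these made-up numbers, losing the largest region leaves the other two short of capacity, while losing either smaller region does not. A check of this kind only covers raw capacity; it says nothing about how quickly traffic can be shifted.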

Remediation

We have already increased provisioning both for the service location components in the US East region and for the underprovisioned components outside of the US East region. We will also regularly assess the capacity at all of our consumer serving sites to ensure we can continue to operate successfully and without delay if any one of them were to fail. Over the next month we will be:

  • Improving monitoring on our service location layer to identify problems earlier
  • Improving our system’s ability to withstand a temporary problem in the service location layer
Posted Nov 04, 2020 - 12:29 EST

Resolved
This incident has been resolved.
Posted Oct 29, 2020 - 20:00 EDT
Monitoring
We have completed checking for additional sources of errors, and all components have been stable for the last hour. We will continue to monitor for issues.
Posted Oct 29, 2020 - 18:54 EDT
Update
Service levels have returned to normal and all components are fully operational. We are continuing to check for sources of regressions.
Posted Oct 29, 2020 - 17:51 EDT
Identified
We have identified issues and implemented mitigations. We are continuing to check for regressions and services are recovering.
Posted Oct 29, 2020 - 17:42 EDT
Investigating
We are currently investigating a service degradation in Live API, Answers Serving, and Pages Serving. Latencies are elevated and some requests may return errors. Pages Locators may not load at this time. We will update as soon as we have more information.
Posted Oct 29, 2020 - 16:45 EDT
This incident affected: Content (Content API), Pages (Pages Serving), and Search (Search Serving).