Service Degradation in Customer Portal and API in Production and Sandbox
Incident Report for Yext
Postmortem

Summary

On October 27th, beginning at 8:10 p.m. ET, the Customer Portal and Platform API became unavailable. During this time, users were unable to log in to perform administrative functions, and API calls were not processed. Live API and Pages Serving were not impacted. Yext engineers restored service at 11:45 p.m. ET. The following day, October 28th, beginning at 9:05 p.m. ET, similar service disruptions were observed. Yext engineers immediately mitigated the issue and restored service at 10:00 p.m. ET.

Root Cause

A regular maintenance operation on a critical metadata service caused the server to crash. This led to a cascading failure on the servers that rely on the metadata service to serve the Customer Portal and Platform API. Downstream services attempting to reconnect further exacerbated the failure and prevented the metadata service from recovering. After rerouting internal traffic, we were able to restore the metadata service and, subsequently, all downstream services.
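The reconnect pattern described above, where many clients retry at once and keep a recovering service overloaded, is commonly softened with capped exponential backoff plus random jitter. The following is a minimal illustrative sketch of that general technique, not Yext's actual code. The names `backoff_delays` and `reconnect`, the `connect` and `sleep` callables, and all parameter values are hypothetical.

```python
import random


def backoff_delays(base=0.5, cap=30.0, attempts=6, rng=None):
    """Yield one randomized delay in seconds per reconnect attempt."""
    if rng is None:
        rng = random.Random()
    for attempt in range(attempts):
        # The ceiling doubles each attempt and is capped so waits stay bounded.
        ceiling = min(cap, base * (2 ** attempt))
        # "Full jitter": a uniform draw in [0, ceiling] spreads clients apart
        # so they do not all reconnect at the same instant.
        yield rng.uniform(0, ceiling)


def reconnect(connect, sleep, **backoff_kwargs):
    """Call connect() until it succeeds or the attempts are exhausted.

    `connect` and `sleep` are hypothetical callables supplied by the caller.
    """
    last_error = None
    for delay in backoff_delays(**backoff_kwargs):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            sleep(delay)  # wait before the next attempt
    raise ConnectionError("metadata service still unavailable") from last_error
```

In a real client, `sleep` would typically be `time.sleep`. Passing it in as a parameter keeps the sketch easy to exercise without actually waiting.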

Remediation

We have paused maintenance operations and are prioritizing work to improve the robustness of this critical metadata service, which will prevent future errors of this kind and minimize the impact of any failure. Additionally, we will be making changes to prevent cascading failures on downstream services and to remove dependencies on the metadata service.
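One widely used way to keep a failing dependency from dragging down its callers is a circuit breaker: after repeated failures, callers stop contacting the dependency for a cool-down period and serve a fallback instead. The sketch below illustrates that general pattern only. It is not a description of the changes Yext is making, and the class name, thresholds, and `fallback` callable are assumptions chosen for the example.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker sketch (hypothetical names and defaults)."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: skip the dependency entirely and serve the fallback.
                return fallback()
            # Cool-down elapsed: allow one trial call (half-open). The
            # failure count is kept, so a single failure re-opens the circuit.
            self.opened_at = None
        try:
            result = func()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # a success closes the circuit fully
        return result
```

While the circuit is open, the dependency receives no traffic from this caller, which gives it room to recover. The fallback might be a cached value or a degraded response, depending on what the downstream service can tolerate.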

Posted Nov 02, 2020 - 09:07 EST

Resolved
This incident has been resolved.
Posted Oct 28, 2020 - 23:19 EDT
Monitoring
The system is operational. We are continuing to monitor for regressions.
Posted Oct 28, 2020 - 22:11 EDT
Identified
We are investigating a service degradation in the Customer Portal and APIs in production and sandbox. Parts of the portal may be unavailable at this time. We will update as soon as we have more information.
Posted Oct 28, 2020 - 21:32 EDT
This incident affected: Content (Management API) and Customer Portal Login, Sandbox.