Elevated error rates for traffic in multiple regions
Incident Report for Vercel
Postmortem

RCA – Elevated Request Times

Summary of Impact: Between 14:48 UTC and 15:25 UTC on Sep 2, the Vercel Edge Network experienced elevated request times for global traffic. A chain of events caused our networking infrastructure to become overloaded under a specific traffic pattern. As a result, part of production traffic experienced elevated response times and, in some regions, elevated error rates.

Root Cause: On Sep 2, 14:48 UTC, the networking infrastructure that powers the Vercel Edge Network came under increased load, caused by elevated error rates from an upstream database and API. Because the hot path for production traffic depends on that database and API, the Vercel Edge Network was unable to serve requests in a timely manner.

Mitigation: On Sep 2, 15:00 UTC, the Vercel Infrastructure Team identified the root cause and began working with the relevant upstream provider to expand capacity. In parallel, the team started adding further caching of the aforementioned API so that the Vercel Edge Network is more resilient to degraded upstream availability.

On Sep 2, 15:15 UTC, the upstream provider concluded work to expand capacity, effectively mitigating the root cause. On Sep 2, 15:25 UTC, the Vercel Edge Network had fully recovered, and global traffic was operating nominally.
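
To illustrate the kind of caching mentioned in the mitigation, here is a minimal sketch of a stale-on-error cache: fresh values are served from memory, and if the upstream call fails the last known value is returned instead of an error. This is not Vercel's implementation, which the report does not describe. The class name `StaleOnErrorCache`, the `fetch` callback, and the `ttl_seconds` and `clock` parameters are all hypothetical and chosen only for this example.

```python
import time
from typing import Callable, Dict, Generic, Tuple, TypeVar

T = TypeVar("T")


class StaleOnErrorCache(Generic[T]):
    """Hypothetical cache for illustration only.

    Serves a cached value while it is younger than ttl_seconds. Once it
    expires, the upstream is asked again; if that call raises, the expired
    (stale) value is served instead of propagating the failure.
    """

    def __init__(
        self,
        fetch: Callable[[str], T],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # assumed upstream lookup, may raise
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, T]] = {}

    def get(self, key: str) -> T:
        now = self._clock()
        entry = self._entries.get(key)

        # Fresh hit: no upstream call at all.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]

        try:
            value = self._fetch(key)
        except Exception:
            # Upstream is degraded: fall back to the stale value if we have one.
            if entry is not None:
                return entry[1]
            raise  # nothing cached for this key, so the error must surface

        self._entries[key] = (now, value)
        return value
```

The trade-off in this design is that a caller may see outdated data while the upstream is unhealthy, in exchange for the hot path continuing to answer. Keys that were never fetched successfully still fail, so a cache like this reduces the dependency on the upstream rather than removing it.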

Next Steps: Vercel is fully committed to preventing any degradation of production traffic. We are working closely with the relevant upstream provider to increase their resiliency and, in parallel, rearchitecting the relevant networking components to reduce or remove the dependency on this specific upstream provider.

  • [COMPLETED] Statically increase the capacity of the upstream networking provider by one order of magnitude
  • [ONGOING] Work with the upstream networking provider to improve auto-scaling behavior
  • [ONGOING] Optimize the relevant networking hot paths in order to prevent failures when the upstream is unavailable
Posted Sep 07, 2021 - 19:50 UTC

Resolved
This incident has been resolved.
Posted Sep 02, 2021 - 17:07 UTC
Update
All systems normal.
Posted Sep 02, 2021 - 17:06 UTC
Monitoring
The issue affecting a few regions has been mitigated. We are continuing to monitor.
Posted Sep 02, 2021 - 16:20 UTC
Update
The issue affecting a few regions has been mitigated. We are continuing to monitor.
Posted Sep 02, 2021 - 16:19 UTC
Identified
We are currently investigating elevated error rates for traffic in multiple regions. We will provide further updates when available.
Posted Sep 02, 2021 - 15:23 UTC