Starting at 9:19 am ET, KB-Web experienced a large increase in latency, which caused problems for clients loading their knowledge base sites.
KB-Web load times were very slow, and requests sometimes timed out, for the duration of the incident (9:06 am to 11:01 am ET). Only prod1 was affected.
Cloudflare started maintenance on 4/19/2021 at 12:08 am ET. At 1:10 am ET, they disabled IAD (Ashburn). At 8:20 am ET, they enabled Ashburn with a single edge router and immediately started to see congestion to their origin network. By 10:01 am ET, edge02 started serving traffic. The issue was resolved when Cloudflare engineers enabled the second edge router, which brought up the second PNI with the origin network and removed the congestion.
Certain external connections on a single edge router were disaggregated and split across two edge routers. Unfortunately, not all of those links can sustain the traffic levels individually. There appears to have been a discrepancy in the recorded metrics: one dashboard showed the link congested at 100 Gbps, while another data stream (and the metrics Cloudflare engineers were watching during the maintenance) showed only 80 Gbps. The automated page for interface congestion did not trigger either, because the reported metric was 81% interface utilisation. As a result, Cloudflare did not know there was a problem until customers reported issues. Error rates in Ashburn were well below alerting levels and within the normal range.
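The alerting gap described above can be illustrated with a minimal sketch. This is not Cloudflare's alerting logic. The 100 Gbps link capacity, the 90% paging threshold, the 10-point disagreement tolerance, and the names `LinkReading` and `alert_reasons` are all assumptions made for illustration. The sketch shows why a page keyed to a single source reporting 81% would stay silent, and how cross-checking two sources could surface the discrepancy.

```python
from dataclasses import dataclass


@dataclass
class LinkReading:
    source: str   # hypothetical name of the metrics source
    gbps: float   # throughput reported by that source


def alert_reasons(readings, capacity_gbps,
                  page_threshold_pct=90.0,      # assumed paging threshold
                  max_disagreement_pct=10.0):   # assumed tolerance between sources
    """Return the list of reasons to alert for one link (empty if none)."""
    # Convert each reading to percent utilisation of the link capacity.
    pcts = [100.0 * r.gbps / capacity_gbps for r in readings]
    reasons = []
    # Page if any source says the link is at or above the threshold.
    if max(pcts) >= page_threshold_pct:
        reasons.append("congestion")
    # Flag the link if the sources disagree by more than the tolerance.
    if max(pcts) - min(pcts) > max_disagreement_pct:
        reasons.append("source_disagreement")
    return reasons


# Only the lower-reading source is consulted: nothing fires.
single = alert_reasons([LinkReading("data_stream", 81.0)], capacity_gbps=100.0)

# Both sources are consulted: congestion and the discrepancy are both flagged.
both = alert_reasons(
    [LinkReading("dashboard", 100.0), LinkReading("data_stream", 81.0)],
    capacity_gbps=100.0,
)
print(single, both)
```

Under these assumed thresholds, alerting on the highest reading across sources, or on disagreement between them, would have raised a signal before customers reported issues.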
Here is a link to their incident.