Knowledge base delays
Incident Report for Kustomer
Postmortem

Summary

Starting at 9:19 am ET, KB-Web experienced a large increase in latencies, causing delays for clients loading their knowledge base sites.

What happened

  • Starting at 9:19 am ET, KB-Web experienced a large increase in latencies. The increase followed a recent release, but was due to the degradation of Cloudflare in Ashburn, VA (us-east-1 region).
  • At 9:57 am ET, Kustomer released a new deployment of our KB-API to reset memory, a workaround for a known KB-API memory leak, in hopes that this would solve the problem.
  • At 10:01 am ET, Cloudflare reported they were re-routing traffic to bypass the affected region.
  • At 10:45 am ET, we rolled back our previous deployment to rule it out as a cause.
  • At 11:01 am ET, Cloudflare retroactively declared the end of degradation and latencies began to drastically improve.
  • At 11:15 am ET, we re-deployed our previous deployment with no impact on latency.

Impact

Loading times for KB-Web were very slow, and requests sometimes timed out, for the duration of the incident (9:06 am to 11:01 am ET). This only affected prod1.

Technical details

Cloudflare started maintenance on 4/19/2021 at 12:08 am ET. At 1:10 am ET, they disabled IAD. At 8:20 am ET, they enabled Ashburn with a single edge router and started to see immediate congestion to their origin network. By 10:01 am ET, edge02 had started serving traffic. The issue was resolved when Cloudflare engineers enabled the second edge router, which brought up the second PNI (private network interconnect) with the origin network and removed the congestion.

Certain external connections on a single edge were disaggregated and split across 2 edge routers. Unfortunately, not all of those links can sustain the traffic levels individually. There appears to have been a discrepancy in recorded metrics: one dashboard showed the link congested at 100 Gbps, while another data stream showed only 80 Gbps (as did the metrics Cloudflare engineers were looking at during the maintenance). The automated page for interface congestion did not trigger either, because the reported metric was 81% interface utilisation. As a result, Cloudflare did not know about the problem until customers reported issues. Error rates in Ashburn were well below alerting levels and within the range Cloudflare would normally see.
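
To illustrate how the same link can look healthy in one data stream and congested in another, here is a minimal Python sketch of a cross-check between telemetry sources. It is not Cloudflare's tooling. The source names, the 100 Gbps link capacity, the 90% paging threshold, and the 5% allowed spread are all assumptions chosen for illustration; only the 100 Gbps, 80 Gbps, and 81% readings come from the report above.

```python
from dataclasses import dataclass


@dataclass
class LinkReading:
    source: str   # hypothetical name of a telemetry source
    gbps: float   # throughput that source reports for the link


def check_link(readings, capacity_gbps, congestion_threshold=0.90, max_spread=0.05):
    """Return alert names for one link, given readings from several sources.

    congestion_threshold and max_spread are assumed values, not Cloudflare's.
    """
    utilisation = {r.source: r.gbps / capacity_gbps for r in readings}
    highest = max(utilisation.values())
    lowest = min(utilisation.values())

    alerts = []
    # Page on the worst reading so that one optimistic source cannot hide congestion.
    if highest >= congestion_threshold:
        alerts.append("congestion")
    # Flag disagreement between sources as a problem in its own right.
    if highest - lowest > max_spread:
        alerts.append("discrepancy")
    return alerts


# Both views of the link, assuming a 100 Gbps capacity.
both_sources = [LinkReading("dashboard_a", 100.0), LinkReading("stream_b", 80.0)]
print(check_link(both_sources, capacity_gbps=100.0))

# Only the lower stream, as during the maintenance: nothing fires at 81%.
one_source = [LinkReading("stream_b", 81.0)]
print(check_link(one_source, capacity_gbps=100.0))
```

Under these assumptions, a check that reads only the lower stream stays silent, while comparing both sources raises a congestion alert and a discrepancy alert.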

Here is a link to their incident.

Lessons & Action Items

  • Cloudflare will analyze why the metrics that were being used during the maintenance showed a 20% discrepancy to make sure all metrics are accurate.
  • Cloudflare will ensure that all metrics are being checked during the maintenance to ensure that no errors or utilization issues are missed when bringing capacity back after maintenance.
  • Cloudflare will improve its network telemetry pipeline to remove all discrepancies and ensure everything is accurate.
  • Kustomer will notify internal teams more quickly when alerts are triggered.
Posted Jan 10, 2022 - 15:37 EST

Resolved
Kustomer is no longer experiencing a delay with the loading of knowledge base sites and all issues have been resolved.

If you are still experiencing issues or have additional concerns, please reach out to support@kustomer.com.
Posted Apr 19, 2021 - 11:28 EDT
Monitoring
Kustomer is currently experiencing a delay with the loading of knowledge base sites. A fix for the current knowledge base latency is underway with the results being monitored and assessed.

Please reach out to our Support team with any additional questions. You can reach us by going to https://help.kustomer.com/ and clicking "Contact Support" at the top of the page.
Posted Apr 19, 2021 - 11:16 EDT
Investigating
Kustomer is currently experiencing a delay with the loading of knowledge base sites. We are collecting information about the issue and our team is working to resolve it as quickly as possible. We will provide an update soon.

Please reach out to our Support team with any additional questions. You can reach us by going to https://help.kustomer.com/ and clicking "Contact Support" at the top of the page.
Posted Apr 19, 2021 - 10:56 EDT
This incident affected: Prod1 (US) (Knowledge base).