Elevated latencies and errors in the US-East PoP

Incident Report for PubNub

Postmortem

Problem Description, Impact, and Resolution

At 10:27 UTC on 15 April 2021, we observed elevated latencies and errors in our US-East point of presence (PoP), affecting all services. The initial cause was a spike in usage of the Access Manager service. The spike itself was quickly resolved, but the way the system handles errors and retries produced cascading effects that prolonged the issues as they moved downstream. The system recovered without intervention by 10:43 UTC, although recovery took much longer than designed because of the cascading effects.

Mitigation Steps and Recommended Future Preventative Measures

In the short term, we are considering safeguards on errors and retries for the Access Manager service that would speed resolution if a similar issue were to occur. In coming sprints, the team will also look to re-architect how the system handles Access Manager errors, in order to prevent a bottleneck from moving downstream.
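
For illustration only, one common form of safeguard on errors and retries combines capped exponential backoff with jitter and a retry budget, so that retries cannot multiply load on a service that is already struggling. The sketch below is not a description of PubNub's implementation. Every name in it (RetryBudget, backoff_delay, call_with_retries) and every default value is a hypothetical chosen for the example.

```python
import random
import time


class RetryBudget:
    """Hypothetical budget: retries may not exceed a fraction of requests seen."""

    def __init__(self, ratio=0.1, min_retries=1):
        self.ratio = ratio              # assumed: retries allowed per request
        self.min_retries = min_retries  # always allow at least this many
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_acquire(self):
        # Permit a retry only while total retries stay under the budget.
        allowed = max(self.min_retries, int(self.requests * self.ratio))
        if self.retries < allowed:
            self.retries += 1
            return True
        return False


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Full-jitter delay: a random value below min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(op, budget, max_attempts=4, sleep=time.sleep):
    """Call op(), retrying on failure only while attempts and budget allow."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not budget.try_acquire():
                raise  # fail fast instead of adding more load downstream
            sleep(backoff_delay(attempt))
```

With a budget in place, a burst of failures results in a bounded number of retries rather than every caller retrying at once, which is the kind of amplification that can carry a bottleneck downstream.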

Posted Apr 22, 2021 - 21:49 UTC

Resolved

We have seen improvements since 10:43 UTC. We will continue to monitor for the next 30 minutes.

Posted Apr 15, 2021 - 10:51 UTC

Investigating

At about 10:27 UTC, services began to experience elevated latencies and errors in the US-East PoP. PubNub Technical Staff is investigating and more information will be posted as it becomes available.

If you are experiencing issues that you believe to be related to this incident, please report the details to PubNub Support (support@pubnub.com).

Posted Apr 15, 2021 - 10:41 UTC