Elevated Latencies Detected for US-East Region
Incident Report for PubNub
Postmortem

Problem Description, Impact, and Resolution 

We observed a novel traffic pattern that caused an unexpected increase in load on our auth services, which in turn increased latency for customers using several services in the us-east-1 region. The root cause was that this traffic pattern interacted in unforeseen ways with our configurations, preventing additional capacity from being utilized when it was brought online. We also uncovered a bug in our balancer configuration that limited the number of auth servers the services could use; this likely exacerbated the issues we observed.

Mitigation Steps and Recommended Future Preventative Measures 

We resolved the bug in the balancer configuration and changed how configurations are applied when this traffic pattern, or one like it, occurs. In addition, we have increased the base capacity of the auth services so that we do not see another increase in auth latency while we observe whether the fixes have the desired effect.

Posted Oct 06, 2020 - 17:00 UTC

Resolved
This incident has been resolved.
Posted Sep 20, 2020 - 17:30 UTC
Update
Latencies have been normal for over 30 minutes so we are moving to Resolved status. If you are experiencing any latency impact, please report it to PubNub Support and we will respond with urgency.
Posted Sep 20, 2020 - 17:30 UTC
Update
We are continuing to monitor for any further issues.
Posted Sep 20, 2020 - 16:44 UTC
Monitoring
Latencies have been in the normal range. We will monitor for the next 30 minutes and move to Resolved if latencies remain stable.
Posted Sep 20, 2020 - 16:40 UTC
Identified
The issue has been identified and latencies are decreasing. If that trend continues we will move to Monitoring status.
Posted Sep 20, 2020 - 15:29 UTC
Investigating
We are currently investigating elevated latencies in the US-East Region. Impact appears to be minimal at the moment.
Posted Sep 20, 2020 - 14:19 UTC
This incident affected: Realtime Network (Publish/Subscribe Service, Storage and Playback Service, Stream Controller Service, Presence Service, Access Manager Service, Mobile Push Gateway) and Points of Presence (North America Points of Presence).