Beginning at 14:20 UTC on 2020-10-28, the PubNub History and Push services experienced a sharp increase in latency in the US-east region. A node failed and was unable to restart quickly, which delayed service restoration. Manual operation tools were deployed to fully restart the system by 15:05 UTC, bringing the incident to full resolution.
To prevent a similar issue from occurring in the future, we have introduced additional health checks to ensure unhealthy nodes are removed from service. In addition, we are making longer-term changes to service discovery and load balancing on this service to ensure that a failed node does not impact latencies.
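
For illustration only, the sketch below shows one common way a health check can take an unhealthy node out of service: probe each node periodically and eject it after a set number of consecutive failures. It is not the implementation used by the affected services. The `HealthCheckedPool` class, the `probe` callback, the node names, and the `max_failures` threshold are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class HealthCheckedPool:
    """Tracks nodes and ejects any that fail too many consecutive probes."""

    probe: Callable[[str], bool]  # hypothetical: returns True if the node is healthy
    max_failures: int = 3         # assumed threshold, not from the incident report
    failures: Dict[str, int] = field(default_factory=dict)
    in_service: List[str] = field(default_factory=list)

    def add(self, node: str) -> None:
        """Put a node into service with a clean failure count."""
        if node not in self.in_service:
            self.in_service.append(node)
        self.failures[node] = 0

    def run_checks(self) -> List[str]:
        """Probe every in-service node once; return the nodes removed this round."""
        removed = []
        for node in list(self.in_service):  # copy, since we may remove while iterating
            try:
                healthy = self.probe(node)
            except Exception:
                healthy = False  # a probe that errors counts as a failed check
            if healthy:
                self.failures[node] = 0
                continue
            self.failures[node] += 1
            if self.failures[node] >= self.max_failures:
                self.in_service.remove(node)
                removed.append(node)
        return removed


# Usage: node-b never answers its probe, so it is ejected on the second round.
down = {"node-b"}
pool = HealthCheckedPool(probe=lambda n: n not in down, max_failures=2)
for name in ("node-a", "node-b"):
    pool.add(name)
pool.run_checks()  # node-b has one failure and is still in service
pool.run_checks()  # node-b reaches the threshold and is removed
```

Requiring several consecutive failures before ejection is a design choice that trades a slightly slower reaction for fewer false removals caused by a single dropped probe.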