At approximately 14:20 UTC (06:20 PST), all services started to experience elevated latencies in the Tokyo PoP

Incident Report for PubNub

Postmortem

Problem Description, Impact, and Resolution

At 14:18 UTC on 2021-02-19, we observed connection failures in our Tokyo PoP caused by a hardware failure at our underlying cloud provider, which the provider declared as an incident. The failures at the cloud provider triggered multiple alerts, which buried the alert that pointed to the core issue. Had that alert been visible, we would have identified the failure much earlier and resolved it faster. Once the issue was identified, we were able to quickly route around the provider at 17:05 UTC.

There was no further customer impact after 18:25 UTC. At 21:01 UTC, the underlying hardware issue recovered, and we returned our system to its normal operating posture and resolved the incident.

Mitigation Steps and Recommended Future Preventative Measures

To reduce the impact of a similar issue in the future, we are going to create better alerts for these kinds of edge failures so that we can identify the issue and take action more quickly.
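As an illustration only, the kind of alert triage described above could be sketched as follows. This is a minimal sketch, not PubNub's actual alerting system: the `Alert` type, the `layer` labels, the `LAYER_PRIORITY` ordering, and the `triage` function are all hypothetical names introduced for this example. It assumes each alert is tagged with the PoP it came from and the layer it belongs to, and it lists likely causes ahead of likely symptoms so that a provider-level alert is not buried under service-level alerts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    pop: str    # point of presence, e.g. "tokyo"
    layer: str  # hypothetical label: "provider", "edge", or "service"


# Assumed ordering: a lower number is closer to the root of the dependency chain.
LAYER_PRIORITY = {"provider": 0, "edge": 1, "service": 2}


def triage(alerts):
    """Group alerts by PoP and list likely causes before likely symptoms."""
    grouped = {}
    for alert in alerts:
        grouped.setdefault(alert.pop, []).append(alert)
    for pop_alerts in grouped.values():
        # Unknown layers sort last; the sort is stable, so ties keep arrival order.
        pop_alerts.sort(
            key=lambda a: LAYER_PRIORITY.get(a.layer, len(LAYER_PRIORITY))
        )
    return grouped


if __name__ == "__main__":
    incoming = [
        Alert("subscribe latency high", "tokyo", "service"),
        Alert("publish errors", "tokyo", "service"),
        Alert("provider connection failures", "tokyo", "provider"),
    ]
    for pop, pop_alerts in triage(incoming).items():
        print(pop, [a.name for a in pop_alerts])
```

With this ordering, the provider-level alert for a PoP is shown first even when it arrives after a burst of service-level alerts for the same PoP.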

Posted Feb 24, 2021 - 01:37 UTC

Resolved

The incident has not resurfaced for the past 90 minutes, the cloud provider has declared their incident resolved, and we have stopped routing around the issue.

We are resolving this issue, and we will follow up with a post-mortem once we have collected and analyzed all the data.

We apologize for the impact this may have had on your service. Please reach out to us by contacting PubNub Support (support@pubnub.com) if you wish to discuss the impact on your service.

Posted Feb 19, 2021 - 21:33 UTC

Update

All services have been operating as expected since 18:25 UTC (10:25 PST). The underlying cloud provider hardware issue is still occurring and we are continuing to route around these known failures. We will continue to monitor this closely for at least the next 60 minutes before we determine that this incident is resolved.

Posted Feb 19, 2021 - 20:06 UTC

Monitoring

Currently, there are no known customer issues. The underlying cloud provider hardware issue is still occurring and we are continuing to route around these known failures. We will continue to keep an eye out for new failures and mitigate any issues they may cause.

Posted Feb 19, 2021 - 18:25 UTC

Investigating

Engineering is still working to resolve the underlying issue. We will continue to provide timely updates to report any changes or progress.

Posted Feb 19, 2021 - 18:01 UTC

Update

The engineering team has identified an underlying problem with our cloud provider's infrastructure and is taking the necessary steps to mitigate and resolve it as quickly as possible. We have routed around the failed nodes to restore service. We will remain in this posture until the underlying issue is resolved.

Posted Feb 19, 2021 - 17:09 UTC

Identified

The engineering team has identified the issue and is taking the necessary steps to mitigate and resolve it as quickly as possible. Engineering is currently routing traffic around the affected region.

Posted Feb 19, 2021 - 17:05 UTC

Investigating

All services are experiencing elevated latencies in the Tokyo PoP starting at approximately 14:20 UTC (06:20 PST). PubNub Technical Staff is investigating and more information will be posted as it becomes available.

If you are experiencing issues that you believe to be related to this incident, please report the details to PubNub Support (support@pubnub.com).

Posted Feb 19, 2021 - 16:56 UTC

This incident affected: Realtime Network (Publish/Subscribe Service, Storage and Playback Service, Stream Controller Service, Presence Service, Access Manager Service, Mobile Push Gateway), Functions (Functions Service), and Points of Presence (Asia Pacific Points of Presence).