Presence Errors and latency globally, delayed presence events, and PubSub message delays for a small number of customers.
Incident Report for PubNub
Postmortem

Problem Description, Impact, and Resolution

At 15:42 UTC on 2/9/2021 and again at 16:36 UTC on 2/10/2021 we observed Presence API latency and errors. These had a cascading effect on some customer implementations of Presence, producing a spike in traffic that briefly delayed message writes to Storage and the sending of Push notifications. We determined that the root cause was a database in the critical path for the Presence service reaching its capacity limits. To mitigate the immediate issue we restarted the impacted processes to clear the backlog and added capacity in the affected regions. Presence was operating normally at 16:12 UTC, and the issue was fully resolved at 17:32 UTC. Directly following this incident we accelerated work that was already in progress to upgrade the database in the critical path, which is a multi-week process.

Unfortunately, the same issue recurred at 16:40 UTC on 2/10/2021. Only Presence was impacted during this incident; the Subscribe, Storage, and Push services were not affected. We again restarted the impacted processes to clear the backlog and saw Presence operating normally at 18:10 UTC. The incident recurred because the short-term optimizations to the database were not yet complete.

Mitigation Steps and Recommended Future Preventative Measures

To prevent a similar issue in the future we are upgrading the database for our Presence services, work that will continue over the next few weeks. In the meantime, we have made some optimizations to the Presence service, as well as to our monitoring and processes, so that we address systems that would be impacted before they reach a critical state.


Problem Description, Impact, and Resolution

At 19:18 UTC on 2/10/2021 we observed that some customers who were using TLS and ps.pndsn.com, pubnub.net, or most *.pubnubapi.com origins were unable to connect to our network. The issue was resolved at 00:08 UTC on 2/11/2021. It occurred because one of our third-party vendors encountered an issue on their end that left them unable to process our TLS traffic.

Mitigation Steps and Recommended Future Preventative Measures

We quickly routed traffic around the affected systems, and service returned to normal for most customers within one hour. Traffic was later restored for all customers after our provider resolved their issue.
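
The status updates for this incident note that clients which do not respect DNS cache TTL may keep failing to connect after traffic has been routed elsewhere. The following is a minimal sketch of one way a client could avoid pinning a stale address, assuming a plain TCP connection: it asks the resolver again before every connection attempt instead of reusing a previously resolved IP. The names `resolve_fresh` and `connect_with_reresolve` are hypothetical and are not part of any PubNub SDK. TLS wrapping is omitted, and the operating system's resolver may apply its own caching.

```python
import socket
import time


def resolve_fresh(host, port):
    """Ask the system resolver for current addresses instead of reusing a stored IP."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); keep unique addresses in order.
    seen, addrs = set(), []
    for family, _type, _proto, _canon, sockaddr in infos:
        if sockaddr not in seen:
            seen.add(sockaddr)
            addrs.append((family, sockaddr))
    return addrs


def connect_with_reresolve(host, port, attempts=3, delay=1.0,
                           resolve=resolve_fresh, dial=None, sleep=time.sleep):
    """Re-resolve `host` before every attempt so a DNS change is picked up."""
    if dial is None:
        def dial(family, sockaddr):
            sock = socket.socket(family, socket.SOCK_STREAM)
            sock.settimeout(5.0)
            try:
                sock.connect(sockaddr)
            except OSError:
                sock.close()
                raise
            return sock

    last_error = None
    for attempt in range(attempts):
        try:
            candidates = resolve(host, port)  # fresh lookup on every attempt
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            last_error = exc
            candidates = []
        for family, sockaddr in candidates:
            try:
                return dial(family, sockaddr)
            except OSError as exc:
                last_error = exc
        if attempt < attempts - 1:
            sleep(delay)
    raise ConnectionError(f"could not connect to {host}:{port}") from last_error
```

The `resolve`, `dial`, and `sleep` parameters exist only so the retry logic can be exercised without network access. A real client would typically also add backoff and wrap the returned socket with TLS.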

Posted Feb 13, 2021 - 01:07 UTC

Resolved
This incident has been resolved.
Posted Feb 10, 2021 - 23:35 UTC
Update
All systems are operating normally. We are still in monitoring state with our traffic routed around the failure, while our provider continues to work on their fixes.
Posted Feb 10, 2021 - 23:05 UTC
Monitoring
We are in a monitoring state while our provider works on a fix. We have routed traffic around the failure and continue to monitor closely on our side. At present we see no issues.
Posted Feb 10, 2021 - 22:15 UTC
Update
We have routed around the failure in our provider. Most customers should have fully recovered by 20:00 UTC, with conditions continuing to improve until 20:30 UTC. However, clients that do not respect DNS cache TTL may still see problems connecting.

All systems are functioning normally, except that customers may experience some Presence latency, which could lead to timeouts.
Posted Feb 10, 2021 - 21:57 UTC
Update
We have routed around the failure in our provider, and clients should be able to connect to our servers normally. However, clients that do not respect DNS cache TTL may still see problems connecting.

All systems are functioning normally, except that customers may experience some Presence latency, which could lead to timeouts.
Posted Feb 10, 2021 - 21:43 UTC
Update
Customers may still experience trouble connecting to our Virginia-USA POP due to a failure reported by one of our providers. We are continuing to work on a fix for this issue.
Posted Feb 10, 2021 - 20:28 UTC
Update
Customers may still experience trouble connecting to our Virginia-USA POP. We are continuing to work on a fix for this issue.
Posted Feb 10, 2021 - 20:11 UTC
Update
Customers may experience trouble connecting to our Virginia-USA POP. We are continuing to work on a fix for this issue.
Posted Feb 10, 2021 - 19:40 UTC
Update
We have observed very few service errors since 18:07 UTC as we continue to work on an elevated latency fix.
Posted Feb 10, 2021 - 19:22 UTC
Update
We have observed very few service errors since 18:07 UTC as we continue to work on an elevated latency fix.
Posted Feb 10, 2021 - 18:50 UTC
Update
We continue to observe errors as we work on a fix for this issue.
Posted Feb 10, 2021 - 18:38 UTC
Update
We have observed a recurrence of errors as we continue to work on a fix for this issue.
Posted Feb 10, 2021 - 18:07 UTC
Identified
We have observed very few errors in the past 10 minutes as we continue to investigate.
Posted Feb 10, 2021 - 17:52 UTC
Update
We are continuing to investigate this issue.
Posted Feb 10, 2021 - 17:26 UTC
Update
We are continuing to investigate this issue.
Posted Feb 10, 2021 - 17:01 UTC
Investigating
We are currently investigating this issue.
Posted Feb 10, 2021 - 16:57 UTC
This incident affected: Realtime Network (Publish/Subscribe Service, Storage and Playback Service, Presence Service, Mobile Push Gateway) and Points of Presence (North America Points of Presence, European Points of Presence, Asia Pacific Points of Presence, Southern Asia Points of Presence).