Delays in receiving Presence events, Push notifications, and messages being saved to Storage

Incident Report for PubNub

Postmortem

Problem Description, Impact, and Resolution

At 15:42 UTC on 2/9/2021 and again at 16:36 UTC on 2/10/2021, we observed Presence API latency and errors. These had a cascading effect on some customer implementations of Presence, producing a spike in traffic that caused brief delays in message writes to Storage and in sending Push notifications. We determined that the root cause was a database in the critical path for the Presence service reaching its capacity limits. To mitigate the immediate issue, we restarted the impacted processes to clear the backlog and added capacity in the affected regions. Presence was operating normally at 16:12 UTC, and the issue was fully resolved at 17:32 UTC. Directly following this incident, we accelerated work that was already in progress to upgrade the database in the critical path, which is a multi-week process.

Unfortunately, we saw the same issue recur at 16:40 UTC on 2/10/2021. Only Presence was impacted during this incident; the Subscribe, Storage, and Push services were not affected. We again restarted the impacted processes to clear the backlog and saw Presence operating normally at 18:10 UTC. The incident recurred because we had not yet completed the short-term optimizations on the database.

Mitigation Steps and Recommended Future Preventative Measures

To prevent a similar issue in the future, we are upgrading the database for our Presence services; this work will continue over the next few weeks. In the meantime, we have made optimizations to the Presence service, as well as to our monitoring and processes, so that we can address systems that would be impacted before they reach a critical state.

Posted Feb 13, 2021 - 01:02 UTC

Resolved

We're genuinely sorry for the disruption today. We'll be back with a Root Cause Analysis of this issue.

Posted Feb 09, 2021 - 18:17 UTC

Monitoring

A fix has been implemented, and we are monitoring the results for the next 30 minutes.

Posted Feb 09, 2021 - 17:31 UTC

Update

A fix has been deployed, and we have seen improvements in Presence and in delayed Push messages since 16:12 UTC. We are continuing to work on delays in writing to Storage.

Posted Feb 09, 2021 - 16:34 UTC

Update

We are continuing to work on a fix for this issue.

Posted Feb 09, 2021 - 16:16 UTC

Identified

Some customers may have experienced delays in receiving Presence events, Push notifications, and messages being saved to Storage.

Posted Feb 09, 2021 - 16:10 UTC

This incident affected: Points of Presence (North America Points of Presence, Southern Asia Points of Presence) and Realtime Network (Storage and Playback Service, Presence Service, Mobile Push Gateway).