The Presence service is reporting errors
Incident Report for PubNub
Postmortem

Problem Description, Impact, and Resolution

At 00:16 UTC on 4/5/2021, we observed increased Presence latency that resulted in increased timeouts and degraded channel join times. We determined that the root cause was a lack of proper alerting for the database in the critical path of the Presence service, so the database was not scaled appropriately. During the incident, the database was re-tuned repeatedly to ease the load, and Presence latencies returned to normal levels by 03:08 UTC; however, the database component retained a large processing backlog. Further re-tuning changes were made (along with some server process restarts) to clear the backlog at 04:20 UTC, and all services returned to normal.

Mitigation Steps and Recommended Future Preventative Measures 

The underlying issues in this incident were database tuning and alerting. For the short term, we have adopted a tuning strategy that handles the traffic volume without adversely affecting performance. In addition, a new caching implementation will be rolled out to reduce the effect of traffic bursts on the database. In the near future, we will also resize the database more aggressively, moving to a larger cluster to support increased demand on the Presence service.

Posted Apr 07, 2021 - 19:27 UTC

Resolved
We're genuinely sorry for the disruption today. We'll be back with a Root Cause Analysis of this issue.
Posted Apr 05, 2021 - 05:06 UTC
Monitoring
A fix has been deployed, and we have seen improvements in Presence latency since 04:30 UTC. We'll continue to monitor for the next 30 minutes.
Posted Apr 05, 2021 - 04:36 UTC
Identified
A fix has been deployed, and we have seen improvements in Presence error rates since 03:00 UTC. We are continuing to work on Presence join latency.
Posted Apr 05, 2021 - 03:51 UTC
Update
We are continuing to investigate this issue.
Posted Apr 05, 2021 - 02:56 UTC
Update
A fix has been deployed in the EU region PoP, and we have seen improvements in Presence latency and error rates there since 01:44 UTC. We are continuing to work on Presence latency and error rates in the US regions.
Posted Apr 05, 2021 - 02:18 UTC
Update
Engineering is still working to resolve the underlying issue. We will continue to provide timely updates to report any changes or progress. If you are experiencing issues related to this incident, you may report details to PubNub Support (support@pubnub.com). Please provide as much detail as possible: sub-key, logs, errors, timestamps, etc.
Posted Apr 05, 2021 - 01:45 UTC
Update
We are continuing to investigate this issue.
Posted Apr 05, 2021 - 01:07 UTC
Investigating
At about 00:07 UTC (17:07 PDT), the Presence service began to report errors in all regions. PubNub Technical Staff is investigating, and more information will be posted as it becomes available.

If you are experiencing issues that you believe to be related to this incident, please report the details to PubNub Support (support@pubnub.com).
Posted Apr 05, 2021 - 00:29 UTC
This incident affected: Points of Presence (North America Points of Presence, European Points of Presence, Asia Pacific Points of Presence, Southern Asia Points of Presence) and Realtime Network (Presence Service).