At 00:16 UTC on April 4, 2021, we observed increased Presence latency, which resulted in increased timeouts and degraded channel join times. We determined the root cause to be a lack of proper alerting on the database in the critical path of the Presence service, which meant the database was not scaled appropriately. During the incident, the database was re-tuned repeatedly to ease the load, and Presence latencies returned to normal levels by 03:08 UTC; however, the database component retained a large processing backlog. Further tuning changes (along with some server process restarts) cleared the backlog by 04:20 UTC, and all services returned to normal.
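To make the alerting gap concrete, below is a minimal sketch of the kind of threshold check that would have paged us as the database approached saturation. It assumes a hypothetical metrics backend and paging hook, and the metric names and thresholds shown are placeholders, not our production configuration.

```python
# Illustrative only: a periodic check that pages when the database in the
# Presence critical path shows sustained load. Metric names, thresholds,
# and the metrics/paging hooks are hypothetical placeholders.
import time

LATENCY_P99_MS_THRESHOLD = 250   # hypothetical paging threshold
BACKLOG_THRESHOLD = 10_000       # hypothetical queued-work threshold
CHECK_INTERVAL_SECONDS = 60


def fetch_metric(name: str) -> float:
    """Stand-in for querying a real metrics backend."""
    return 0.0  # placeholder value


def page_oncall(message: str) -> None:
    """Stand-in for a real paging integration."""
    print(f"PAGE: {message}")


def check_presence_db() -> None:
    latency_p99 = fetch_metric("presence_db_query_latency_p99_ms")
    backlog = fetch_metric("presence_db_processing_backlog")
    if latency_p99 > LATENCY_P99_MS_THRESHOLD:
        page_oncall(f"Presence DB p99 latency {latency_p99:.0f} ms over threshold")
    if backlog > BACKLOG_THRESHOLD:
        page_oncall(f"Presence DB backlog of {backlog:.0f} items over threshold")


if __name__ == "__main__":
    while True:
        check_presence_db()
        time.sleep(CHECK_INTERVAL_SECONDS)
```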
The underlying issues in this incident were database tuning and alerting. In the short term, we have adopted a tuning strategy that handles the current traffic volume without adversely affecting performance. In addition, a new caching implementation will be rolled out to reduce the impact of traffic bursts on the database. In the near future, we will also more aggressively resize the database to a larger cluster to support increased demand on the Presence service.
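The details of the new caching layer are not covered here; as a rough illustration of the idea, the sketch below shows a read-through cache with a short TTL in front of Presence reads, so a burst of identical requests results in a single database query. All names and the TTL value are hypothetical.

```python
# Illustrative only: a read-through cache with a short TTL in front of the
# Presence database, so bursts of identical reads are served from memory
# instead of hitting the database. Names and values are hypothetical.
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    def __init__(self, ttl_seconds: float, fetch: Callable[[str], dict]):
        self.ttl = ttl_seconds
        self.fetch = fetch                       # falls through to the database
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> dict:
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]                      # fresh cached value, no DB read
        value = self.fetch(key)                  # single database read
        self._entries[key] = (now, value)
        return value


def load_presence_from_db(user_id: str) -> dict:
    """Stand-in for the real database query."""
    return {"user_id": user_id, "status": "online"}


presence_cache = TTLCache(ttl_seconds=5.0, fetch=load_presence_from_db)

# A burst of reads for the same user results in one database query
# until the 5-second TTL expires.
for _ in range(1000):
    presence_cache.get("user-123")
```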