Southern Asia PoP may have experienced delays with messages being written to Storage

Incident Report for PubNub

Postmortem

Problem Description, Impact, and Resolution

The incident started at about 21:29 UTC (13:29 PST). Due to extremely high CPU usage, messages published in our Mumbai PoP were not being written to Storage. Storage reads still succeeded, with some added latency, for any data persisted prior to the incident. No data was lost; messages queued up until the writers were able to catch up.

The incident was resolved by restarting the Storage processes. It concluded at about 21:58 UTC (13:58 PST).

Mitigation Steps and Recommended Future Preventative Measures

We have updated our code to prevent the errors caused by deleted records in the distributed data storage.

Posted Feb 11, 2021 - 23:18 UTC

Resolved

This incident has been resolved.

Posted Jan 05, 2021 - 23:05 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jan 05, 2021 - 22:10 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Jan 05, 2021 - 22:10 UTC

Update

We are continuing to investigate this issue.

Posted Jan 05, 2021 - 22:09 UTC

Investigating

Around 21:29 UTC (13:29 PST), customers in our Southern Asia PoP may have experienced delays between the time messages were published and the time they were written to storage. Read requests for data that was already persisted were also delayed. All messages were stored by 21:56 UTC (13:56 PST), and read latencies recovered by 21:58 UTC (13:58 PST).

Posted Jan 05, 2021 - 22:09 UTC

This incident affected: Realtime Network (Storage and Playback Service) and Points of Presence (Southern Asia Points of Presence).