US-East PoP may have experienced delays with messages being written to Storage

Incident Report for PubNub

Postmortem

Problem Description, Impact, and Resolution

On Jan 11, 2021, at 20:45 UTC, messages published in our US-East PoP stopped being written to Storage, so those messages were not available in history during the incident. The delay resulted from slow database writes, which were triggered by a problem with the way deleted records are handled in the distributed data store in some scenarios. We restarted the process, and the issue was resolved at 22:23 UTC. During the incident, Storage reads were successful for any data persisted prior to the incident, and no data was lost: all messages published during the incident were queued until the writers were able to catch up.

Mitigation Steps and Recommended Future Preventative Measures

We have updated the code to prevent the errors caused by deleted records in the distributed data store.

Posted Jan 26, 2021 - 18:59 UTC

Resolved

This incident has been resolved.

Posted Jan 11, 2021 - 23:43 UTC

Monitoring

A fix was implemented at 22:23 UTC, and we are monitoring the results for the next hour.

Posted Jan 11, 2021 - 22:30 UTC

Update

Latencies have recovered in US West, and we are also catching up on writing the queued published messages to storage.

Posted Jan 11, 2021 - 22:16 UTC

Update

We are continuing to investigate this issue and are seeing elevated latencies in US West and EU Central.

Posted Jan 11, 2021 - 22:03 UTC

Identified

The issue has been identified and we are working on a fix.

Posted Jan 11, 2021 - 21:29 UTC

Investigating

Starting around 20:44 UTC (12:44 PST), customers in our US-East PoP may have experienced delays between the time messages were published and the time they were written to storage.

Posted Jan 11, 2021 - 21:25 UTC

This incident affected: Points of Presence (North America Points of Presence) and Realtime Network (Storage and Playback Service).