Beams increased error rates
Incident Report for Pusher
Postmortem

On Friday, October 8th, 2021 at 00:54 UTC, Beams started receiving frequent high fan-out notification publish requests. CPU usage on the main database instance soon reached 100%, which caused a portion of API requests to fail. Engineers reduced the load on the database, and the frequent publish requests stopped at around the same time. Together, these ended the incident and the system returned to normal. Delivery of notifications for some of the accepted publish requests was delayed by up to 2.5 hours.

Having investigated the root cause of the incident, we have planned mitigations to reduce the load that high fan-out publish requests place on the database. We also plan to limit the allowed rate of publish requests, to prevent a recurrence of this issue.
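
For readers unfamiliar with rate limiting, the following is a minimal sketch of one common approach, a token bucket. It is an illustration only and does not describe how Beams implements or will implement its limits. The class name `TokenBucket`, its parameters, and the idea of charging a larger `cost` for a higher fan-out request are all assumptions made for this example.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical limiter: refills `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock        # injectable so the limiter can be tested
        self.tokens = capacity    # start with a full bucket
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if the request may proceed."""
        now = self.clock()
        # Refill according to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In this sketch, a caller would check `allow()` before accepting a publish request and reject the request when it returns `False`. Passing a larger `cost` for a request that targets many recipients is one possible way to make high fan-out requests consume more of the budget.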

Timeline

All times are in UTC on the 8th of October 2021:

  • 01:13 UTC Our Engineering on-call team was paged because a percentage of API requests to Beams were failing. The team immediately started an internal investigation.
  • 01:54 UTC After analysing the Pusher Beams monitoring stack and noticing high CPU usage on the main database instance, the team tuned the system to reduce the load on the database.
  • 02:02 UTC The system went back to normal.
Posted Oct 12, 2021 - 13:29 UTC

Resolved
The issue is resolved and the system is back to normal.
Posted Oct 08, 2021 - 04:04 UTC
Investigating
We are currently investigating this issue.
Posted Oct 08, 2021 - 01:38 UTC
This incident affected: Beams.