On Friday, October 8th, 2021 at 00:54 UTC, Beams started receiving frequent high fan-out notification publish requests. CPU utilization on the main database instance soon reached 100%, which caused a portion of API requests to fail. Engineers reduced the load on the database, and the frequent publish requests stopped at around the same time, which ended the incident and returned the system to normal. Notification delivery for some of the accepted publish requests was delayed by up to 2.5 hours.
Having investigated the root cause of the incident, we have planned mitigations to reduce the load that high fan-out publish requests place on the database. We will also limit the allowed rate of publish requests to prevent a recurrence of this issue.
All times are in UTC on the 8th of October 2021: