At 15:42 UTC on 2/9/2021 and again at 16:36 UTC on 2/10/2021, we observed latency and errors in the Presence API. This had a cascading effect on some customer implementations of Presence, resulting in a spike in traffic that briefly delayed message writes to Storage and the sending of Push notifications. We determined that the root cause was a database in the critical path for the Presence service reaching its capacity limits. To mitigate the immediate issue, we restarted impacted processes to clear the backlog and added capacity in affected regions. We saw Presence operating normally at 16:12 UTC, and the issue was fully resolved at 17:32 UTC. Directly following this incident, we accelerated work that was already in progress to upgrade the database in the critical path, which is a multi-week process.
Unfortunately, we saw the same issue recur at 16:40 UTC on 2/10/2021. Only Presence was impacted during this incident; the Subscribe, Storage, and Push services were not affected. We again restarted impacted processes to clear the backlog and saw Presence operating normally at 18:10 UTC. The incident recurred because we had not yet completed short-term optimizations on the database.
Mitigation Steps and Recommended Future Preventative Measures
To prevent a similar issue in the future, we are upgrading the database for our Presence services, work that will continue over the next few weeks. In the meantime, we have made some optimizations to the Presence service, as well as to our monitoring and processes, so that we address systems that would be impacted before they reach a critical state.