At 00:16 UTC on April 4, 2021, we observed increased Presence latency, which resulted in increased timeouts and degraded channel join times. We determined the root cause to be a lack of proper alerting on the database in the critical path of the Presence service, which meant the database was not scaled appropriately. During the incident, the database was re-tuned repeatedly to ease the load, and Presence latencies returned to normal levels by 03:08 UTC; however, the database component retained a large processing backlog. Further tuning changes (along with some server process restarts) cleared the backlog by 04:20 UTC, and all services returned to normal.
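To make the alerting gap concrete, below is a minimal sketch of the kind of threshold check that would have paged us as the database approached saturation. It assumes a hypothetical metrics backend and paging hook, and the metric names and thresholds shown are placeholders, not our production configuration.

```python
# Illustrative only: a periodic check that pages when the database in the
# Presence critical path shows sustained load. Metric names, thresholds,
# and the metrics/paging hooks are hypothetical placeholders.
import time

LATENCY_P99_MS_THRESHOLD = 250   # hypothetical paging threshold
BACKLOG_THRESHOLD = 10_000       # hypothetical queued-work threshold
CHECK_INTERVAL_SECONDS = 60


def fetch_metric(name: str) -> float:
    """Stand-in for querying a real metrics backend."""
    return 0.0  # placeholder value


def page_oncall(message: str) -> None:
    """Stand-in for a real paging integration."""
    print(f"PAGE: {message}")


def check_presence_db() -> None:
    latency_p99 = fetch_metric("presence_db_query_latency_p99_ms")
    backlog = fetch_metric("presence_db_processing_backlog")
    if latency_p99 > LATENCY_P99_MS_THRESHOLD:
        page_oncall(f"Presence DB p99 latency {latency_p99:.0f} ms over threshold")
    if backlog > BACKLOG_THRESHOLD:
        page_oncall(f"Presence DB backlog of {backlog:.0f} items over threshold")


if __name__ == "__main__":
    while True:
        check_presence_db()
        time.sleep(CHECK_INTERVAL_SECONDS)
```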
The underlying issues in this incident were database tuning and alerting. In the short term, we have adopted a tuning strategy that handles the current traffic volume without adversely affecting performance. In addition, a new caching implementation will be rolled out to reduce the impact of traffic bursts on the database. In the near future, we will also more aggressively resize the database to a larger cluster to support increased demand on the Presence service.
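The details of the new caching layer are not covered here; as a rough illustration of the idea, the sketch below shows a read-through cache with a short TTL in front of Presence reads, so a burst of identical requests results in a single database query. All names and the TTL value are hypothetical.

```python
# Illustrative only: a read-through cache with a short TTL in front of the
# Presence database, so bursts of identical reads are served from memory
# instead of hitting the database. Names and values are hypothetical.
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    def __init__(self, ttl_seconds: float, fetch: Callable[[str], dict]):
        self.ttl = ttl_seconds
        self.fetch = fetch                       # falls through to the database
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> dict:
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]                      # fresh cached value, no DB read
        value = self.fetch(key)                  # single database read
        self._entries[key] = (now, value)
        return value


def load_presence_from_db(user_id: str) -> dict:
    """Stand-in for the real database query."""
    return {"user_id": user_id, "status": "online"}


presence_cache = TTLCache(ttl_seconds=5.0, fetch=load_presence_from_db)

# A burst of reads for the same user results in one database query
# until the 5-second TTL expires.
for _ in range(1000):
    presence_cache.get("user-123")
```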