On October 28, between 02:15 and 04:55 UTC, the /alerts API endpoint slowed from its usual 50 ms response time to 500 ms, which caused alerts in the notification center to load slowly. Our initial response was to add processing capacity to compensate for the slow responses; only at 04:11 UTC did we trace the cause to a stale MongoDB secondary on one of our database clusters.
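For context, a "stale" secondary is one whose replication has fallen behind the primary. Below is a minimal sketch of the kind of lag check that surfaces this condition via replSetGetStatus; the host names and the 30-second threshold are illustrative, not our production values.

```python
# Illustrative only: surfacing replication lag that marks a secondary as
# stale. Host names and the lag threshold are hypothetical.
from datetime import timedelta

from pymongo import MongoClient

MAX_ACCEPTABLE_LAG = timedelta(seconds=30)  # hypothetical threshold

client = MongoClient("mongodb://db1.example.com,db2.example.com/?replicaSet=rs0")
status = client.admin.command("replSetGetStatus")

# Use the primary's last-applied optime as the reference point.
primary_optime = next(
    m["optimeDate"] for m in status["members"] if m["stateStr"] == "PRIMARY"
)

for member in status["members"]:
    if member["stateStr"] != "SECONDARY":
        continue
    lag = primary_optime - member["optimeDate"]
    if lag > MAX_ACCEPTABLE_LAG:
        print(f"{member['name']} is stale: {lag.total_seconds():.0f}s behind primary")
```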
At 02:17 UTC we started receiving a large number of alerts triggering on various parts of our infrastructure. After initial troubleshooting and alert consolidation, we determined that the API was suffering a slowdown affecting several services, which we posted at 03:11 UTC once the secondary on-call engineer had confirmed it. Soon after, we isolated /alerts as the affected API endpoint.
The faulty server has since been removed from the cluster, and we will be reworking the code that reads from a stale secondary; one possible direction is sketched below. We have also changed our public status policy: we now post as soon as a service-affecting alert triggers, instead of waiting to confirm the cause or to have a second engineer's analysis beforehand. Follow-up updates then confirm or correct the initial post.
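As a minimal sketch of that rework using pymongo's read preferences: rather than accepting reads from any secondary, we can bound the staleness the driver will tolerate. The database and collection names here are hypothetical, and max_staleness has a MongoDB-imposed minimum of 90 seconds.

```python
# Hypothetical sketch of the rework: bound how stale a secondary read may
# be instead of reading from any secondary unconditionally. The database
# and collection names are made up; MongoDB requires max_staleness >= 90s.
from pymongo import MongoClient
from pymongo.read_preferences import SecondaryPreferred

client = MongoClient("mongodb://db1.example.com,db2.example.com/?replicaSet=rs0")

# With max_staleness, the driver excludes secondaries estimated to be more
# than 90 seconds behind the primary when selecting a server for reads,
# falling back to the primary if no secondary qualifies.
alerts = client.notifications.get_collection(
    "alerts",
    read_preference=SecondaryPreferred(max_staleness=90),
)

recent_alerts = alerts.find({"acknowledged": False}).limit(50)
```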