API slowdown
Incident Report for Server Density
Postmortem

On October 28, between 02:15 and 04:55 UTC, the /alerts API endpoint slowed down from the usual 50ms to 500ms, which caused slower loading of the notification center alerts. Our initial response was to add processing capacity to compensate for the observed slow responses, and only at 04:11 did we identify the cause as a stale MongoDB server in one of our database clusters.

At 02:17 UTC we started receiving a large number of alerts triggering on various parts of our infrastructure. After the initial troubleshooting and alert consolidation, we saw that the API was suffering a slowdown affecting several services, which we posted at 03:11 UTC once the secondary on-call engineer had confirmed it. Soon after, we isolated /alerts as the affected API endpoint.

The faulty server has since been removed from the cluster, and we will be reworking the code that relies on a stale secondary. We have also modified our public status policy: we now post immediately as soon as a service-affecting alert triggers, instead of delaying to confirm the cause beforehand or to have a second engineer analyze it. Subsequent updates will deliver confirmation or correction of the initial post.
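As an illustration of the kind of change this involves, the sketch below shows how a PyMongo client can cap how stale a secondary is allowed to be before reads are routed to a fresher replica set member. The connection string, database and collection names, and the 90-second bound are placeholders for the example rather than our actual configuration or the exact fix we will ship.

```python
from pymongo import MongoClient
from pymongo.read_preferences import SecondaryPreferred

# Placeholder connection string and replica set name.
client = MongoClient(
    "mongodb://db1.example.com,db2.example.com,db3.example.com/?replicaSet=rs0"
)

# Allow reads from a secondary only if its replication lag is within 90 seconds
# of the primary; otherwise the driver routes the query to a fresher member.
alerts_db = client.get_database(
    "monitoring",  # placeholder database name
    read_preference=SecondaryPreferred(max_staleness=90),
)

# Example query against a placeholder collection backing the notification center.
recent_alerts = alerts_db.alerts.find().sort("created", -1).limit(50)
```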

Posted Nov 29, 2016 - 19:40 GMT

Resolved
This is now fully resolved. We'll be publishing a postmortem in the next few days.
Posted Oct 28, 2016 - 08:48 BST
Monitoring
The capacity added to this endpoint has returned its response time to normal. We will be monitoring the situation and will confirm the root cause next.
Posted Oct 28, 2016 - 05:23 BST
Update
We have identified /alerts as the affected API endpoint. This causes a slowdown when loading parts of the dashboard, namely the notification center.
We're proceeding to identify the cause and have added more capacity to that endpoint to mitigate the issue.
Posted Oct 28, 2016 - 04:46 BST
Investigating
We're currently investigating a slowdown with one of our API endpoints.
Posted Oct 28, 2016 - 04:11 BST