High write latency and other intermittent issues related to underlying infrastructure stability

Incident Report for Grafana Cloud

Resolved

The unplanned node restarts have stopped and everything has been stable for 2 hours. We are therefore marking this incident resolved and will continue working with our hosting provider to understand the root cause.

Posted Aug 11, 2020 - 20:45 UTC

Update

The underlying situation is still ongoing; however, the clusters have been reconfigured and additional resources added to improve stability.

Posted Aug 11, 2020 - 17:35 UTC

Update

Node reboots have resumed; we are still working with the hosting provider. Some data loss has occurred on the `https://logs-prod2.us-central1.grafana.net` cluster.

Edit: an incorrect URL was originally provided. It has been corrected to `https://logs-prod2.us-central1.grafana.net`.

Posted Aug 11, 2020 - 13:35 UTC

Update

We are still working with our hosting provider to better understand the node reboots. At this point, no reboots have been observed for an hour and the clusters are stable.

Posted Aug 11, 2020 - 13:02 UTC

Update

The underlying cause has been traced to unexpected node reboots in the cluster. The source of this problem is under investigation, and we are still seeing intermittent node reboots, which continue to surface as higher latency on some requests.

Posted Aug 11, 2020 - 11:46 UTC

Monitoring

The source of the latency has been traced to a network interruption within the us-central1 region. The interruption has cleared and all operations have returned to normal. We will continue to monitor.

Posted Aug 11, 2020 - 10:58 UTC

Investigating

We are investigating a large spike in write latency on both the `https://logs-prod-us-central1.grafana.net` and `https://logs-prod2.us-central1.grafana.net` endpoints.

Edit: incorrect URLs were originally provided. They have been corrected to `https://logs-prod-us-central1.grafana.net` and `https://logs-prod2.us-central1.grafana.net`.

Posted Aug 11, 2020 - 10:45 UTC

This incident affected: Grafana Cloud: Loki (GCP US Central - prod-us-central-0).