Intermittent ingestion and query issues due to underlying infrastructure stability

Incident Report for Grafana Cloud

Resolved

The unplanned node restarts have stopped and everything has been stable for 2 hours. We are therefore marking this incident as resolved and will continue working with our hosting provider to understand the root cause.

Posted Aug 11, 2020 - 20:45 UTC

Monitoring

The underlying situation is still ongoing; however, the clusters have been reconfigured and additional resources added to improve stability.

Posted Aug 11, 2020 - 17:56 UTC

Investigating

We're seeing our underlying infrastructure's nodes restart randomly, causing minor blips in the read and write path. We anticipate no data loss, and any failed requests will be retried. If you're seeing sustained issues, please let us know ASAP.

We're working with our infrastructure provider to identify and rectify the cause.

Posted Aug 11, 2020 - 13:47 UTC

This incident affected: Grafana Cloud: Prometheus (GCP US Central - prod-us-central-0: Querying, GCP US Central - prod-us-central-0: Ingestion).