Grafana Cloud Status - Metrics Disruption

Metrics Disruption - cortex-prod-10 cluster

Incident Report for Grafana Cloud

Resolved

Engineering has identified and fully remediated the issue. We have observed complete recovery in the cortex-prod-10 cluster and backfilling of metrics is complete. At this time, we are considering this incident resolved.

No further updates.

Posted Mar 10, 2023 - 20:59 UTC

Update

We continue to observe a healthy cluster state for the cortex-prod-10 cluster and metrics backfill is proceeding as expected. Users should no longer experience failed metric queries or failed remote write actions.

Engineering is continuing action and investigation. We will continue to monitor the state of the cluster closely and continue to provide updates.

Posted Mar 10, 2023 - 20:16 UTC

Monitoring

As of 19:29 UTC, Engineering has applied mitigation efforts to the cortex-prod-10 cluster and we are seeing improvements with metric queries and remote write actions. As the cluster health improves, metrics from the affected time period will begin backfilling.

Engineering is continuing action and investigation. We will monitor the state of the cluster closely and continue to provide updates.

Posted Mar 10, 2023 - 19:36 UTC

Investigating

As of 18:45 UTC, we observed metric disruption in the cortex-prod-10 cluster, only. Some customers in this cluster may have experienced failed metric queries or failed remote write actions, likely manifesting as 500 errors.

Engineering is aware and actively engaged in investigation. We will provide updates as information is shared.

Posted Mar 10, 2023 - 18:55 UTC

This incident affected: Grafana Cloud: Prometheus (GCP US Central - prod-us-central-0: Querying).