Loki Write Errors
Incident Report for Grafana Cloud
Resolved
This incident should have been marked resolved with the last update.
Posted Jan 23, 2021 - 22:10 UTC
Update
The scaling event has concluded. Additional protections will be put in place to ensure that scaling does not exceed the rate at which the cluster can rebalance, to avoid such interruptions in the future.
Posted Jan 23, 2021 - 21:38 UTC
Monitoring
During a scale-down of the Loki cluster at endpoint logs-prod-us-central1.grafana.net, too many ingesters were scaled down at one time, causing writes to fail for some log streams while the cluster rebalanced. Only a portion of the log streams handled by the cluster was impacted.

For a 5-minute window from 21:03 UTC to 21:08 UTC, some customers may have noticed writes failing with 500 errors. All failed writes should have been retried under the default retry settings for Promtail and/or the Grafana Cloud Agent, so no data loss is expected.
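
For customers who push logs with a custom client rather than Promtail or the Grafana Cloud Agent, the retry behavior described above can be approximated as follows. This is a minimal sketch, not the actual Promtail or Agent implementation: the function name push_with_retry, the send callback, and the backoff values are all illustrative assumptions.

```python
import time
from typing import Callable


def push_with_retry(
    send: Callable[[], int],
    min_backoff: float = 0.5,     # seconds; illustrative value (assumption)
    max_backoff: float = 300.0,   # seconds; illustrative value (assumption)
    max_retries: int = 10,        # illustrative value (assumption)
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call send() until it succeeds or the retry budget is exhausted.

    send is a hypothetical callback that performs one push request and
    returns the HTTP status code. Returns True if the batch was accepted.
    """
    backoff = min_backoff
    for attempt in range(max_retries):
        status = send()
        if 200 <= status < 300:
            return True
        # Only server errors (5xx) and rate limiting (429) are retried;
        # other client errors will not succeed on a second attempt.
        if status != 429 and status < 500:
            return False
        if attempt < max_retries - 1:
            sleep(backoff)
            # Double the wait after each failure, up to the cap.
            backoff = min(backoff * 2, max_backoff)
    return False
```

With a retry budget that spans more than the 5-minute window above, a batch that first failed with a 500 would be resent once the cluster finished rebalancing.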

The issue has cleared but will be monitored for a few more minutes.
Posted Jan 23, 2021 - 21:30 UTC
This incident affected: Grafana Cloud: Loki (GCP US Central - prod-us-central-0).