Querying and alerting outage due to underlying cloud provider failure

Incident Report for Honeycomb

Resolved

Our underlying AWS us-east-1 dependencies have recovered and all systems at Honeycomb continue to be stable.

Posted Dec 07, 2021 - 18:13 PST

Update

We are continuing to monitor the recovery of our underlying AWS us-east-1 dependencies.

Posted Dec 07, 2021 - 15:42 PST

Monitoring

We have worked around the underlying cloud infrastructure issue, and alerting is now healthy. All systems at Honeycomb are now operational and we are monitoring cloud provider performance.

Posted Dec 07, 2021 - 15:07 PST

Identified

We are still experiencing issues with Trigger and SLO Burn alert evaluation; approximately 7 out of 10 evaluations are failing. We are working to mitigate this failure.

Posted Dec 07, 2021 - 14:38 PST

Monitoring

We are seeing improvement in the underlying cloud infrastructure in us-east-1, and all telemetry, queries and alerts appear to be flowing nominally.

Posted Dec 07, 2021 - 13:53 PST

Update

Event ingest and querying are working normally. Trigger and SLO-based alerting appears to be recovering. CloudWatch metrics are currently not flowing to Honeycomb. We are continuing to monitor the situation.

Posted Dec 07, 2021 - 13:38 PST

Update

CloudWatch ingest is now delayed by 15 minutes rather than down. Querying appears to be recovering. Alerts are failing at a lower rate, but are still not running successfully every minute.

Posted Dec 07, 2021 - 12:34 PST

Update

We are continuing to work on a fix for this issue.

Posted Dec 07, 2021 - 11:36 PST

Update

We are still successfully ingesting all telemetry data except CloudWatch metrics. Any queries that fail with "unable to store results file" can safely be retried. Triggers continue to be partially degraded.

Posted Dec 07, 2021 - 11:34 PST
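
For anyone scripting queries during this period, the retry guidance above can be sketched roughly as follows. This is a minimal illustration, not Honeycomb client code: `run_with_retry`, `is_transient`, and the `run_query` callable are hypothetical names, and the sketch assumes the error text quoted in these updates appears in the raised exception's message.

```python
import time

# Error strings quoted in the incident updates; matching on message text
# is an assumption made for this sketch.
TRANSIENT_MARKERS = ("unable to store results file", "failed to save results")


def is_transient(error: Exception) -> bool:
    """Return True if the error message matches a known retryable failure."""
    message = str(error).lower()
    return any(marker in message for marker in TRANSIENT_MARKERS)


def run_with_retry(run_query, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call run_query(), retrying transient failures with exponential backoff.

    run_query is a hypothetical zero-argument callable that issues the query
    and raises on failure. Non-transient errors are re-raised immediately.
    """
    for attempt in range(attempts):
        try:
            return run_query()
        except Exception as error:
            is_last_attempt = attempt == attempts - 1
            if is_last_attempt or not is_transient(error):
                raise
            # Wait 1x, 2x, 4x, ... base_delay before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

Backing off between attempts avoids adding load while the underlying storage is still recovering; the attempt count and delay here are arbitrary defaults.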

Update

Kinesis ingest from our AWS CloudWatch integration for metrics is impaired. Otherwise, impact appears to have stabilized, and we continue to await recovery of underlying AWS services.

Posted Dec 07, 2021 - 10:24 PST

Update

We have verified that event ingest into Honeycomb is working correctly. Some users querying Honeycomb may see "failed to save results" errors; these queries will succeed if retried. Trigger and SLO alerting is functioning intermittently.

Posted Dec 07, 2021 - 09:35 PST

Identified

This issue is due to an AWS outage; we've reached out to them.

Posted Dec 07, 2021 - 08:02 PST

Investigating

Triggers and SLOs are not currently alerting. We believe we've identified the cause and are working on a fix. Data ingest and querying should be unaffected.

Posted Dec 07, 2021 - 07:59 PST

This incident affected: api.honeycomb.io - US1 Event Ingest, ui.honeycomb.io - US1 Querying, and ui.honeycomb.io - US1 Trigger & SLO Alerting.