Issues accessing editor and/or live sites

Incident Report for Mono Solutions

Postmortem

Summary

Following an earlier incident on the same day, a database instance at our back-ends in Frankfurt became unavailable. This caused the back-ends to return 404s, which were subsequently cached at our CDNs. Once the issue was identified, the database was synced with the master and the back-end service was restored. A number of websites required a cache clear so that the cached 404s were dropped and the correct content could be cached.

Impact

Editor inaccessible in Frankfurt
Non-cached sites responding with 404s in Frankfurt

Root Causes

The Redis slave instance in our VPC in Frankfurt was unresponsive.

Trigger

Master Redis instance was unavailable earlier.

Resolution

Manually restarted the service.

Action Items

Alert when the Redis slave is not available or not in sync with master.
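
As an illustration of the action item above, here is a minimal sketch of such a check. It is not our production monitoring. It assumes the text output of the Redis `INFO replication` command (for example from `redis-cli INFO replication`) is collected from the slave by some external means. The function names and the 30-second threshold are hypothetical. The field names (`role`, `master_link_status`, `master_sync_in_progress`, `master_last_io_seconds_ago`) are the ones Redis reports for a slave.

```python
from typing import Dict, List, Optional


def parse_info(text: str) -> Dict[str, str]:
    """Parse the key:value lines of a Redis INFO section into a dict."""
    info: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Replication".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key] = value
    return info


def replica_alerts(info_text: Optional[str],
                   max_io_age_seconds: int = 30) -> List[str]:
    """Return a list of alert messages; an empty list means healthy.

    info_text is the raw `INFO replication` output from the slave,
    or None if the slave could not be reached at all.
    The threshold is an illustrative assumption, not a recommendation.
    """
    if info_text is None:
        return ["replica unreachable"]

    info = parse_info(info_text)
    if info.get("role") != "slave":
        return [f"unexpected role: {info.get('role', 'unknown')}"]

    alerts: List[str] = []
    if info.get("master_link_status") != "up":
        alerts.append("replication link to master is down")
    if info.get("master_sync_in_progress") == "1":
        alerts.append("resync with master in progress")

    # Redis reports -1 here while the link to the master is down.
    last_io = int(info.get("master_last_io_seconds_ago", "-1"))
    if last_io < 0 or last_io > max_io_age_seconds:
        alerts.append(f"no recent data from master (last I/O: {last_io}s)")
    return alerts
```

A scheduled job could call `replica_alerts` with the latest output and page the on-call engineer whenever the returned list is non-empty.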

Lessons learned

We need better alerting for Redis availability issues.

Timeline GMT+2

2021/06/28 12:28 CEST First support reports

2021/06/28 12:28 CEST Engineering engaged on the incident

2021/06/28 12:34 CEST Incident announced on status page

2021/06/28 13:31 CEST Root cause identified

2021/06/28 13:39 CEST Back-end service restored

2021/06/28 13:46 CEST Status page updated

2021/06/28 13:46 CEST Affected sites still required manual cache clearing as they were reported

2021/06/28 16:32 CEST Incident resolved

Posted Jun 30, 2021 - 12:20 CEST

Resolved

This incident has been resolved.

Posted Jun 28, 2021 - 16:32 CEST

Update

We have had networking issues that caused our Redis master-slave replication to fail. The issue has now been identified and fixed, and we are in the process of updating missing data entries.

Posted Jun 28, 2021 - 14:04 CEST

Monitoring

A fix has been implemented and we are monitoring the service. Both sites and the editor should be becoming operational again.
We will provide a postmortem once everything has been resolved.

Posted Jun 28, 2021 - 13:46 CEST

Investigating

We are currently investigating this issue.

Posted Jun 28, 2021 - 12:34 CEST

This incident affected: Website and Editor.