Post-Mortem: Elevated Error Rates on Channels mt1 Cluster

On Thursday, March 24, 2022, at 11:55 UTC, we saw an increase in HTTP 500 errors in the Channels REST API. The incident lasted 40 minutes, during which some requests for new messages failed.

Incident timeline

(All times are listed in UTC on the 24th of March 2022)

At 11:50, we started observing warnings related to latency.

At 11:55, we noticed an increase in the error rate on the Channels API. Our incident responders raised an incident, and we saw an increase in CPU utilisation on our Redis nodes.

At 12:45, the system was stable again.

What was the impact on end-users?

Between 11:50 and 12:30 UTC, users may have experienced:

Failure to publish a message
Increased connection time

What was the root cause?

A few days before the incident, we significantly scaled up the MT1 Kubernetes cluster, which increased load on the Kubernetes control plane. To manage this, our engineers made additional changes to the horizontal and vertical scale controllers and to the API server. In the meantime, we decided to keep the old socket VMs running alongside the new setup so that we could shift traffic back to those instances if any problems arose.

On March 24, as traffic on the MT1 cluster increased, Kubernetes scheduled more socket pods to accommodate the load, while all of the older socket virtual machines continued running. The combination of new socket pods and legacy VMs put extra load on our Redis cluster, which raised CPU utilisation on the Redis nodes and caused latency problems.

To remove the extra load from the Redis cluster, we shut down some of the legacy socket VMs. We kept the rest running as part of our migration strategy.
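
The load arithmetic behind this can be sketched in a few lines of Python. This is an illustration only, not our production tooling: the names `SocketFleet`, `total_redis_connections` and `over_budget`, the per-instance connection counts, and the budget are all hypothetical, and the sketch assumes that Redis load grows roughly in proportion to the number of socket processes connected to it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SocketFleet:
    """A group of socket processes sharing one Redis cluster (hypothetical model)."""

    name: str
    instances: int
    redis_connections_per_instance: int  # assumed figure, not a measured value

    @property
    def redis_connections(self) -> int:
        # Total connections this fleet opens against Redis.
        return self.instances * self.redis_connections_per_instance


def total_redis_connections(fleets: list[SocketFleet]) -> int:
    """Sum the Redis connections opened by every fleet."""
    return sum(fleet.redis_connections for fleet in fleets)


def over_budget(fleets: list[SocketFleet], budget: int) -> bool:
    """Return True when the combined fleets exceed the connection budget."""
    return total_redis_connections(fleets) > budget


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the effect of running
    # autoscaled pods and the full legacy fleet at the same time.
    budget = 1000
    pods = SocketFleet("k8s-socket-pods", instances=80, redis_connections_per_instance=10)
    all_vms = SocketFleet("legacy-socket-vms", instances=40, redis_connections_per_instance=10)
    some_vms = SocketFleet("legacy-socket-vms", instances=15, redis_connections_per_instance=10)

    print(over_budget([pods, all_vms], budget))   # pods plus every legacy VM
    print(over_budget([pods, some_vms], budget))  # pods plus a reduced legacy fleet
```

Under these assumed numbers, the pods and the full legacy fleet together exceed the budget, while the pods and a reduced legacy fleet stay within it. The point is that an autoscaled fleet and a fixed fleet draw on the same shared dependency, so they need to be sized together.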

How will we ensure this doesn’t happen again?

We have added capacity to the MT1 cluster both vertically and horizontally. The MT1 cluster now routes most of its traffic to our new infrastructure, which has been made more scalable and resilient.

Posted Apr 12, 2022 - 17:01 UTC

Resolved

We have identified and resolved the cause of this issue. A post-mortem will follow.

Posted Mar 24, 2022 - 13:59 UTC

Investigating

We are currently investigating this issue.

Posted Mar 24, 2022 - 12:43 UTC

This incident affected: Channels REST API.