Degraded performance in US data center
Incident Report for Cronofy
Postmortem

On Tuesday, February 22nd 2022 our US data center experienced 95 minutes of degraded performance between 15:45 and 17:20 UTC.

This was caused by the primary PostgreSQL database hitting bandwidth limits and having its performance throttled as a result. The problem was caused, or at least exacerbated, by PostgreSQL's internal housekeeping working on two of our largest tables at the same time.
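
For readers who run PostgreSQL themselves, the sketch below shows one way to observe this kind of concurrent housekeeping as it happens, by querying the built-in pg_stat_progress_vacuum view. It is an illustrative Python example with placeholder connection details, not a record of our own tooling.

    # Sketch: list in-flight (auto)vacuum work on a PostgreSQL cluster.
    # Connection details are placeholders; pg_stat_progress_vacuum is a
    # standard view in PostgreSQL 9.6 and later.
    import psycopg2

    conn = psycopg2.connect("dbname=app host=db.example.internal user=monitor")
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT p.pid,
                   c.relname,
                   p.phase,
                   p.heap_blks_scanned,
                   p.heap_blks_total
            FROM pg_stat_progress_vacuum AS p
            JOIN pg_class AS c ON c.oid = p.relid
            ORDER BY p.heap_blks_total DESC
            """
        )
        for pid, relname, phase, scanned, total in cur.fetchall():
            pct = 100 * scanned / total if total else 0
            print(f"pid={pid} table={relname} phase={phase} progress={pct:.0f}%")
    conn.close()

Seeing rows for more than one of the largest tables at the same time would match the pattern described above.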

To our customers this would have surfaced as interactions with the US Cronofy platform, i.e. using the website or API, being much slower than normal. For example, the 99th percentile of API response times is usually around 0.5 seconds; during this incident it peaked at around 14 seconds.

We have upgraded the underlying instances of this database, broadly doubling capacity and putting us far from the limit we were hitting.

Timeline

All times are UTC on Tuesday, February 22nd 2022, and are approximate for clarity.

15:45 Our primary database in our US data center started showing signs of some performance degradation.
16:05 First alert received by the on-call engineer for a potential performance issue. Attempts made to reduce load on the database through interventions such as temporarily disabling some of its background housekeeping processes (see the sketch after this timeline).
16:45 Incident opened on our status page informing customers of degraded performance in the US data center.
17:00 Began provisioning more capacity for the primary database as a fallback plan if efforts continued to be unsuccessful.
17:10 New capacity available.
17:15 Failed over to fully take advantage of the new capacity by promoting the larger node to be the writer.
17:20 Performance had returned to normal levels in the US data center.
17:45 Decided we could close the incident.
18:00 Decided to lock in the capacity change and provisioned an additional reader node at the new size.
18:15 Removed the smaller nodes from the database cluster.
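
As an illustration of the kind of intervention referenced at 16:05, PostgreSQL allows autovacuum to be paused on an individual table via a per-table storage parameter. The sketch below is purely illustrative, with a placeholder table name, and is not a record of the exact commands we ran.

    # Sketch: temporarily pause autovacuum on one table to relieve pressure,
    # then restore the default once things have recovered. "events" is a
    # placeholder table name used for illustration only.
    import psycopg2

    conn = psycopg2.connect("dbname=app host=db.example.internal user=admin")
    with conn, conn.cursor() as cur:
        cur.execute("ALTER TABLE events SET (autovacuum_enabled = false)")

    # ... later, once the database has recovered ...
    with conn, conn.cursor() as cur:
        cur.execute("ALTER TABLE events RESET (autovacuum_enabled)")
    conn.close()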

Actions

Whilst there was not an outage, this felt like a close call for us. This led to three key questions:

  • Why had we not foreseen this capacity issue?
  • Could the capacity issue have been prevented?
  • Why had we not resolved the issue sooner?

Foreseeing the capacity issue

We had recently performed a major version upgrade on this database, and in the weeks that followed we monitored its performance closely. If ever there was a time we should have spotted an approaching capacity issue, it was then.

We believe we may have focussed too heavily on CPU and memory metrics in our monitoring, whereas it was networking capacity that led to this degradation in performance. We will be reviewing our monitoring to add alerts that would have pointed us in the right direction sooner, as well as lower-priority alerts that would flag an upcoming capacity issue days or weeks in advance.
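
As an illustration of the lower-priority, early-warning check we have in mind, the sketch below compares recent daily peaks in network throughput against the node's limit and flags shrinking headroom. The limit, threshold, and sample values are hypothetical placeholders rather than our actual monitoring configuration.

    # Sketch: flag an approaching network capacity limit well before it is hit.
    # The limit, warning threshold, and sample values are illustrative only;
    # in practice the daily peaks would come from the monitoring system.
    INSTANCE_LIMIT_MBPS = 2_000   # hypothetical bandwidth limit for the node
    WARN_FRACTION = 0.7           # raise a low-priority alert at 70% of the limit

    def network_headroom_warning(daily_peaks_mbps: list[float]) -> str | None:
        """Return a warning if recent peak throughput is close to the limit."""
        if not daily_peaks_mbps:
            return None
        worst = max(daily_peaks_mbps)
        if worst >= WARN_FRACTION * INSTANCE_LIMIT_MBPS:
            return (
                f"Peak network throughput {worst:.0f} Mbps is "
                f"{worst / INSTANCE_LIMIT_MBPS:.0%} of the node's limit; "
                "consider adding capacity before it becomes an incident."
            )
        return None

    # Example: a week of daily peaks trending towards the limit.
    print(network_headroom_warning([1_100, 1_250, 1_300, 1_450, 1_600]))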

Preventing the capacity issue

As PostgreSQL's internal housekeeping processes appeared to contribute significantly to the problem, we will be revisiting the configuration of these processes and seeing whether they can be altered to reduce the likelihood of such an impact in future.
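
One option we will be evaluating is PostgreSQL's per-table autovacuum storage parameters, which can throttle how much I/O vacuum performs on the largest tables in any given interval. The sketch below is a hypothetical example of that kind of change; the table names and values are placeholders, not settings we have adopted.

    # Sketch: throttle autovacuum I/O on specific large tables using
    # PostgreSQL's per-table storage parameters. Table names and values
    # are placeholders for illustration only.
    import psycopg2

    LARGE_TABLES = ["events", "calendar_sync_jobs"]  # hypothetical names

    conn = psycopg2.connect("dbname=app host=db.example.internal user=admin")
    with conn, conn.cursor() as cur:
        for table in LARGE_TABLES:
            cur.execute(
                f"""
                ALTER TABLE {table} SET (
                    autovacuum_vacuum_cost_delay = 10,  -- ms to pause between batches of page I/O
                    autovacuum_vacuum_cost_limit = 100  -- work allowed per batch before pausing
                )
                """
            )
    conn.close()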

Resolving the issue sooner

As this was a performance degradation rather than an outage, the scale of the problem was not clear. This led to the on-call engineer investigating the issue whilst performance degraded further without additional alerts being raised.

We will be adding additional alerts relating to performance degradation in several subsystems, so that the on-call engineer has a clearer picture of a problem's impact.

We are also updating our guidance on incident handling to encourage the team to switch to a more visible communication channel sooner, and to escalate alerts so that other on-call engineers are involved, particularly when the cause is not immediately clear.

Further questions?

If you have any further questions, please contact us at support@cronofy.com

Posted Feb 25, 2022 - 09:53 UTC

Resolved
Around 15:45 UTC our primary database in our US data center started showing signs of some performance degradation. We first received an alert at around 16:05 UTC as this problem grew more significant.

We made attempts to reduce load on the database through interventions such as temporarily disabling some of its background housekeeping processes. Often, giving a database such breathing room will allow it to recover by itself.

Around 16:45 UTC it was apparent our efforts were not bearing fruit and, as the performance of our US data center remained degraded from normal levels, we opened an incident to make it clear we were aware of the situation.

Around 17:00 UTC we decided to provision more capacity for the cluster in case it was necessary; this took around 10 minutes to come online.

Whilst that was provisioning, we temporarily reduced the capacity of background workers to see if that would clear the problem by reducing the load. This was unsuccessful, so around 17:15 UTC we decided to fail over to the new cluster capacity; after 5 minutes this had warmed up and performance had returned to normal levels.

There was a brief spike in errors from the US data center as a side effect of the failover, but otherwise the service was available throughout, albeit with degraded performance.

We will be conducting a postmortem of this incident and will share our findings by the end of the week.
Posted Feb 22, 2022 - 17:58 UTC
Identified
Our primary database is the source of the degraded performance. We have provisioned additional capacity to the cluster and failed over to make a new, larger node the primary one.

Early signs are positive and we are monitoring the service.
Posted Feb 22, 2022 - 17:19 UTC
Investigating
We are investigating degraded performance in our US data center.
Posted Feb 22, 2022 - 16:51 UTC
This incident affected: API and Background Processing.