On Tuesday, February 22nd 2022, our US data center experienced 95 minutes of degraded performance between 15:45 and 17:20 UTC.
This was caused by the primary PostgreSQL database hitting network bandwidth limits, which caused its performance to be throttled. The situation was caused, or at least exacerbated, by PostgreSQL's internal housekeeping processes working on two of our largest tables at the same time.
To our customers this would have surfaced as interactions with the US Cronofy platform (using the website or API) being much slower than normal. For example, the 99th percentile of API response times is usually around 0.5 seconds; during this incident it peaked at around 14 seconds.
We have since upgraded the underlying instances of this database, broadly doubling capacity and moving us well clear of the limit we were hitting.
All times are UTC on Tuesday, February 22nd 2022 and are approximate for clarity.
15:45 Our primary database in our US data center started showing signs of some performance degradation.
16:05 First alert received by the on-call engineer for a potential performance issue.
The on-call engineer attempted to reduce load on the database through interventions such as temporarily disabling some of its background housekeeping processes.
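Assuming the housekeeping in question is PostgreSQL's autovacuum (the report does not name the process, so this is our reading), temporarily disabling it for a single table can be sketched as follows; the table name is hypothetical:

```sql
-- Temporarily stop autovacuum from processing one large table
-- ("events" is a hypothetical table name used for illustration).
ALTER TABLE events SET (autovacuum_enabled = false);

-- Once the incident is over, return the table to the default behaviour.
ALTER TABLE events RESET (autovacuum_enabled);
```

Disabling autovacuum trades immediate load for future table bloat, so it should only ever be a short-term intervention.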
16:45 Incident opened on our status page informing customers of degraded performance in the US data center.
17:00 Began provisioning more capacity for the primary database as a fallback plan if efforts continued to be unsuccessful.
17:10 New capacity available.
17:15 Failed over to take full advantage of the new capacity by promoting the larger node to be the writer.
17:20 Performance had returned to normal levels in the US data center.
17:45 Decided we could close the incident.
18:00 Decided to lock in the capacity change and provisioned an additional reader node at the new size.
18:15 Removed the smaller nodes from the database cluster.
Whilst there was not an outage, this felt like a close call for us. This led to three key questions:
We had recently performed a major version upgrade on this database, and in the following weeks monitored performance closely. If ever there was a time we should have spotted an impending issue, this was it.
We believe we may have focussed too heavily on CPU and memory metrics in our monitoring, when it was network capacity that led to this degradation in performance. We will be reviewing our monitoring to add alerts that would have pointed us in the right direction sooner, as well as lower-priority alerts that would flag an upcoming capacity issue days or weeks in advance.
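As an illustration of the kind of alert we have in mind, a Prometheus-style rule on network throughput might look like the following; the metric source, instance label, and 250 MB/s threshold are all assumptions for the sketch, not our actual monitoring configuration:

```yaml
groups:
  - name: database-capacity
    rules:
      # Warn when sustained outbound throughput approaches the
      # instance's bandwidth limit (threshold is illustrative).
      - alert: DatabaseNetworkNearLimit
        expr: rate(node_network_transmit_bytes_total{instance="db-primary"}[5m]) > 250e6
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Primary database network throughput is approaching its bandwidth limit
```

A lower-priority variant of the same rule, with a smaller threshold and a longer `for` window, is the sort of thing that would give days or weeks of advance warning of a capacity problem.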
As PostgreSQL's internal housekeeping processes appeared to contribute significantly to the problem, we will be revisiting the configuration of these processes to see whether they can be altered to reduce the likelihood of such an impact in future.
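For example, assuming the housekeeping is autovacuum, PostgreSQL allows it to be throttled or retuned per table via storage parameters; the table name and values below are illustrative, not our production settings:

```sql
ALTER TABLE events SET (
  -- Sleep longer between batches of I/O so vacuum competes
  -- less with foreground queries (value in milliseconds).
  autovacuum_vacuum_cost_delay = 10,
  -- Trigger vacuum after 1% of rows change rather than the
  -- default 20%, so each run has less work to do.
  autovacuum_vacuum_scale_factor = 0.01
);
```

Smaller, more frequent vacuums on the largest tables also reduce the chance of two heavyweight runs coinciding, as they did here.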
As this was a performance degradation rather than an outage, the scale of the problem was not clear. This led to the on-call engineer investigating the issue whilst performance degraded further, without additional alerts being raised.
We will be adding additional alerts relating to performance degradation in several subsystems to raise awareness of the impact of a problem to an on-call engineer.
We are also updating our guidance on incident handling to encourage the team to switch to a more visible communication channel sooner, and to escalate alerts so that other on-call engineers are involved, particularly when the cause is not immediately clear.
If you have any further questions, please contact us at support@cronofy.com.