On Monday, October 18, 2021, between 13:03 and 16:10 UTC, Atlassian customers using Trello may have experienced service interruptions. These interruptions were caused by one of the shards of our primary database hitting 100% CPU utilization due to a change in request shape and volume from our web client. We resolved this by disabling a significant portion of traffic to Trello and then restoring it gradually as we monitored the recovery. While restoring Trello to full service for all customers took approximately 3 hours and 7 minutes in total, many customers regained full access much sooner because we allowed traffic back in incrementally.
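For readers unfamiliar with this kind of recovery, incremental restoration can be illustrated with percentage-based admission control. The sketch below is only an illustration under stated assumptions and does not describe Trello's actual mechanism: the function names, the 70% CPU ceiling, and the 10-point step size are all hypothetical.

```python
import hashlib


def bucket_for(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in [0, buckets).

    Hashing keeps the assignment deterministic, so a user who has been
    admitted stays admitted as the percentage grows.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def is_admitted(user_id: str, admit_percent: int) -> bool:
    """Admit the request only if the user's bucket is below the current percentage."""
    return bucket_for(user_id) < admit_percent


def next_admit_percent(
    current: int,
    cpu_utilization: float,
    step: int = 10,            # hypothetical ramp step, in percentage points
    cpu_ceiling: float = 0.70,  # hypothetical safety threshold for database CPU
) -> int:
    """Raise the admitted share while the database is healthy, lower it otherwise."""
    if cpu_utilization >= cpu_ceiling:
        return max(0, current - step)
    return min(100, current + step)
```

An operator or control loop would call `next_admit_percent` at each monitoring interval and pass the result to `is_admitted` for every incoming request. Because the buckets are stable, raising the percentage only ever adds users, which matches the observation that many customers regain access well before the restoration is complete.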
This service interruption was similar to the issues we saw on September 20 and 21: it occurred at a peak traffic time, was caused by more load than our primary database could handle, and resulted in a full service interruption for a short period, followed by a longer partial interruption while we restored service. The root cause this time was different, however: load built up on our primary database because of a change in request shape and volume from our web client, rather than because of a bug in our primary database software. To reduce load on our primary database in the future, we rolled out changes to our web client while we restored service.
We have mobilized a multi-team effort to address the root causes of these types of incidents, and expect to significantly reduce the likelihood of them in the future. Here are the streams of work that are in progress:
We know that outages are disruptive to your productivity, and we are placing the prevention of future service interruptions at the top of our priorities.