Trello is slow or unavailable

Incident Report for Trello

Postmortem

On Monday, October 18, 2021, between 13:03 and 16:10 UTC, Atlassian customers using Trello may have experienced service interruptions. These interruptions were caused by one of the shards of our primary database hitting 100% CPU utilization due to a change in request shape and volume from our web client. We resolved this by disabling a significant portion of traffic to Trello, and then restoring it gradually as we monitored the recovery. While the total time to restore Trello to full service for all customers was approximately 3 hours and 7 minutes, many customers had full access restored much sooner, as we allowed traffic back in incrementally.
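To illustrate what restoring traffic incrementally can look like, here is a minimal Python sketch of percentage-based admission. It is not Trello's actual mechanism, which this report does not describe. The function names (`bucket_for`, `is_admitted`, `ramp_schedule`), the use of a user ID as the key, and the 100-bucket scheme are all assumptions made for illustration.

```python
import hashlib
from typing import Iterator


def bucket_for(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in [0, buckets).

    Hashing keeps the mapping deterministic, so a user who is admitted
    at a given percentage stays admitted as the percentage increases.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def is_admitted(user_id: str, admit_percent: int) -> bool:
    """Admit the request if the user's bucket is below the current percentage."""
    return bucket_for(user_id) < admit_percent


def ramp_schedule(start: int, step: int, stop: int = 100) -> Iterator[int]:
    """Yield increasing admission percentages, always ending at `stop`."""
    percent = start
    while percent < stop:
        yield percent
        percent += step
    yield stop


if __name__ == "__main__":
    # Hypothetical ramp: an operator would advance to the next step only
    # after confirming that database CPU remains healthy.
    for percent in ramp_schedule(start=10, step=30):
        admitted = sum(is_admitted(f"user-{i}", percent) for i in range(1000))
        print(f"{percent}% admitted -> {admitted} of 1000 sample users")
```

Because the bucket assignment is stable, raising the percentage only ever adds users. Nobody who has regained access loses it again during the ramp, which is consistent with many customers seeing full service before the incident was fully resolved.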

This service interruption resembled the issues we saw on 9/20 and 9/21: it occurred at a peak traffic time, was caused by more load than our primary database could handle, and resulted in a brief full service interruption followed by a longer partial interruption while we restored service. The root cause, however, was different. This time it was a build-up of load on our primary database due to a change in request shape and volume from our web client, rather than a bug in our primary database software. To reduce load on our primary database in the future, we rolled out changes to our web client while we restored service.

We have mobilized a multi-team effort to address the root causes of these types of incidents, and expect to significantly reduce the likelihood of them in the future. Here are the streams of work that are in progress:

Improving resource tagging, monitoring, dashboards and database metric anomaly detection for all database instances
Reviewing existing database capacity to lower the likelihood of a near-term outage triggered by overutilized instances
Creating tooling and processes to support engineers who are building new features to determine any database performance impact
Making optimizations to reduce database load, including re-evaluating existing usage limits on certain features
Enhancing tooling to enable faster recovery in case of database performance issues in production
Completing the upgrade of our primary database system, which contains a critical connection bug fix that contributed to the incidents on 9/20 and 9/21

We know that outages are disruptive to your productivity, and we are placing the prevention of future service interruptions at the top of our priorities.

Posted Oct 26, 2021 - 13:15 EDT

Resolved

This incident has been resolved.

Posted Oct 18, 2021 - 13:28 EDT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Oct 18, 2021 - 12:42 EDT

Update

We are continuing to investigate this issue.

Posted Oct 18, 2021 - 12:03 EDT

Update

Our team is continuing to investigate this issue and will give you updates as soon as we have them.

Posted Oct 18, 2021 - 11:34 EDT

Update

Trello’s Engineering team is still working to identify the problem, and we’ll be back to normal as soon as we can!

Posted Oct 18, 2021 - 10:52 EDT

Update

We are continuing to investigate this issue.

Posted Oct 18, 2021 - 10:15 EDT

Update

We're continuing to investigate this issue and will give you updates as soon as we have them.

Posted Oct 18, 2021 - 09:58 EDT

Investigating

Trello is currently slow or unavailable.

Our engineering team is actively investigating this incident and working to bring Trello back up as quickly as possible.

Users affected by this incident may notice that Trello is slow or completely unavailable in both the web and mobile apps.

We will update this page as we have additional information.

Posted Oct 18, 2021 - 09:23 EDT

This incident affected: Trello.com and API.