Trello is slow
Incident Report for Trello
Postmortem

On February 10, 2021, between 14:25 UTC and 15:18 UTC, Atlassian customers using Trello may have experienced service interruptions.

This incident was caused by CPU saturation on our production database. Two related factors contributed: 1) a sudden surge in connections from a limited set of IPs, and 2) an abnormally high number of TCP connections to the database, which began failing and timing out, which in turn generated more TCP connections and further increased load. Our incident response team was alerted, and we recovered by cutting off traffic to Trello, which reduced CPU load and allowed TCP connections to complete successfully. We then gradually allowed traffic back into Trello.
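The feedback loop described above can be illustrated with a deliberately simplified toy model. It is a sketch under stated assumptions, not a description of Trello's actual systems: the function name `simulate`, the fixed per-tick `capacity`, and all numbers are hypothetical, and the model ignores the fact that a saturated database tends to complete fewer connections, not the same number.

```python
def simulate(arrivals, capacity):
    """Return the backlog of pending connection attempts after each tick.

    arrivals: new connection attempts per tick (hypothetical values).
    capacity: attempts the database can complete per tick (assumed constant).
    """
    backlog = 0
    history = []
    for new in arrivals:
        attempts = backlog + new             # retries plus fresh attempts
        completed = min(attempts, capacity)  # the database can only finish so many
        backlog = attempts - completed       # the rest time out and retry next tick
        history.append(backlog)
    return history


# Normal load, then a surge, then traffic cut off, then traffic restored.
print(simulate([80, 80, 150, 150, 150, 0, 0, 0, 80], capacity=100))
# prints [0, 0, 50, 100, 150, 50, 0, 0, 0]
```

In this toy run the backlog grows for as long as arrivals exceed capacity and drains only once incoming traffic drops to zero, which mirrors the recovery step of cutting off traffic and then letting it back in.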

We are prioritizing the following actions to avoid repeating this type of incident:

  • Reducing the baseline number of connections from each database query router to the database nodes, which will reduce the average number of connections, as well as the number of connections created when reconnecting.
  • Limiting certain intensive usage patterns that we observed at the start of the incident.
  • Improving our recovery tools in order to decrease our time to resolve incidents.
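
As a rough illustration of the first action, the arithmetic below shows how the per-router baseline multiplies across the fleet. It assumes each query router keeps the same baseline pool open to every database node. The function name and every number are hypothetical and are not taken from Trello's deployment.

```python
def baseline_connections(routers: int, nodes: int, pool_per_router: int) -> int:
    """Total connections held open when every router keeps
    `pool_per_router` connections to each of `nodes` database nodes.

    This is also roughly the number of connections re-created
    if every router reconnects at the same time.
    """
    return routers * nodes * pool_per_router


# Hypothetical fleet: 50 query routers, 3 database nodes.
before = baseline_connections(routers=50, nodes=3, pool_per_router=20)
after = baseline_connections(routers=50, nodes=3, pool_per_router=5)
print(before, after)  # prints 3000 750
```

Under these assumed numbers, lowering the per-router baseline from 20 to 5 cuts both the steady-state connection count and the size of a simultaneous reconnect burst by the same factor of four.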

We understand that outages negatively impact your productivity and we apologize for the inconvenience this has caused.

Posted Feb 22, 2021 - 16:20 EST

Resolved
This incident has been resolved.
Posted Feb 10, 2021 - 11:06 EST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 10, 2021 - 10:36 EST
Update
We are continuing to investigate this issue.
Posted Feb 10, 2021 - 10:14 EST
Investigating
We've noticed that Trello is responding slowly. This affects both the web and mobile apps.

Our engineering team is actively investigating this incident and working to bring Trello back up to speed as quickly as possible.

We'll keep you posted with further updates on this page.
Posted Feb 10, 2021 - 09:28 EST
This incident affected: Trello.com and API.