Trello is slow or unavailable for some users
Incident Report for Trello
Postmortem

SUMMARY

On August 11, 2022 at 10:00pm UTC, a portion of Trello users experienced slow or degraded performance in the product. The event was triggered by a sudden increase in load on Trello's MongoDB data store, which saturated the database's resources and caused it to become slow or unresponsive to queries. The incident was mitigated by disabling a feature flag that had allowed a recently deployed code path to execute. The time to resolution was 2 hours and 5 minutes.

IMPACT

Beginning on August 11, 2022 at 10:00pm UTC and extending to August 12, 2022 at 12:05am UTC (TTR of 2 hours and 5 minutes), Trello became slow or unresponsive for ~31% of users. During this time, all Trello functionality either loaded slowly or did not load at all. The incident was detected within 3 minutes by automated monitoring and was mitigated at 11:53pm UTC when the incident response team disabled the offending feature flag, terminating the code path that was causing the increased load on the database. By 12:05am UTC on August 12, 2022, full functionality was restored for all users.

ROOT CAUSE

The issue was caused by a change to Trello's server codebase that introduced a new write pattern to Trello's MongoDB data store. The change, coupled with an unexpected interaction with MongoDB's balancer (a system that balances data across MongoDB nodes), caused a sudden and significant spike in writes to the database. This, in turn, quickly overloaded MongoDB resources, rendering the database unable to respond to requests within a reasonable amount of time. The root cause was the introduction of the code change without an incremental rollout.

REMEDIAL ACTION PLAN & NEXT STEPS

We know that events such as this impact your productivity. While we have a number of testing and preventative processes in place, we were not able to simulate the unexpected interaction that caused this event during testing prior to deployment to the production environment.

We are prioritizing the following improvements to avoid repeating this type of incident:

  • Evaluating the configuration of the MongoDB balancer to avoid similar issues with existing or new write patterns.
  • Fixing the code that introduced the new write pattern.
  • Establishing a consistent pattern for incrementally deploying and testing changes behind feature flags.
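
To illustrate the last item, an incremental rollout behind a feature flag can be sketched as follows. This is a minimal example under stated assumptions, not Trello's actual implementation: the flag name, the function names, and the percentages are all hypothetical. Each user is hashed into a stable bucket, so the flag can be raised gradually (for example 1%, then 5%, then 25%) while database load is monitored, and set to 0 to switch the code path off entirely.

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99.

    Hashing the flag name together with the user ID keeps a user's bucket
    stable across requests, while giving different flags different cohorts.
    """
    key = f"{flag_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if the user falls inside the current rollout percentage."""
    if rollout_percent <= 0:
        return False  # acts as a kill switch: nobody runs the new code path
    if rollout_percent >= 100:
        return True   # fully rolled out
    return rollout_bucket(flag_name, user_id) < rollout_percent


# Hypothetical usage: gate a new write path at 5% of users.
if is_enabled("new-write-path", "user-123", rollout_percent=5):
    pass  # new code path would run here
else:
    pass  # existing code path would run here
```

Because a user's bucket does not change, anyone included at a lower percentage stays included as the percentage increases, which keeps the exposed cohort consistent between rollout steps.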

We apologize to customers whose services were impacted during this event, and we are taking immediate steps to improve Trello's performance and availability going forward.

Thanks,

Trello

Posted Aug 19, 2022 - 10:19 EDT

Resolved
This incident has been resolved.
Posted Aug 11, 2022 - 20:31 EDT
Update
We are continuing to monitor for any further issues.
Posted Aug 11, 2022 - 20:28 EDT
Monitoring
Trello is operational. We'll continue to investigate the root cause and monitor until the issue is resolved.
Posted Aug 11, 2022 - 20:19 EDT
Update
Trello servers are recovering, but may still be slow. We're working to identify the issue and continuing to monitor.
Posted Aug 11, 2022 - 19:39 EDT
Investigating
Our engineering team is actively investigating this incident and working to bring Trello back up as quickly as possible.
Users affected by this incident may notice that Trello is slow or completely unavailable in both the web and mobile apps.

We will update this page as we have additional information.
Posted Aug 11, 2022 - 18:32 EDT
This incident affected: Trello.com and API.