On Mar 18, 2021, between 05:04 and 06:04 UTC, customers of the Atlassian cloud products Bitbucket, Confluence, Jira Core, Jira Service Management and Jira Software had a degraded experience when attaching, viewing and uploading files. The event was triggered by a failure in a critical media platform service after a manual change was applied to the database without a corresponding code change. Automated alerts detected the incident four minutes after it began, at 05:08 UTC. We mitigated it by deploying a fix to the faulty service and scaling up services, which returned our systems to a known good state. The total time to resolution was approximately 60 minutes.
The incident involved a critical media platform service in the US region that was unable to process all incoming requests on Mar 18, 2021, between 05:04 and 06:04 UTC. As a result, customers were unable to view, upload or download attachments and files in Confluence, Jira Core, Jira Service Management and Jira Software. Bitbucket Pipeline builds were also impacted, with artifacts failing to upload during this period.
The incident was caused by a cleanup task that removed a database index. This was a manual change that was considered low risk. Because it was performed without a corresponding code change, it introduced a bug that caused database locking, which in turn left the critical media service unable to process all incoming requests.
We know that outages negatively impact your productivity. While we have a number of testing and preventative processes in place, this specific issue wasn’t identified in our test and staging environments before it reached production. The team implementing the database change considered it low risk, and the normal soak time in staging was skipped.
Moving forward, to minimize the impact of changes, our team will ensure that all changes, including low-risk ones, are deployed via the same process as code changes, with a minimum soak time of 24 hours in staging environments.
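As an illustration only, a minimum soak-time rule of this kind could be enforced with a simple promotion gate. The sketch below is hypothetical: the function name `can_promote`, the `MIN_SOAK` constant and the example timestamps are assumptions made for this example and do not describe any actual deployment tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumed policy value, taken from the 24-hour minimum stated above.
MIN_SOAK = timedelta(hours=24)


def can_promote(staging_deployed_at: datetime,
                now: datetime,
                min_soak: timedelta = MIN_SOAK) -> bool:
    """Return True only if the change has been in staging for at least min_soak.

    Both timestamps are expected to be timezone-aware (for example, UTC).
    """
    return now - staging_deployed_at >= min_soak


# Hypothetical example: a change deployed to staging at 05:00 UTC.
deployed = datetime(2021, 3, 17, 5, 0, tzinfo=timezone.utc)
too_early = deployed + timedelta(hours=2)
late_enough = deployed + timedelta(hours=24)

print(can_promote(deployed, too_early))
print(can_promote(deployed, late_enough))
```

A gate like this treats every change the same way, regardless of how risky it is believed to be, which is the point of the process change described above.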
This incident also surfaced a gap in our staging environment alerting, so we will update our process to ensure that alerts are raised in a more timely manner.
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the media platform’s performance and availability.
Thanks,
Atlassian Customer Support