Media functionality across multiple products is affected

Incident Report for Jira Work Management

Postmortem

SUMMARY

On Mar 18, 2021, between 05:04 and 06:04 UTC, customers of the Atlassian cloud products Bitbucket, Confluence, Jira Core, Jira Service Management, and Jira Software had a degraded experience when attaching, viewing, and uploading files. The event was triggered by a failure in a critical media platform service after a manual change to the database was applied without a corresponding code change. Automated alerts detected the incident at 05:08, four minutes after it began. It was mitigated by deploying a fix to the faulty service and scaling up services, which returned our systems to a known good state. The total time to resolution was approximately 60 minutes.

IMPACT

The incident involved a critical media platform service in the US region that was unable to process all incoming requests on Mar 18, 2021, between 05:04 and 06:04 UTC. As a result, customers were unable to view, upload, or download attachments and files in Confluence, Jira Core, Jira Service Management, and Jira Software. Bitbucket Pipelines builds were also impacted, with artifacts failing to upload during this period.

ROOT CAUSE

The incident was caused by a cleanup task that removed a database index. This was a manual change that was considered low risk. It was performed without a corresponding code change and introduced a bug that resulted in database locking, which in turn left the critical media service unable to process all incoming requests.

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages negatively impact your productivity. While we have a number of testing and preventative processes in place, this specific issue wasn’t identified in our test and staging environments before it reached production. The team implementing the database change considered it low risk, and the normal soaking time in staging was skipped.

Moving forward, to minimize the impact of changes, our team will ensure that all changes, including low-risk ones, are deployed via the same process as code changes, with a minimum soaking time of 24 hours in staging environments.

This incident also surfaced a gap in our alerts in staging environments, so we will update our process to ensure that alerts are raised in a more timely manner.

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the media platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Mar 25, 2021 - 23:05 UTC

Resolved

Between 2021-03-18 05:04 UTC and 2021-03-18 06:04 UTC, we experienced issues with file uploads and downloads for Confluence, Jira Core, Jira Software, and Atlassian Bitbucket. The issue has been resolved and the service is operating normally.

Posted Mar 18, 2021 - 07:21 UTC

Update

We have identified the root cause of the file upload and download issue and have mitigated the problem. We are now monitoring closely.

Posted Mar 18, 2021 - 06:22 UTC

Monitoring

We have identified the root cause of the file upload and download issue and have mitigated the problem. We are now monitoring closely.

Posted Mar 18, 2021 - 06:20 UTC

Identified

We continue to work on resolving the issue with file uploads and downloads for Confluence, Jira Core, Jira Software, and Atlassian Bitbucket. We have identified the root cause and expect recovery shortly.

Posted Mar 18, 2021 - 05:55 UTC

Investigating

We are investigating an issue with Media that is impacting Confluence, Jira Core, Jira Software, and Atlassian Bitbucket Cloud customers. We will provide more details within the next hour.

Posted Mar 18, 2021 - 05:33 UTC

This incident affected: Viewing content, Create and edit, and Marketplace.