On June 22, 2022, from 01:08 AM UTC to 03:42 AM UTC, some customers using Bitbucket Pipelines, Confluence Cloud, Forge, and the Jira Cloud family of products (Jira Software, Jira Service Management, Jira Work Management) were impacted. Bitbucket Pipelines saw an increase in build failures, while Jira, Confluence, and Forge experienced performance and functionality degradation. The event was triggered by our internal Artifact Repository Manager becoming unavailable during a scheduled multi-availability zone disaster recovery test. Customers across all regions were affected. The incident was detected within two minutes by monitoring and mitigated by restarting the Artifact Repository service, which recovered the affected products. The total time to resolution was about three hours.
The overall impact occurred between June 22, 2022, 01:08 AM UTC, and June 22, 2022, 05:58 AM UTC, on Bitbucket Pipelines, Confluence Cloud, Forge, and the Jira Cloud family of products (Jira Software, Jira Service Management, Jira Work Management). The outage of the internal Artifact Repository Manager caused scalability problems in these products and prevented us from building or deploying new versions of our services. As a result, performance and functionality were degraded for most of these products.
The issue was caused by an outage of the internal Artifact Repository Manager during the planned multi-availability zone disaster recovery test. As a result, the products listed above could not access the Docker images and other artifacts they needed to scale up, which caused partial degradation or complete unavailability of services for some customers. Restarting the internal Artifact Repository Manager caused downtime for the service but led to a successful recovery.
We know that outages may impact your productivity. After the immediate impact of this outage was resolved, the incident response team completed a technical analysis of the root cause and contributing factors. The team has conducted a post-incident review to determine how we can avoid the impact of this kind of outage in the future.
We are prioritizing the following improvement actions to avoid repeating this type of incident:
To minimize the impact of such incidents on our customers, we will implement additional preventative measures such as:
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.
Thanks,
Atlassian Customer Support