On December 16, 2022, between 10:07 and 15:40 UTC, Atlassian's internal artifact management infrastructure experienced an outage. Some Atlassian customers who use Bitbucket Pipelines, use Marketplace apps and integrations, or manage attachments across Atlassian cloud products were impacted. Bitbucket users experienced failures of Bitbucket pipelines and Bitbucket pipeline runners, users of Marketplace apps and integrations experienced missing webhooks, and actions on attachments (especially uploads) were impacted across all our products. The event was triggered by Atlassian's internal Artifact Repository Manager becoming unavailable due to a combination of abnormally high load and misconfigured rate limiting and circuit breaking. Customers in all regions were impacted. The incident was immediately detected by our monitoring systems and mitigated by changing policies and configuration, which allowed the Artifact Repository Manager to recover. The total time to resolution was 5 hours and 33 minutes.
We detected the impact of this incident on December 16, 2022, at 10:07 UTC. Recovery started at 11:00 UTC, most functionality was restored by 12:12 UTC, and full recovery was achieved at 15:40 UTC.
Below is the breakdown of the impact for each product.
Marketplace apps and integrations:
Bitbucket:
All Atlassian cloud products:
The issue was caused by an outage of Atlassian's internal Artifact Repository Manager due to a combination of abnormally high load and misconfigured rate limiting and circuit breaking. As a result, the products listed above could not access the Docker images and other artifacts they needed to scale up, which caused partial degradation of some services and complete unavailability of others for customers. Restarting the internal Artifact Repository Manager and changing its policies and configuration caused additional downtime for the service but led to a successful recovery.
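For readers unfamiliar with circuit breaking, the following is a generic, minimal sketch of the pattern in Python. It does not represent the actual configuration or implementation of the Artifact Repository Manager, which this review does not describe. The class name, the `failure_threshold` and `reset_timeout` parameters, and their default values are all hypothetical and chosen only for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Illustrative circuit breaker; all names and defaults are hypothetical."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable for deterministic tests
        self.failures = 0
        self.opened_at = None                       # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of adding load to a struggling dependency.
                raise CircuitOpenError("dependency marked unavailable; failing fast")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the breaker.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A client would wrap each request to the dependency in `breaker.call(...)`. If the threshold or timeout is set too loosely, callers keep sending requests to a service that is already overloaded, which is the general kind of risk that circuit breaking is meant to limit.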
We know that outages impact your productivity. After the immediate impact of this outage was resolved, the incident response team completed a technical analysis of the root cause and contributing factors. The team has conducted a post-incident review to determine how we can avoid the impact of this kind of outage in the future.
We are prioritizing the following improvement actions to minimize the likelihood of incidents of this type recurring:
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.
Thanks,
Atlassian Customer Support