Confluence and Jira - Degraded Editing, Commenting and Media Experience in APSE1
Incident Report for Jira Software
Postmortem

SUMMARY

On May 21, 2021, between 06:07 and 06:40 UTC, customers of Confluence Cloud, Jira Software, Jira Service Management and Jira Work Management in the Asia Pacific South East region experienced slow performance and/or a complete loss of service. The event was triggered by an unintended reduction in capacity of a network proxy service used to process requests for these Atlassian cloud products. Capacity was reduced for customers in the Asia Pacific South East region only. The incident was detected within 3 minutes by automated monitoring systems and mitigated by increasing proxy service capacity to a level sufficient to handle the traffic volume. During this scale-up period, customer traffic was also redirected to other regions to expedite the return of service. The total time to resolution was approximately 33 minutes.

IMPACT

The impact occurred on May 21, 2021, between 06:07 UTC and 06:40 UTC, and affected the Confluence Cloud, Jira Software, Jira Service Management and Jira Work Management products. The incident disrupted service for Asia Pacific South East customers only: their requests were handled by the network proxy, which was under-provisioned and unable to scale sufficiently to handle all customers' requests successfully. Atlassian customers would have seen specific impact as follows:

Confluence Cloud:

Confluence Cloud customers encountered impact to actions including creating, viewing, editing, and commenting on pages. At its peak, the most affected action was editing, where up to 18% of user requests failed. Once the incident was resolved, all actions returned to normal.

Jira:

Jira customers' usage is represented through "issue" usage. We observed impact to the throughput of issues: initially a 40% reduction from typical volumes and subsequently, as request volumes recovered, a failure rate of up to 2.5% of user requests. Customers of Jira Service Management, Jira Work Management and Jira Software saw these impacts as either gateway errors or in-app failure notifications for the duration of the incident.

Proxy Service:

The Atlassian service that proxies customer requests encountered a 12% failure rate on requests, compared with traffic globally. The impact on the service was localised to Asia Pacific. Subsequently, the proxy service saw a 30% reduction in traffic, as follow-up requests that would have occurred in normal operation were not made.

ROOT CAUSE

The issue was caused by an automated system maintenance operation that performed a service redeploy, but with incorrect capacity parameters. Normally, the system would provision new proxy instances equal in number to the existing cohort so that load can be transitioned cleanly to updated instances without disrupting customer connectivity. On this occasion, the new proxy cluster entered service with only one third of the instances available to handle requests. With so few nodes available, application load saturated available system resources, particularly CPU. As a result, traffic was dropped and end-users experienced network timeouts or 500 errors.
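
The capacity expectation described above (a new cohort equal in number to the existing one) can be expressed as a simple check before traffic is moved. The following is a minimal sketch only; the `Cohort` type, the `safe_to_cut_over` function, the `min_ratio` parameter and the instance counts are all hypothetical illustrations, not Atlassian's deployment tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cohort:
    """A group of proxy instances (hypothetical model for illustration)."""
    name: str
    healthy_instances: int


def safe_to_cut_over(old: Cohort, new: Cohort, min_ratio: float = 1.0) -> bool:
    """Return True only if the new cohort has at least `min_ratio` times
    as many healthy instances as the old cohort it is replacing."""
    if old.healthy_instances <= 0:
        # Nothing to compare against; require at least one healthy new instance.
        return new.healthy_instances > 0
    return new.healthy_instances >= min_ratio * old.healthy_instances


# Illustrative numbers only: a new cohort at one third of the old size.
old = Cohort("proxy-old", healthy_instances=30)
under_provisioned = Cohort("proxy-new", healthy_instances=10)
full = Cohort("proxy-new", healthy_instances=30)

print(safe_to_cut_over(old, under_provisioned))  # False
print(safe_to_cut_over(old, full))               # True
```

A guard of this kind would block the traffic transition when the new cohort is smaller than the one it replaces, regardless of why the provisioning fell short.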

A pre-existing bug was discovered as the cause of this incident, and on inspection of our logs we discovered that these provisioning shortfalls had been occurring for other systems. In these previous instances the bug was masked either by a more gradual transition of traffic between old and new cohorts, by AWS auto-scaling, or because it happened during periods of low service request volume. 
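
One way to see why low request volume could mask the shortfall is a rough per-instance load estimate. The sketch below assumes load is spread evenly across instances and scales linearly with instance count; both are simplifying assumptions, and the function names and utilization figures are hypothetical, not measurements from this incident.

```python
def per_instance_utilization(baseline_utilization: float,
                             capacity_fraction: float) -> float:
    """Estimate per-instance CPU utilization after capacity shrinks.

    baseline_utilization: average utilization at full capacity (0.0 to 1.0).
    capacity_fraction: share of the original instances still serving
        traffic (for example 1/3).
    Assumes load is spread evenly and scales linearly with instance count.
    """
    if not 0 < capacity_fraction <= 1:
        raise ValueError("capacity_fraction must be in (0, 1]")
    return baseline_utilization / capacity_fraction


def saturates(baseline_utilization: float, capacity_fraction: float) -> bool:
    """True when the estimated per-instance utilization reaches 100%."""
    return per_instance_utilization(baseline_utilization, capacity_fraction) >= 1.0


# Hypothetical figures: a busy period versus a quiet period, both at
# one third of normal capacity.
print(saturates(0.5, 1 / 3))  # True: roughly 150% of one instance's CPU is demanded
print(saturates(0.2, 1 / 3))  # False: roughly 60%, so the shortfall goes unnoticed
```

Under these assumptions, a cohort at one third of its normal size saturates whenever baseline utilization exceeds about one third, which is consistent with the shortfall going unnoticed during quiet periods.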

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages are impactful to your productivity.   

We are prioritizing the following improvement actions to avoid the recurrence of a similar incident in the future:

  • We have revised the workflow used in automated redeployments to ensure consistency with the workflow available to service owners.
  • We are working to provide Atlassian services with an extended scaling policy option for application in an emergency.

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Jun 01, 2021 - 16:34 UTC

Resolved
Between 05:09 UTC and 05:29 UTC, customers experienced degraded editing and commenting for Confluence, Jira Work Management, Jira Service Management, Jira Software, and Jira Align. The issue has been resolved and the service is operating normally.
Posted May 21, 2021 - 06:01 UTC
Investigating
We are investigating reports of intermittent errors for some Confluence, Jira Work Management, Jira Service Management, Jira Software, and Jira Align Cloud customers. We will provide more details once we identify the root cause.
Posted May 21, 2021 - 05:40 UTC