Overview
At 1:50 PM AEST on 4 August 2021, Servers Australia Engineers were alerted to a stack member reload on a core networking device in our Equinix SY1 footprint. Engineers immediately began investigating and, after observing a second reload, isolated the cause to a sudden burst of broadcast traffic from a downstream interface, which caused the stack members to reload one at a time.
Further analysis confirmed that a stack member reloaded unexpectedly under the increased strain of the broadcast traffic, affecting additional services for a few minutes while it came back online. Engineers began planning an emergency firmware upgrade of the network device, as well as implementing additional guards against any further bursts.
At 6:11 PM AEST, Engineers were again alerted to a reload event on the same network device. Unlike the previous events, no broadcast traffic triggered this reload, so Engineers brought forward plans to complete an emergency firmware upgrade to avoid any further disruption. Engineers worked directly with some customers utilising IP storage services to mitigate the impact, then at 8:06 PM AEST initiated the emergency firmware upgrade, which completed successfully at 8:11 PM AEST. Engineers monitored into the evening and were satisfied the emergency firmware upgrade had resolved the issue.
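For readers interested in what the "additional guards against any further bursts" typically look like, per-interface broadcast storm control is the standard mitigation. The sketch below assumes a Cisco IOS-style platform; the actual vendor, interface names, and thresholds were not disclosed in this report and are illustrative only.

```
! Hedged sketch: broadcast storm control on an assumed Cisco IOS-style switch.
! Interface name and the 1% threshold are placeholders, not values from this incident.
interface GigabitEthernet1/0/1
 storm-control broadcast level 1.00
 storm-control action shutdown
!
! Optionally allow an interface shut down by storm control to recover automatically.
errdisable recovery cause storm-control
errdisable recovery interval 300
```

With `storm-control action shutdown`, a broadcast burst exceeding the threshold error-disables only the offending downstream port, rather than straining the whole stack.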
Timeline
13:50 - Secondary stack member unexpectedly reloads
13:55 - Secondary stack member re-joins topology
13:55 - Engineers begin investigating
13:56 - Primary stack member unexpectedly reloads
14:00 - Primary stack member re-joins topology
14:03 - Services back up after stack members return to service
14:19 - Primary stack member unexpectedly reloads
14:21 - Engineers shut down interface sending broadcast traffic
14:23 - Primary stack member re-joins topology
14:23 - Secondary stack member unexpectedly reloads
14:27 - Secondary stack member re-joins topology
14:30 - Services back up after core switches return to service
14:38 - Engineers individually review all port configurations on the network device to eliminate any possible further issues
15:00 - Engineers begin adding additional downstream protection, specific to the affected switching vendor, to all interfaces
16:30 - Engineers complete adding additional downstream protection to all interfaces
18:11 - Engineers are alerted to further reload events on the core networking device, this time not caused by broadcast traffic
18:22 - Services back up after core switches return to service
20:06 - Engineers initiate an emergency firmware upgrade, including a reload of the core switching device
20:11 - Firmware upgrade complete and services return to an operational state
Further Action
An internal project to retire and replace the affected network device was already under way; given this incident, NOC Engineers are working to accelerate it. Delivery of replacement hardware for this core switching upgrade depends on multiple third parties, and COVID-related supply chain constraints are causing delays across all major suppliers. As soon as a date for this replacement is confirmed, we will share a planned maintenance notice on our status page: https://status.mysau.com.au.