Overview
At 1:50 PM AEST on 4 August 2021, Servers Australia Engineers were alerted to a stack member reload on a core networking device in our Equinix SY1 footprint. Engineers immediately began investigating and, after observing a second reload, isolated the cause to a sudden burst of broadcast traffic from a downstream interface, which caused the stack members to reload one at a time.
Further analysis confirmed that a stack member reloaded unexpectedly under the increased strain of the broadcast traffic, affecting additional services for a few minutes while it came back online. Engineers began planning an emergency firmware upgrade of the network device, as well as implementing additional guards against any further bursts.
At 6:11 PM AEST, Engineers were again alerted to a reload event on the same network device. Unlike the previous events, no broadcast traffic triggered this reload, so Engineers brought forward plans to complete an emergency firmware upgrade to avoid any further disruption. Engineers worked directly with some customers utilising IP storage services to mitigate the impact, then at 8:06 PM AEST initiated the emergency firmware upgrade, which completed successfully at 8:11 PM AEST. Engineers monitored into the evening and were satisfied the emergency firmware upgrade had resolved the issue.
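For readers interested in what the "additional guards against any further bursts" typically look like, per-interface broadcast storm control is the standard mitigation. The sketch below assumes a Cisco IOS-style platform; the actual vendor, interface names, and thresholds were not disclosed in this report and are illustrative only.

```
! Hedged sketch: broadcast storm control on an assumed Cisco IOS-style switch.
! Interface name and the 1% threshold are placeholders, not values from this incident.
interface GigabitEthernet1/0/1
 storm-control broadcast level 1.00
 storm-control action shutdown
!
! Optionally allow an interface shut down by storm control to recover automatically.
errdisable recovery cause storm-control
errdisable recovery interval 300
```

With `storm-control action shutdown`, a broadcast burst exceeding the threshold error-disables only the offending downstream port, rather than straining the whole stack.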
Timeline
13:50 - Secondary stack member unexpectedly reloads
13:55 - Secondary stack member re-joins topology
13:55 - Engineers begin investigating
13:56 - Primary stack member unexpectedly reloads
14:00 - Primary stack member re-joins topology
14:03 - Services back up after stack members return to service
14:19 - Primary stack member unexpectedly reloads
14:21 - Engineers shut down interface sending broadcast traffic
14:23 - Primary stack member re-joins topology
14:23 - Secondary stack member unexpectedly reloads
14:27 - Secondary stack member re-joins topology
14:30 - Services back up after core switches return to service
14:38 - Engineers individually review all port configurations on the network device to eliminate any possible further issues
15:00 - Engineers begin adding additional downstream protection, specific to the affected switching vendor, to all interfaces
16:30 - Engineers complete adding additional downstream protection to all interfaces
18:11 - Engineers are alerted to further reload events on the core networking device, this time not caused by broadcast traffic
18:22 - Services back up after core switches return to service
20:06 - Engineers initiate an emergency firmware upgrade, including a reload of the core switching device
20:11 - Firmware upgrade complete and services return to an operational state
Further Action
An internal project to retire and replace the affected network device was already under way; given this incident, NOC Engineers are working to accelerate it. Delivery of replacement hardware for this core switching upgrade depends on multiple third parties, and COVID-related supply chain constraints are causing delays across all major suppliers. As soon as a date for this replacement is confirmed, we will share a planned maintenance notice on our status page: https://status.mysau.com.au.