Service Interruption Related to Power Maintenance
Incident Report for Servers Australia
Postmortem

Overview

At 12:00 PM AEST on 4 August 2021, Servers Australia technicians were alerted to a disruption affecting customers using legacy virtual firewall appliances. The disruption was isolated to a single hypervisor. Engineers attended the site to investigate and found that the node was inaccessible due to power works being carried out in the datacentre at the time (https://status.mysau.com.au/incidents/80tsfyxtkwvd). The onsite engineer rerouted and restored power to the hypervisor so that technicians could concentrate on restoring connectivity to the firewall appliances. Although a disaster recovery (DR) plan had been initiated, it was determined that restoring the firewall appliances on the primary hypervisor would achieve a faster resolution, so the plan was cancelled.

Connectivity was restored, and work is planned to adjust configurations to speed up the disaster recovery process in the future.

Timeline

12:00 - Power disruption to a hypervisor occurs. All services that utilise virtual firewalls on this hypervisor become unreachable.
12:20 - Engineers begin to enact a disaster recovery plan to bring up services on a replica hypervisor that remained online.
12:24 - Onsite technician isolates the fault and restores power to the hypervisor.
12:30 - As part of the disaster recovery plan up to this point, some firewall appliances are started on the replica hypervisor. These are later shut down again so that all operations can be restored to the original production hypervisor.
12:40 - Engineers confirm the original hypervisor is healthy again. The disaster recovery plan is cancelled because completing the remainder of the DR cutover is estimated to take longer than returning the original hypervisor to production status.
12:50 - Firewall appliances begin to power on on the original hypervisor. Services come back online over the next twenty minutes as the firewalls start.
13:13 - All monitored firewall appliances are confirmed online via internal monitoring tools.
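
The elapsed times implied by the timeline above can be worked out with simple arithmetic. The following is a minimal sketch only: the event labels and the `minutes_between` helper are illustrative names that do not appear in the original report, and the results are approximate because timeline entries are given to the nearest minute.

```python
from datetime import datetime

# Timestamps copied from the timeline above (4 August 2021, AEST).
# The short labels are illustrative, not terms from the report.
TIMELINE = {
    "power_disruption": "12:00",
    "dr_plan_started": "12:20",
    "power_restored": "12:24",
    "dr_plan_cancelled": "12:40",
    "firewalls_powering_on": "12:50",
    "all_firewalls_online": "13:13",
}


def minutes_between(start_label: str, end_label: str) -> int:
    """Return whole minutes elapsed between two same-day timeline events."""
    fmt = "%H:%M"
    start = datetime.strptime(TIMELINE[start_label], fmt)
    end = datetime.strptime(TIMELINE[end_label], fmt)
    return int((end - start).total_seconds() // 60)


total_outage = minutes_between("power_disruption", "all_firewalls_online")
without_power = minutes_between("power_disruption", "power_restored")
recovery_after_power = minutes_between("power_restored", "all_firewalls_online")

print(f"Total disruption: {total_outage} min")
print(f"Hypervisor without power: {without_power} min")
print(f"Power restored to all firewalls online: {recovery_after_power} min")
```

On these figures, most of the disruption falls between power being restored and the last firewall appliance being confirmed online, which is consistent with the report's focus on shortening the recovery process.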

Further Action

All scheduled power works have been completed. The onsite engineer has also adjusted the power configuration of the impacted hypervisor so that it will not be affected by any future power works within the datacentre. Additional work is being carried out to reduce the time required should a DR plan need to be enacted in the future. Customers who wish to be moved to a newer, more robust solution are advised to contact their account manager, who can facilitate the request.

Posted Aug 06, 2021 - 10:13 AEST

Resolved
This incident has been resolved.
Posted Aug 06, 2021 - 10:12 AEST
Monitoring
Engineers have confirmed the stability of the core networking device and will be monitoring the situation further. A report will be attached to this incident once investigative reports have been completed.
Posted Aug 04, 2021 - 15:19 AEST
Update
Engineers have confirmed the impacted core networking device is operational again and are continuing to review the affected device and cause of the disruption.
Posted Aug 04, 2021 - 14:43 AEST
Update
The engineering team has confirmed the core networking device has reloaded again, causing additional disruption. On-site engineers are continuing to investigate.
Posted Aug 04, 2021 - 14:21 AEST
Update
Engineers have confirmed a core networking device in the Equinix SY1 datacentre experienced an unexpected reload impacting additional services for a short time whilst it came back online. A full investigation is underway into the cause of this disruption.
Posted Aug 04, 2021 - 14:13 AEST
Update
Engineers have confirmed additional services are impacted. We are currently working to identify the source of this further disruption and will provide updates once the cause has been identified.
Posted Aug 04, 2021 - 13:59 AEST
Identified
Engineers have received further monitoring alerts and are currently investigating the cause. At this stage, this does not appear to be related to the service disruption reported earlier.
Posted Aug 04, 2021 - 13:52 AEST
Monitoring
Engineers have restored services on the original hypervisor providing firewall services. Further investigation is now underway into the cause of this incident and its connection to the planned maintenance.

Further information will be provided in a final update.
Posted Aug 04, 2021 - 13:13 AEST
Identified
Engineers have confirmed a hypervisor providing virtualised firewall services is impacted as a result of the power maintenance. Work is now underway to return the firewall services to operation as soon as possible.
Posted Aug 04, 2021 - 12:23 AEST
Investigating
Engineers have received monitoring alerts in relation to the planned power maintenance and are commencing an investigation into the cause. At this point in time, engineers believe this is isolated to a single hypervisor that provides virtualised firewall services.

Further information will be provided as soon as possible.
Posted Aug 04, 2021 - 12:10 AEST
This incident affected: Regions (Sydney).