Overview
At 12:00 PM AEST on 4 August 2021, Servers Australia technicians were alerted to a disruption affecting customers using legacy virtual firewall appliances. The disruption was isolated to a single hypervisor. Engineers attended the site and investigated the cause. They found that the node was inaccessible due to power works being carried out in the datacentre at the time (https://status.mysau.com.au/incidents/80tsfyxtkwvd). The onsite engineer rerouted and restored power to the hypervisor so that technicians could concentrate on restoring connectivity to the firewall appliances. Although a disaster recovery (DR) plan had been initiated, it was determined that restoring the firewall appliances on the primary hypervisor would achieve a faster resolution, so the plan was cancelled.
Connectivity was restored, and work has been planned to adjust configurations so that the disaster recovery process is faster in future.
Timeline
12:00 - Power disruption to a hypervisor occurs. All services that utilise virtual firewalls on this hypervisor become unreachable.
12:20 - Engineers begin to enact a disaster recovery plan to bring up services on a replica hypervisor that has remained online.
12:24 - Onsite technician isolates the fault and restores power to the hypervisor.
12:30 - As part of the disaster recovery plan, some firewall appliances have been powered on on the replica hypervisor by this point. These are later shut down so that all operations can be restored to the original production hypervisor.
12:40 - Engineers confirm that the original hypervisor is healthy again. The disaster recovery plan is cancelled, as completing the remainder of the DR cutover is estimated to take longer than returning the original hypervisor to production.
12:50 - Firewall appliances begin to power on on the original hypervisor. Services come back online over the next twenty minutes as the firewalls start up.
13:13 - All monitored firewall appliances are confirmed online via internal monitoring tools.
Further Action
All scheduled power works have now been completed. In addition, the onsite engineer adjusted the power configuration of the impacted hypervisor so that it will not be affected by any future power works within the datacentre. Additional work is being carried out to reduce the time required to enact a DR plan, should one be needed in the future. Customers who wish to be moved to a newer, more robust solution are advised to reach out to their account manager, who can facilitate this request.