On the 6th of August 2022 at 16:32 AEST, Squiz internal monitoring detected a degradation of services for a few customers hosted in our Melbourne Data Centre.
The Squiz Data Centre team was engaged and identified that a few of the Production and Disaster Recovery (DR) VMs had stopped operating. Actions taken included restarting the impacted VMs, with full recovery at 17:45 AEST.
Between 16:32 AEST and 17:45 AEST on the 6th of August 2022, some customers hosted in the Melbourne Data Centre may have experienced a degradation of service as their VMs were unreachable.
As part of a physical server relocation activity in the Melbourne Data Centre, at ~12:28 AEST on the 6th of August 2022, a server was shut down and physically moved to its new position. At ~15:12 AEST this server was reconnected to the network and powered up. No issues were observed and internal monitoring was clear. At 16:30 AEST the main dashboard hosted in the Melbourne Data Centre, which displays VM information, started showing errors. In response, our Data Centre team restarted a process running on the compute node, which, at 16:32 AEST, caused a few VMs to enter a “paused” state. The impacted VMs were rebooted one by one, resulting in partial recovery at 17:15 AEST. The last VM was successfully rebooted at 17:45 AEST, resulting in a complete recovery.
In response to this incident, the Squiz Data Centre team is undertaking the following actions: