Timeline
On Wednesday, June 9th, the hosting provider for our UK1 hosting location carried out planned routine maintenance on several of our servers. During the maintenance, servers were taken out of the production pool one by one.
At 1:45PM BST, during maintenance actions on one of the final servers, another server crashed due to an unrelated hardware failure. Because of this crash, some virtual machines were not moved to a healthy host in time. This caused downtime for some environments.
At 2:13PM BST the maintenance was completed and our servers were back online. Most TOPdesk environments recovered quickly, but some took up to 45 minutes to start correctly.
While we were starting TOPdesk environments and verifying that all services were back online, several customers reported that their TOPdesk environment was much slower than usual after the disruption. We investigated and found that the disruption had left one firewall in an inconsistent state, which caused traffic to that specific node to be dropped.
At 4:10PM BST all firewall configurations were restored to the correct state and performance returned to normal.
Follow-up actions
Monitoring of the firewall state is being implemented, allowing us to notice this issue much sooner should it occur again. We’re also working to update our firewall configuration to ensure the inconsistent state cannot recur.
A team is investigating what can be done to reduce the time it takes to start TOPdesk environments. This will reduce the recovery time for future disruptions.