SaaS disruption in UK1 hosting location

Incident Report for TOPdesk SaaS Status page

Postmortem

Timeline

On Wednesday June 9th the hosting provider for our UK1 hosting location executed planned routine maintenance on several of our servers. During the maintenance, servers were taken out of the production pool one by one.

At 1:45PM BST, during maintenance actions on one of the final servers, another server crashed due to an unrelated hardware failure. Because of this crash, some virtual machines were not moved to a healthy host in time. This caused downtime for some environments.

At 2:13PM BST the maintenance was completed and our servers were back online. Most TOPdesk environments recovered quickly, but some took up to 45 minutes to start correctly.

While we were starting TOPdesk environments and verifying that all services were back online, several customers reported that their TOPdesk environment was much slower than usual after the disruption. We investigated the root cause and found that, because of the disruption, one firewall node had ended up in an inconsistent state, causing traffic to that node to be dropped.

At 4:10PM BST all firewall configurations were restored to the correct state and performance was again back to normal.

Follow-up actions

Monitoring of the firewall state is being implemented, allowing us to notice this issue much sooner should it occur again. We’re also working to update our firewall configuration to ensure the inconsistent state can’t recur.
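For illustration only, a firewall state check of this kind could be sketched as below. This is not TOPdesk's actual monitoring: the node names, the idea of comparing configuration dumps, and the functions `fingerprint` and `find_inconsistent_nodes` are all hypothetical assumptions. The sketch assumes every firewall node is expected to carry the same configuration, and flags any node whose configuration differs from the majority.

```python
import hashlib
from collections import Counter


def fingerprint(config_text: str) -> str:
    """Return a stable SHA-256 digest of a configuration dump (hypothetical input)."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def find_inconsistent_nodes(configs: dict[str, str]) -> list[str]:
    """Return the names of nodes whose configuration differs from the majority.

    `configs` maps a node name to its configuration dump. Both the names and
    the dump format are assumptions made for this sketch.
    """
    if not configs:
        return []
    digests = {node: fingerprint(text) for node, text in configs.items()}
    # The most common digest is treated as the expected state.
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(node for node, d in digests.items() if d != majority_digest)


if __name__ == "__main__":
    # Hypothetical example data: fw-2 has drifted from the other two nodes.
    sample = {
        "fw-1": "allow 443\nallow 80\n",
        "fw-2": "allow 443\n",
        "fw-3": "allow 443\nallow 80\n",
    }
    print(find_inconsistent_nodes(sample))
```

A check like this only detects drift between nodes. It cannot tell which state is correct when the nodes are evenly split, so a real setup would more likely compare against a known-good reference configuration.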

A team is investigating what can be done to reduce the time it takes to start TOPdesk environments. This will reduce the recovery time for future disruptions.

Posted Jun 25, 2021 - 11:57 CEST

Resolved

This morning the hosting provider executed planned routine maintenance on several of our servers. During the maintenance, servers were taken out of the production pool one by one.

At 1:45PM BST, during maintenance actions on one of the final servers, another server crashed due to hardware failure. Because of this crash, some virtual machines were not moved to a healthy host in time. This caused downtime for some environments.

At 2:13PM BST the maintenance was completed and our servers were back online. Most TOPdesk environments recovered quickly, but some took up to 45 minutes to start correctly.

While we were starting TOPdesk environments and verifying all services were back online, several customers reported their TOPdesk environment was much slower than usual after the disruption. We investigated the root cause of this problem, and found that due to the disruption one firewall ended up in an inconsistent state, causing traffic to that specific node to be dropped.

At 4:10PM BST all firewall configurations were restored to the correct state and performance was again back to normal.

If you still encounter problems while working in your TOPdesk environment, please contact TOPdesk Support.

Posted Jun 09, 2021 - 17:28 CEST

Update

Our engineers have found the root cause of the slow performance and are working to resolve the issue.

Posted Jun 09, 2021 - 17:07 CEST

Update

All TOPdesk environments are back online and all services are running. We're still investigating reports of slowness occurring after the disruption.

Posted Jun 09, 2021 - 16:30 CEST

Update

During planned routine maintenance on one of our servers by the hosting provider, another server crashed unexpectedly. With two servers down at the same time, not all virtual machines could be moved to an active server, causing downtime for several customer environments.

The maintenance is completed and our servers are back online. We're still in the process of restoring the final customer environments.

Posted Jun 09, 2021 - 15:29 CEST

Investigating

We are currently experiencing problems on the UK1 hosting location. As a result your TOPdesk environment may not be available.

We are aware of the problem and are working on a solution.

Our apologies for the inconvenience. We aim to update this status page every 30 minutes until the issue has been resolved.

E-mail updates will be sent when the issue has been resolved. You can subscribe on the status page (https://status.topdesk.com) for additional updates.

To inform TOPdesk you are affected by this issue, please visit https://my.topdesk.com/tas/public/ssp/ . Please refer to incident TDR21 06 2694.

Posted Jun 09, 2021 - 14:54 CEST

This incident affected: UK1 SaaS hosting location.