SaaS disruption: UK1 environments unavailable
Incident Report for TOPdesk SaaS Status page
Postmortem

Summary
On the 23rd and 24th of August 2021, all customers with TOPdesk SaaS environments hosted in our UK1 hosting location experienced a series of intermittent disruptions that caused their TOPdesk environments to become unreachable. After almost two days of troubleshooting and investigation with engineers from both our Content Delivery Network (CDN) partner and our hosting provider, a faulty configuration was identified.
To mitigate the impact while working on a solution, a temporary alternate routing workaround was set up to circumvent this component. The workaround was stable and customers could again work in their TOPdesk environments while we continued our investigation into the root cause. After the proposed solution was tested, the faulty component was fixed and the temporary re-routing was reverted, putting a permanent solution to this problem in place.
Upon evaluation, several points of improvement have been identified in a number of areas to prevent this issue from occurring again and to improve the speed at which we can troubleshoot and resolve issues of this nature. These improvements include - but are not limited to - our information gathering and our communication processes, both with our hosting partners and with impacted customers.

Infrastructure:
To provide insight into this issue, the below diagram depicts how traffic flows to TOPdesk environments:

https://www.topdesk.com/wp-content/media/saas.uk1rout.png

Root cause:
Traffic to our UK1 data center first gets routed through our Content Delivery Network (CDN), from where it is directed towards the reverse proxy servers in the correct data center for the target TOPdesk environment. On its path, at the boundary between the Internet and the hosting infrastructure where our reverse proxies run, sits a physical firewall (a redundant pair). This firewall had a default SYN-flood limit of 1024 new connections per second configured. Neither our hosting provider nor the TOPdesk SaaS Operations team were aware that this SYN-flood limit existed in this firewall.

Due to the increase in traffic to the UK1 hosting location, this limit was first briefly reached on Friday August 20th, and again on August 23rd and 24th. The firewall temporarily blocked new connections until the connection rate dropped back below the threshold, after which it allowed connections again. By re-routing traffic via a different hosting location (and edge firewall) the traffic limit was circumvented and the issue no longer occurred. The issue was resolved by adjusting the SYN-flood limit configuration.
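
To illustrate the behaviour described above: the limit acts as a simple per-second cap on new (SYN) connections. The sketch below is not the hosting provider's actual firewall logic, which was not shared with us; it only models a fixed one-second window with a 1024-connection threshold, which produces exactly the kind of intermittent blocking we observed once traffic grows past the limit.

import time

SYN_LIMIT_PER_SECOND = 1024  # the default limit that was configured on the edge firewall

class SynFloodLimiter:
    """Toy model of a per-second limit on new connections (illustrative only)."""

    def __init__(self, limit=SYN_LIMIT_PER_SECOND):
        self.limit = limit
        self.window_start = time.monotonic()
        self.count = 0

    def allow_new_connection(self):
        now = time.monotonic()
        if now - self.window_start >= 1.0:
            # A new one-second window starts: reset the counter.
            self.window_start = now
            self.count = 0
        self.count += 1
        # Every new connection above the limit within the window is dropped, which is
        # why environments became unreachable in bursts while traffic was high.
        return self.count <= self.limit

limiter = SynFloodLimiter()
accepted = sum(limiter.allow_new_connection() for _ in range(2000))
print(f"{accepted} of 2000 connection attempts in one burst accepted")  # roughly the first 1024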

Note that the information communicated via our Status page on August 20th indicated the cause of this issue was most likely to lie with our CDN partner, and that we were seeking further information from them. This conclusion was reached because the monitoring information available to us at that time indicated an issue which was consistent with previous CDN-related disruptions.

Time line:
The below time line details the steps taken during this disruption. We keep track of all our actions to properly evaluate disruptions. In this time line we refer to the TOPdesk hosting team as SaaS Operations, the Content Delivery Network as CDN and the hosting provider as HP. All times are in Central European Summer Time / CEST (UTC +2).

Friday 20-08-2021
13:52 A drop in traffic to our reverse proxies on UK1 was noticed in the monitoring.
13:57 The SaaS Operations team starts an investigation, as traffic is not reaching our reverse proxies as expected.
14:01 A ticket is created with our CDN requesting further information.
14:12 Traffic to UK1 is restored.
15:11 CDN indicates that the disruption was on their side and the ticket was closed. TOPdesk requested an RCA and planned an evaluation.

Below is an image of the active connections to the UK1 reverse proxies on Friday:

https://www.topdesk.com/wp-content/media/saas.uk1activeconnections20.png

Monday 23-08-2021
9:00 TOPdesk SaaS monitoring indicated that multiple UK1 environments were not accessible. Among other things, our monitoring checks the availability of the environment host names, and these could not be resolved for UK1 environments (an illustrative sketch of such a check follows after this day's time line).
9:17 According to the CDN's status page, some re-routing is in place for their points of presence in the UK and Ireland. SaaS Operations confirms the environments are running but cannot be reached from outside the network.
9:22 The reverse proxy on UK1 drops noticeably in TCP connections.
9:28 SaaS Operations creates a support request for the CDN. Similar errors and symptoms are seen on our end.
9:49 The SaaS Operations team notices a diverging A record for one of the reverse proxies. This does not seem to be the issue at hand, because nothing indicates it has been changed in recent months, and the reverse proxy does show traffic at times.
10:06 The network to the data center seems to be operational but there is no traffic coming in on the reverse proxies.
10:09 The SaaS Operations team calls engineers from the CDN; an engineer will look into it and reply in the previously created ticket.
11:07 The CDN responds and picks up the ticket.
11:31 All connections on the reverse proxies are coming back and environments are reachable again.
12:55 The issue reoccurs and the CDN is called. They respond with steps to troubleshoot.
13:23 The SaaS Operations team contacts the UK1 HP to see if they can see anything on their end, in order to rule out an alternative root cause.
14:39 The SaaS Operations team contacts the development team responsible for the authentication service, and another call is made to the UK HP to investigate.
15:00 Environments seem to be available again.
15:23 The SaaS Operations team contacts the NL HP (same company, different branch) via Teams to inform them the issue is reoccurring. As yet, there has been no relevant response from the HP.
15:30 The situation seemed stable around 15:00, but now everything is down again. Contact with the CDN is picked up again.
15:45 The SaaS Operations team asks CDN to escalate the issue internally. Environments become available again.
16:05 TOPdesk again stresses the impact of the situation, as well as the fact that no reply has been received so far.
16:15 The SaaS Operations team notices 522 errors in one of the CDN dashboards (similar to Friday the 20th).
16:59 Traffic drops are detected in the CDN load balancer.
17:24 Environments are unavailable once more and the 522-errors previously seen increase rapidly.
17:48 Update from technical support engineer at CDN:
"Our escalation commented that this is an issue on your hosting provider, we were able to capture below as it occurred, the MTR makes it into their network then fails. Works fine from Datacenter Management which might point to some firewall behaviour, possibly rate limiting our IP ranges. We suggest you need to work with your hosting provider. "
18:00 HP reports they have performed route optimization, but couldn't find anything else.
18:07 The 522 errors are down to 0 again, environments are becoming available.
19:43 No more disruptions or updates so far; the SaaS Operations team will request an update in our ticket with the HP.
20:00 HP mentions that their network engineers were unable to find anything after troubleshooting the issue twice. They request that we contact CDN to perform a bi-directional MTR.
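
As mentioned at 9:00, our monitoring checks (among other things) whether each environment's host name resolves and whether the environment can be reached. The sketch below is only an illustration of such a check, not our actual monitoring tooling; the host name used is a placeholder.

import socket

def check_environment(hostname, port=443, timeout=5.0):
    """Resolve an environment's host name and try to open a TCP connection to it."""
    try:
        address = socket.gethostbyname(hostname)  # DNS resolution
        with socket.create_connection((address, port), timeout=timeout):
            return True  # reachable
    except OSError as error:
        print(f"{hostname}: unreachable ({error})")
        return False

# Placeholder target; a real check would use a customer environment's host name.
print(check_environment("example.com"))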

Below is an image of the active connections to the UK1 reverse proxies on Monday:

https://www.topdesk.com/wp-content/media/saas.uk1activeconnections23.png

Tuesday 24-8-2021
9:00 During a meeting with the SaaS Operations team, alternatives to the current routing and hosting are discussed. Team members are assigned to investigate and test possible workarounds should the problem reoccur.
9:10 Environments are unavailable again. According to the HP, they cannot find any issues.
9:45 The SaaS Operations team switches the routing for one of the UK1 containers (a group of customers using shared resources) to go via the NL3 firewall & proxy server to circumvent the UK1 reverse proxies. This change is not successful and is reverted.
9:50 The SaaS Operations team changes the firewall configuration to get the new route from NL3 to UK1 working. Monitoring is reviewed to ensure VPN traffic remains within the limits.
9:55 SaaS Operations team contacts both the HP and the CDN to set up a conference call at 11:00 CEST.
9:55 SaaS Operations team removes the UK1 proxies individually from the CDN in order to restart them.
10:30 The meeting with the CDN and HP is postponed until 13:00 CEST, so all parties can join the conference call.
10:35 The CDN engineer is speaking directly with an engineer from the HP. They will get back to TOPdesk SaaS for a call with more detailed information.
11:00 SaaS Operations team switches one UK1 container over to use a re-route through the NL3 data center to try the re-routing workaround again. This time it is successful.
11:20 Another UK1 container is switched to reroute through the NL3 data center.
11:35 A third container is switched to reroute through the NL3 data center.
12:50 A drop is seen in the reverse proxy monitoring.
13:00 The SaaS Operations team starts a conference call with engineers from both our CDN and HP.
13:30 The environments were reachable and no party could find any issue. Because all parties were now available, it was decided to revert the re-routing, so that if the issue occurred again, all teams could cooperate and gather the required data to troubleshoot the problem.
14:00 Re-routing is reverted for all instances to the original configuration.
14:25 The issue is occurring again. The CDN analytics dashboard shows origin timeouts again.
14:45 Outcome of the live investigation:
- All MTRs from our reverse proxies to the CDN and from the CDN to our proxies show 100% packet loss at one of the hops inside the HP infrastructure.
- The investigation therefore focuses on the HP infrastructure, specifically on the components in front of the TOPdesk reverse proxies.

14:50 The SaaS Operations team implements the re-routing workaround again for most containers in the UK1 data center.
15:05 The CDN analytics show requests are being handled properly again.
17:55 Received an answer from the HP that we are reaching SYN-flood limits on the firewall of the UK1 data center.

Below is an image of the active connections to the UK1 reverse proxies on Tuesday:

https://www.topdesk.com/wp-content/media/saas.uk1activeconnections24.png

Wednesday 25-8-2021
09:00 The SaaS Operations team creates a script to simulate traffic and reproduce the issue.
10:40 The SaaS Operations team asked the HP for the edge firewall configuration in the NL3 data center for comparison.
13:01 The HP answered about NL3: no SYN-flood limit is set there. This is expected, as the CDN also provides SYN-flood protection.
19:00 Re-routed the last containers through the workaround.
19:10 Reproduced the issue by flooding the firewall using the previously created scripts, while all customers were using the workaround and remained unaffected.

Thursday 26-8-2021
09:00 During the day the SaaS Operations team reproduced the issue again. We asked the HP to remove the SYN-flood configuration from the firewall, after which we could no longer reproduce the issue with the same test.
11:58 HP answered that the SYN-flood configuration has been removed.
21:00 The SaaS Operations team reverted the re-routing for nearly all the containers on the UK1 data center.

Friday 27-8-2021
09:00 A load test with increased traffic was performed on one of the proxy servers, to see if it could handle the potential extra load from the remaining re-routed environments. The reverse proxy could carry the load, and a secondary reverse proxy was available to step in if necessary.
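
The load-testing tooling itself was not published; purely as an illustration, a test of this kind could look like the sketch below: fire a number of concurrent HTTPS requests at a test target behind the proxy and report the success rate and average response time. The URL is a placeholder, not a real customer environment.

import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TEST_URL = "https://test-environment.example.net/"  # placeholder test target

def timed_request(url, timeout=10.0):
    """Return the response time of a single request, or None if it failed."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
        return time.monotonic() - start
    except OSError:
        return None

def load_test(url=TEST_URL, total_requests=500, concurrency=50):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        durations = list(pool.map(lambda _: timed_request(url), range(total_requests)))
    ok = [d for d in durations if d is not None]
    if ok:
        print(f"{len(ok)}/{total_requests} requests succeeded; "
              f"average response time {sum(ok) / len(ok):.3f}s")
    else:
        print(f"0/{total_requests} requests succeeded")

load_test()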

Monday 30-8-2021
13:00 The SaaS Operations team reverted re-routing for the last environments on the UK1 data center.

Troubleshooting delays:
Several times during the investigation, engineers were available to research the problem, only to find that it was not occurring at that moment. Data was stored during the disruptions so the issue could be investigated afterwards, but this data proved insufficient to pinpoint the root cause.

The firewall managed by the hosting provider activated SYN-flood protection at a certain level of traffic. Traffic to the UK1 data center had gradually increased over time and started to reach this limit. On the 23rd this became more prevalent and happened almost all the time. Neither the SaaS Operations team nor the hosting provider knew that this limit existed or that this configuration was present. Initial MTR tests (to check for connectivity issues) performed by our SaaS Operations team also did not pinpoint these machines, as the hops after our CDN were obscured. When the MTR test was done from our CDN, it finally gave the investigation a clear focus.

Having to retry these tests from different locations during a disruption increased the time it took us to get to the root cause of this problem. Communication between the hosting provider and the CDN went back and forth, with the SaaS Operations team in the middle, because it was unclear at which party the traffic was being dropped. Only after rigorous troubleshooting and analysis of several test results could TOPdesk and the CDN prove that the traffic was being dropped at the hosting provider.
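
The MTR tests mentioned above are straightforward to capture and archive. The sketch below is an illustrative example rather than the tooling we actually used: it assumes the standard mtr utility is installed and simply appends a timestamped report to a log file, so path data from the moment of a disruption can be compared afterwards. The target address is a placeholder.

import subprocess
from datetime import datetime, timezone

def capture_mtr(target, cycles=10, logfile="mtr-log.txt"):
    """Run mtr in report mode and append the timestamped output to a log file."""
    result = subprocess.run(
        ["mtr", "-r", "-n", "-c", str(cycles), target],  # report mode, no DNS lookups
        capture_output=True, text=True, check=False,
    )
    stamp = datetime.now(timezone.utc).isoformat()
    with open(logfile, "a") as log:
        log.write(f"=== {stamp} {target} ===\n{result.stdout}\n")
    return result.stdout

# Placeholder target (a documentation address), not one of the IPs involved in this incident.
print(capture_mtr("203.0.113.10"))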

Temporary solution:
During the outage the SaaS Operations team attempted multiple workarounds to keep TOPdesk environments workable. The temporary solution that was put in place re-routed the traffic through our NL3 hosting location. Since this data center has larger bandwidth capacity and data throughput, the hardware at this location was able to take the additional load from UK1. While this made the environments reachable, re-routing the traffic was not a suitable mid- or long-term solution because of the additional latency, the additional load on the NL3 hosting location, and the site-to-site VPN being a single point of failure in this temporary set-up.

Below is an image of the temporary solution:

https://www.topdesk.com/wp-content/media/saas.uk1rerout.png

Permanent solution:
Once the erroneous configuration was identified, we requested that our hosting provider remove the SYN-flood protection. The SYN-flood protection was not necessary at that point in our infrastructure, as flood protection is already handled by our CDN. We tested the configuration before and after the adjustment; afterwards it was no longer possible to reproduce the issue.
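
The reproduction scripts referenced in the time line were not published. Purely as a sketch of the idea: open a burst of new TCP connections against a test target behind the UK1 edge firewall and compare the failure rate before and after the SYN-flood limit was removed. In practice this would need enough parallelism (or several machines) to exceed 1024 new connections per second; the host name below is a placeholder, and such a test should only ever be run against infrastructure you operate yourself.

import socket
import time
from concurrent.futures import ThreadPoolExecutor

def try_connect(host, port=443, timeout=2.0):
    """Attempt a single new TCP connection and report whether it succeeded."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def connection_burst(host, attempts=2000, workers=200):
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda _: try_connect(host), range(attempts)))
    elapsed = time.monotonic() - start
    print(f"{sum(results)}/{attempts} connections succeeded in {elapsed:.1f}s "
          f"({attempts / elapsed:.0f} attempts per second)")

# Placeholder target: only ever run a test like this against systems you operate yourself.
connection_burst("test-proxy.example.net")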

With confidence about the root cause we removed the re-routing solution within our maintenance window. Furthermore, we verified that no similar limits were configured in other data centers hosting TOPdesk SaaS services.

FAQ:

Were the outages on Friday the 20th of August the same problem?
Yes. On the Friday before the disruption we noticed 20 minutes of similar unreliability of the UK1 data center. At the time it looked like the CDN provider had experienced a hiccup in the routing of the traffic. The cause seemed to be on the CDN side, and the CDN confirmed that an incident had occurred on their side at that time. We had already planned to evaluate this issue in the following week when the new disruptions emerged. After double-checking the timing, we noticed the firewall issue with our hosting provider could also be found in the logs of Friday the 20th, confirming that it was the same issue.

Why did the investigation take so long?
The main reason for the lengthy resolution time was the uncertainty over where exactly the traffic was being dropped. The intermittent nature of the issue complicated the investigation further, as there was no fixed pattern to when traffic flowed successfully and when it was blocked.

Because the issue did not occur outside office hours, and the hosting provider executed a change at the end of the workday, we were not sure if the issue would reoccur and were unable to continue our investigation at night.

With no traffic reaching our systems, we first contacted our CDN partner, because that is where traffic enters the route to the TOPdesk SaaS servers. While there was an ongoing issue at the CDN, it was not related to the issues we experienced. Although there was no confirmed disruption at the CDN affecting us, previous disruptions did point to their infrastructure, so we asked them to investigate. Direct contact proved difficult to establish, as our known escalation contacts had left the company or were unavailable. We have evaluated these communication issues with the CDN, and new contact points and communication procedures have been established.

Another cause for delay was that neither the hosting provider nor the SaaS Operations team was aware that a firewall was configured with SYN-flood protection. It took some time for the hosting provider to identify the component that caused this disruption. The limit was not known to the SaaS Operations team because it was part of the infrastructure from before the CDN was taken into use. With the CDN in use, this SYN-flood protection had become superfluous.

An evaluation with the hosting provider is scheduled, in which we will review our communication procedures and make sure that a complete infrastructure overview, with monitoring on all relevant components and limits, is available to quickly troubleshoot future disruptions.

Why was the faulty component not redundant?
The problematic component was the edge firewall within the hosting provider's network. This component has a failover partner and is redundant, but the component did not fail. It blocked traffic after a certain threshold was reached, as it was configured to do.

What is TOPdesk doing to prevent this from happening again?
While the root cause has been resolved, multiple points of improvement have been identified and addressed. These include:
- Checking all locations hosting TOPdesk instances for similar SYN-flood limits.
- Re-establishing direct communication lines with technicians and clear escalation paths for communicating with the CDN and hosting provider.
- Improving options for communicating with our suppliers directly from our internal investigation incidents, to aid investigation speed and knowledge sharing.
- Updating procedures for updates on our Status page, both for incident communication and for announcing (emergency) maintenance.

In addition, several follow-up actions have been identified to further improve our reliability and troubleshooting speed. These actions are listed below.

Follow-up actions
Internally:
- Evaluate the communication channels and resources used for a disruption of this size.
- Create a system to automatically and periodically capture the right network traffic logs (MTR reports), to ensure this information is directly available when a similar disruption occurs.

Externally:
Plan a meeting with our hosting provider to establish:
- Why the configuration of this piece of hardware was not clear.
- Why SYN-flood protection was in place while the CDN already protects the network from these attacks.
- Why it took so long to identify the problem, and whether our setup differs from the standard setup.
- Who will monitor the incoming connections and the activation of flood protection (an illustrative monitoring sketch follows below).

Plan a meeting with our CDN to establish:
- Whether we can create a tracing system to quickly determine which component is dropping traffic.
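
Regarding the monitoring question above: one simple safeguard is to watch the rate of new incoming connections and warn well before the limit that a firewall (or any other component) would enforce. The sketch below is illustrative only; new_connections_last_second() and alert() are placeholders for whatever metric source and alerting channel are actually available.

import random
import time

SYN_LIMIT = 1024                # the limit formerly configured on the UK1 edge firewall
WARN_AT = int(SYN_LIMIT * 0.8)  # warn at 80% so there is time to act

def new_connections_last_second():
    # Placeholder: in a real deployment this would read firewall, reverse proxy
    # or CDN metrics; here a random value is returned purely for demonstration.
    return random.randint(0, 1400)

def alert(message):
    print(f"ALERT: {message}")  # placeholder for the real alerting channel

def watch(poll_interval=1.0, iterations=60):
    for _ in range(iterations):
        rate = new_connections_last_second()
        if rate >= WARN_AT:
            alert(f"new-connection rate of {rate}/s is close to the configured limit of {SYN_LIMIT}/s")
        time.sleep(poll_interval)

watch(iterations=5)  # in practice this would run continuously as part of the monitoring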

The impact
We fully recognize the impact this disruption has had on our UK1 customers and apologize for this incident. The amount of downtime experienced on August 23rd in particular is not acceptable and, as detailed above, we have multiple points of improvement to action.

If you have further questions regarding this issue, please contact your Account Manager or our Customer Success team. In the meantime, further comment from our senior management team on recent disruptions in UK1 will be published in due course.

Once again, we apologize for the disruption this issue has caused.

Posted Sep 08, 2021 - 12:17 CEST

Resolved
The root cause of this problem has been identified and resolved. During a load test we confirmed the change works as expected, including under high load.

We've announced maintenance to revert the temporary routing changes. See https://status.topdesk.com/incidents/19tdkxlc5pvj

We aim to publish a root cause analysis with follow-up measures on this status page before September 10th.
Posted Aug 26, 2021 - 16:34 CEST
Update
We have been monitoring the situation on the UK1 data center today and we have not seen the issue reoccurring.

For now, the temporary re-route will remain in place until a permanent solution is applied. Working in cooperation with our hosting provider, we believe that we have identified the faulty component. To verify this, we will continue testing tonight. This will ensure the customer impact is as minimal as possible.

Once we have successfully verified the correct component, a proper solution will be put in place. This solution will then once more be validated using similar tests, so we can be absolutely certain customer experience will not be impacted once the re-route is undone.

In the meantime our Operations team will continue to monitor the current situation.

Once we're confident we have a permanent solution in place, we will then evaluate what happened during this disruption and publish a root cause analysis.
Posted Aug 25, 2021 - 17:12 CEST
Update
Today (24/8) and yesterday (23/8) we have received multiple questions from customers regarding the recent SaaS disruptions experienced by those hosted in our UK1 location. The following details the most frequently asked questions submitted to TOPdesk Support:

What can you tell us about this issue, and what's the current status?

At various points in time over the past 48 hours, beginning Monday 23/8 at approximately 08:00 GMT, our monitoring has shown significant, unexplained drop-offs in traffic routed through our UK hosting partners to otherwise healthy services. TOPdesk's infrastructure has remained available and reachable from within the TOPdesk network, indicating that traffic was being prevented from reaching our network. In addition to beginning our own major incident process, we logged high-priority incidents with two of our key hosting partners: Cloudflare (our CDN partner) and Leaseweb (our hosting provider), and all three parties have been cooperating closely to identify the root cause and long-term resolution to this issue.

We have implemented a re-routing solution via our Netherlands (NL) -based proxy servers and this appears to have brought stability. At the time of writing (24/8), our monitoring indicates that normal levels of traffic have been reaching TOPdesk since approximately 11:00 GMT, with the exception of one period of disruption detailed later. The best information currently available indicates that this disruption has been caused by a specific piece of hardware at the hosting provider, and we will keep this re-routing workaround in place for the time being to minimize the risk of further disruption. We're closely monitoring the systems that are used for re-routing the traffic to ensure this workaround continues to work, and whilst a permanent solution is still in progress, engineers at all parties involved are on stand-by to quickly resolve any issues that may occur.

I asked multiple times for an ETA on the fix. Why were you unable to provide this?

This incident was a very complex networking issue originating from outside the TOPdesk network, and consequently we were only able to pass on information provided by our suppliers. We are usually able to pinpoint specific metrics to provide clues on where the issue lies, but in this case there was nothing to indicate the ultimate probable cause (hardware failure) in the monitoring of ourselves, Cloudflare or Leaseweb. This issue was unprecedented in terms of its cause, the lack of information we were able to gather from the monitoring, and the disruption caused, and its intermittent nature hampered the troubleshooting efforts further still. The combination of these factors led to a situation where for a long period of time it was not clear what the root cause was, and without an idea of the cause it was not possible for us to provide an ETA on the fix.

I updated my support ticket with no response. Why was this?

Although our UK, SaaS and international Support teams did their best to respond to individual concerns and customer questions, this incident was unprecedented in its scope, length of disruption and ultimately the sheer volume of updates, interactions and information requests. This led to a situation where it was simply unfeasible to respond to all queries individually, and some customers' requests for information went unanswered as a result. This is regrettable, but please note that the proper channel for updates to customers impacted by major disruptions such as this is status.topdesk.com, which you can subscribe to for updates. This is the most scalable way for us to provide frequent updates to multiple customers and will continue to be our primary source of information relating to SaaS disruptions.

When you achieved stability routing traffic via NL, why did you revert back to the original (faulty) setup?

At the time we reverted back to the original network configuration (approx. 13:40 GMT 24/8), both we and our hosting provider had made configuration changes and we did not know for sure whether this would result in further disruption. Ultimately, we needed to test if the configuration changes would bring long-term stability, which turned out not to be the case. We have received feedback that this shouldn't have been done during the afternoon, and we accept that point; however, it was felt that we could reach a long-term resolution to this issue more quickly by testing sooner. Another question we received was why this couldn't have been tested in a different, less potentially disruptive way. Although we are proficient with performance testing, it was not possible to test this properly, as the load would have been a simulated load which cannot guarantee the same results. It should be noted that we reverted back to the re-routing solution quickly to restore service, and the information gathered during this process directly led to what appears to be the root cause being identified.

The original communication on August 23rd indicated an issue with the CDN. Why was this?

During the early stages of this investigation, the characteristics we were seeing in the monitoring were consistent with previous CDN-related issues. Leaseweb initially supported this conclusion after checking their own metrics. Once new information became available, we were able to communicate these updates to our customer base.

Are you aware of how disruptive this is / has been for us? Is senior management involved?

We are absolutely aware of the problems this has caused for our customer base. Once we and Leaseweb are able to fully confirm that the root cause of this issue has been fixed, we will be conducting a full evaluation in collaboration with our hosting partners and further information will be published. Multiple stakeholders - including senior management - are aware of this issue and we will be working together moving forward to be as transparent as possible with customers.

What are the next steps?

Short term, we will continue to monitor the NL-rerouting workaround we have in place until Leaseweb is able to confirm the cause of this disruption. When we have confidence in reverting to the original routing setup we will make this adjustment in a maintenance window and customers will be notified of this change. We will respond to outstanding individual customer questions and publish further information including a statement explaining this disruption and Root Cause Analysis (including preventative measures) at the earliest opportunity. Should you wish to discuss the way we handle this disruption further after this information has been published, we kindly refer you to your Account Manager.
Posted Aug 24, 2021 - 20:07 CEST
Update
We are continuing to monitor for any further issues.
Posted Aug 24, 2021 - 17:14 CEST
Monitoring
During the last instance of the disruption we were able to pinpoint a specific piece of hardware at the hosting provider that is likely causing this disruption. We'll keep the re-routing in place while the hosting provider investigates and implements a solution for this issue.

We're closely monitoring the systems that are used for re-routing the traffic to ensure this workaround continues to work. While a permanent solution is in progress, engineers at all parties involved are on stand-by to quickly resolve any issues that may occur.

When the fix is implemented by the hosting provider, we'll announce reverting the routing to the original state on our status page.
Posted Aug 24, 2021 - 16:45 CEST
Update
We started re-routing traffic again and we see connectivity returned to normal levels. We've stored a lot of data during the disruption and will analyze this with all parties involved to find and resolve the root cause.
Posted Aug 24, 2021 - 15:09 CEST
Update
We're still gathering data with all parties involved before re-enabling the re-routing to mitigate the impact.
Posted Aug 24, 2021 - 14:58 CEST
Investigating
We are currently investigating this issue.
Posted Aug 24, 2021 - 14:49 CEST
Update
Several TOPdesk environments are again unreachable. We are aware of the problem and are coordinating with all parties involved to store all data needed to investigate the issue and mitigate the impact as soon as possible.
Posted Aug 24, 2021 - 14:45 CEST
Update
While implementing the re-routing of traffic for the UK1 hosting location, changes were also made by our hosting provider in an effort to resolve the problem. To ensure the changes by the hosting provider are a permanent solution to this problem, we have reverted our re-routing changes.

So far we have not seen the issue re-occur since implementing the changes, indicating the issue was fixed by the hosting provider. We will continue to monitor the situation, and remain in direct communication with our CDN provider and the hosting provider. If the issues occur again we will take immediate action to avoid any further impact.

When we're certain the issue is resolved this major incident will be marked as closed. A root cause analysis will be published on our status page afterwards.
Posted Aug 24, 2021 - 14:29 CEST
Update
So far we've re-routed half of the traffic to the UK1 hosting location. Since we've started doing this the situation has stabilized and we haven't seen any big disruptions.

This re-routing is considered a temporary workaround. We're still working with the CDN provider and the hosting provider to find and resolve the root cause of this problem.
Posted Aug 24, 2021 - 13:09 CEST
Update
Our initial tests with re-routing traffic are working as intended and we're now re-routing some of the production traffic to mitigate the impact of this disruption. Environments with re-routed traffic were reachable and responsive at a time when other environments were experiencing delays. We'll continue to re-route more traffic during the day while closely monitoring the load on systems involved, ensuring fewer customers are affected should the issue reoccur.

Note that this re-routing is considered a temporary workaround. We're still working with the CDN provider and the hosting provider to find and resolve the root cause of this problem.
Posted Aug 24, 2021 - 11:28 CEST
Update
Similar intermittent connection issues are again occurring today. Both the CDN provider and the hosting provider are involved in the investigation. This issue has a high priority with all parties involved. We're coordinating all efforts and we're doing what we can to find and resolve the root cause.

Meanwhile, another group is investigating and testing alternatives to circumvent the problem. The CDN provider is an integral part of our service and is hard to replace or circumvent, because we also use other services of theirs in addition to their routing. However, we're working on re-routing traffic, and are taking servers out of our production pool and restarting them to test if that mitigates the problem.

We'll try to keep you updated on our progress on a regular basis, and will post an update with answers to frequently asked questions later today. We know this has a high impact on your services and are doing what we can to resolve this as soon as possible.

On a side note: if you'd like to communicate with us about this disruption, please don't send an e-mail with the major incident number (TDR21 08 4950) in the subject. This disrupts our processes, makes it harder to keep this page updated, and makes it difficult for us to contact you. Instead, you can contact us via My TOPdesk, use a different subject line, or call our Support team.
Posted Aug 24, 2021 - 10:19 CEST
Update
Several TOPdesk environments are again unreachable. We are aware of the problem and are working on a solution with all parties involved.
Posted Aug 24, 2021 - 09:14 CEST
Update
The CDN provider has indicated that the cause of the problem is further down the line with the hosting provider.

We have tickets open with both suppliers and we will continue to investigate.
Posted Aug 23, 2021 - 20:13 CEST
Update
The environments in the UK1 data center are reachable again.

We continue to investigate with our CDN provider and with the hosting provider to find the cause of the disruptions.
Posted Aug 23, 2021 - 18:19 CEST
Update
We have received further reports of disruptions. We have informed Cloudflare and are working on a solution.
Posted Aug 23, 2021 - 17:35 CEST
Update
We are currently in the process of collating information gathered, and will issue a response to some frequently asked questions via this major incident and our status page before the end of the day tomorrow, 24/8.
Posted Aug 23, 2021 - 16:58 CEST
Update
Operators at TOPdesk and at the CDN provider are still working to ensure the root cause of this problem is resolved as soon as possible. We will keep the incident in My TOPdesk open and published so you can easily report any problems you might experience.

While we aim to provide regular updates on ongoing disruptions, the status page and this major incident will not be updated until tomorrow, unless there's a change in the availability of SaaS environments.
Posted Aug 23, 2021 - 16:40 CEST
Update
We are continuing to monitor for any further issues.
Posted Aug 23, 2021 - 16:23 CEST
Update
All TOPdesk environments in the UK1 hosting location are back online. The CDN provider did not provide details regarding a fix or a root cause analysis, so we're not sure if environments will continue to be available.

We're still working with the CDN provider to ensure the root cause of this problem is resolved as soon as possible.
Posted Aug 23, 2021 - 15:51 CEST
Update
Several TOPdesk environments are again unreachable. We are aware of the problem and are working on a solution with the CDN provider.
Posted Aug 23, 2021 - 15:21 CEST
Update
All TOPdesk environments in the UK1 hosting location are back online. We haven't received any details regarding a fix or a root cause analysis, so we're not sure if environments will continue to be available.

We're still working with the CDN provider to ensure the root cause of this problem is resolved as soon as possible.
Posted Aug 23, 2021 - 14:28 CEST
Update
We're still working to resolve the situation with the highest priority.
Posted Aug 23, 2021 - 13:53 CEST
Update
We confirmed that the same issue is occurring as this morning. We're working with the CDN provider to resolve the situation.
Posted Aug 23, 2021 - 13:08 CEST
Update
Several TOPdesk environments are again unreachable. We are aware of the problem and are working on a solution.
Posted Aug 23, 2021 - 12:56 CEST
Monitoring
All TOPdesk environments in the UK1 hosting location are back online. We're working with the CDN provider to ensure the root cause of this problem is resolved as soon as possible.
Posted Aug 23, 2021 - 11:54 CEST
Update
We're still working with the CDN provider to resolve this issue. We have made sure the CDN provider is aware of the impact on our customers and we're making sure they continue to work on this with the highest priority.
Posted Aug 23, 2021 - 11:14 CEST
Update
The CDN provider is still working on this issue with a high priority. We'll update you when more information is available.
Posted Aug 23, 2021 - 10:17 CEST
Update
This issue appears to be caused by the same problem as we saw last Friday and on July 30th. We're working with the Content Delivery Network provider to make sure the root cause is analyzed and resolved promptly.
Posted Aug 23, 2021 - 09:34 CEST
Investigating
We are currently experiencing problems on our UK1 hosting location. As a result your TOPdesk environment may not be available.

We are aware of the problem and are working on a solution.

Our apologies for the inconvenience. We aim to update this status page every 30 minutes until the issue has been resolved.

E-mail updates will be sent when the issue has been resolved. You can subscribe on the status page (https://status.topdesk.com) for additional updates.

To inform TOPdesk you are affected by this issue, please visit https://my.topdesk.com/tas/public/ssp/ . Please refer to incident TDR21 08 4950.
Posted Aug 23, 2021 - 09:14 CEST
This incident affected: UK1 SaaS hosting location.