Summary
On the 23rd and 24th of August 2021, all customers with TOPdesk SaaS environments hosted in our UK1 hosting location experienced a series of intermittent disruptions that caused their TOPdesk environment to become unreachable. After almost 2 days of troubleshooting and investigating with engineers from both our Content Delivery Network (CDN) partner and our hosting provider, a faulty configuration was identified.
To mitigate the impact while working on a solution, a temporary alternate routing workaround was set up to circumvent the affected component. The workaround was stable and customers could work in their TOPdesk environment again while we continued our investigation into the root cause. After testing the proposed solution, the faulty configuration was corrected and the temporary re-routing was reverted, providing a permanent solution to this problem.
Upon evaluation, several points of improvement have been identified in a number of areas, both to prevent this issue from occurring again and to improve the speed at which we can troubleshoot and resolve issues of this nature moving forward. These improvements include, but are not limited to, our information gathering and communication processes, both with our hosting partners and towards impacted customers.
Infrastructure:
To provide insight into this issue, the diagram below depicts how traffic flows to TOPdesk environments:
https://www.topdesk.com/wp-content/media/saas.uk1rout.png
Root cause:
Traffic to our UK1 data center is first routed through our Content Delivery Network (CDN), from where it is directed towards the reverse proxy servers in the correct data center for the target TOPdesk environment. On its path, at the boundary between the Internet and the hosting infrastructure where our reverse proxies run, sits a physical firewall (a redundant pair). This firewall had a default SYN-flood limit of 1024 new connections per second configured. Neither our hosting provider nor the TOPdesk SaaS Operations team were aware of the existence of a SYN-flood limit in this firewall.
Due to the increase in traffic to the UK1 hosting location, this limit was first briefly reached on Friday August 20th, and again on August 23rd and 24th. The firewall then blocked new connections until the connection rate dropped back below a threshold at which it would allow connections again. By re-routing traffic via a different hosting location (and edge firewall), the traffic limit was circumvented and the issue no longer occurred. The issue was resolved by adjusting the SYN-flood limit configuration.
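To make this mechanism more concrete, the sketch below illustrates one way of watching the rate of new inbound TCP connections on a Linux host, such as a reverse proxy, and warning when it approaches a SYN-flood threshold. This is a minimal illustration only: the 1024 connections per second value simply mirrors the default limit described above, and the script is not part of the actual firewall or monitoring configuration.

#!/usr/bin/env python3
# Minimal sketch (illustrative only): sample the Linux TCP PassiveOpens counter
# once per second and warn when the rate of new inbound connections approaches
# a SYN-flood limit. The threshold mirrors the 1024/s default described above.
import time

SYN_FLOOD_LIMIT = 1024   # new connections per second (illustrative threshold)
WARN_RATIO = 0.8         # warn at 80% of the limit

def passive_opens() -> int:
    # PassiveOpens is the cumulative count of inbound TCP connection openings.
    with open("/proc/net/snmp") as snmp:
        tcp_lines = [line.split() for line in snmp if line.startswith("Tcp:")]
    header, values = tcp_lines[0], tcp_lines[1]
    return int(values[header.index("PassiveOpens")])

previous = passive_opens()
while True:
    time.sleep(1)
    current = passive_opens()
    rate = current - previous   # new inbound connections in the last second
    previous = current
    if rate >= WARN_RATIO * SYN_FLOOD_LIMIT:
        print(f"WARNING: {rate} new connections/s, close to the {SYN_FLOOD_LIMIT}/s limit")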
Note that the information communicated via our Status page on August 20th indicated the cause of this issue was most likely to lie with our CDN partner, and that we were seeking further information from them. This conclusion was reached because the monitoring information available to us at that time indicated an issue which was consistent with previous CDN-related disruptions.
Timeline:
The timeline below details the steps taken during this disruption. We keep track of all our actions to properly evaluate disruptions. In this timeline we refer to the TOPdesk hosting team as SaaS Operations, the Content Delivery Network as CDN, and the hosting provider as HP. All times are in Central European Summer Time / CEST (UTC +2).
Friday 20-08-2021
13:52 A drop in traffic to our reverse proxies on UK1 was noticed in the monitoring.
13:57 The SaaS Operations team starts an investigation, as traffic is not reaching our reverse proxies as expected.
14:01 A ticket is created with our CDN requesting further information.
14:12 Traffic to UK1 is restored.
15:11 CDN indicates that the disruption was on their side and the ticket was closed. TOPdesk requested an RCA and planned an evaluation.
Below is an image of the active connections to the UK1 reverse proxies on Friday:
https://www.topdesk.com/wp-content/media/saas.uk1activeconnections20.png
Monday 23-08-2021
9:00 TOPdesk SaaS monitoring indicated that multiple UK1 environments were not accessible. Our monitoring checks, amongst other things, the availability of the host names, and these could not be resolved for UK1 environments.
9:17 According to the CDN's status page, some re-routing is in place for their points of presence in the UK and Ireland. SaaS Operations confirms the environments are running but cannot be reached from outside the network.
9:22 The number of TCP connections to the UK1 reverse proxy drops noticeably.
9:28 SaaS Operations creates a support request for the CDN. Similar errors and symptoms are seen on our end.
9:49 The SaaS Operations team noticed a diverging A record for one of the reverse proxies. This doesn't seem to be the issue at hand, because nothing indicates it has been changed in recent months, and the reverse proxy does show traffic at times.
10:06 The network to the data center seems to be operational but there is no traffic coming in on the reverse proxies.
10:09 The SaaS Operations team had a call with engineers from the CDN; an engineer would look into it and reply in the previously created ticket.
11:07 The CDN responds and picks up the ticket.
11:31 All connections on the reverse proxies are coming back and environments are reachable again.
12:55 The issue re-occurs and the CDN is called. They respond with steps to troubleshoot.
13:23 SaaS Operations team contacted the UK1 HP, to see if they could see anything on their end to rule out an alternative root cause.
14:39 The SaaS Operations team contacted the development team responsible for the authentication service, and another call was made to the UK HP to investigate.
15:00 Environments seem to be available again.
15:23 The SaaS Operations team contacted the NL HP (same company, different branch) via Teams to inform them the issue is reoccurring. As yet, there has been no relevant response from the HP.
15:30 The situation seemed stable around 15:00, but now everything is down again. Contact with the CDN is picked up again.
15:45 The SaaS Operations team asks CDN to escalate the issue internally. Environments become available again.
16:05 TOPdesk again stresses the impact of the situation, as well as the fact that no reply has been received so far.
16:15 The SaaS Operations team notices 522 errors in one of the CDN dashboards (similar to Friday the 20th).
16:59 Traffic drops are detected in the CDN load balancer.
17:24 Environments are unavailable once more and the 522-errors previously seen increase rapidly.
17:48 Update from technical support engineer at CDN:
"Our escalation commented that this is an issue on your hosting provider, we were able to capture below as it occurred, the MTR makes it into their network then fails. Works fine from Datacenter Management which might point to some firewall behaviour, possibly rate limiting our IP ranges. We suggest you need to work with your hosting provider. "
18:00 HP reports they have performed route optimization, but couldn't find anything else.
18:07 The 522 errors are down to zero again and environments are becoming available.
19:43 No more disruptions or updates so far; the SaaS Operations team will request an update in our ticket with the HP.
20:00 HP mentions that their network engineers were unable to find anything after troubleshooting the issue twice. They request that we contact CDN to perform a bi-directional MTR.
Below is an image of the active connections to the UK1 reverse proxies on Monday:
https://www.topdesk.com/wp-content/media/saas.uk1activeconnections23.png
Tuesday 24-08-2021
9:00 During a meeting with the SaaS Operations team alternatives to the current routing and hosting are discussed. Team members are assigned to investigate and test possible workarounds should the problem reoccur.
9:10 Environments are unavailable again. According to the HP, they can't find any issues.
9:45 The SaaS Operations team switches the routing for one of the UK1 containers (a group of customers using shared resources) to go via the NL3 firewall & proxy server, circumventing the UK1 reverse proxies. This change is not successful and is reverted.
9:50 The SaaS Operations team changes the firewall configuration to get the new route from NL3 to UK1 working. Monitoring is reviewed to ensure VPN traffic remains within the limits.
9:55 SaaS Operations team contacts both the HP and the CDN to set up a conference call at 11:00 CEST.
9:55 SaaS Operations team removes the UK1 proxies individually from the CDN in order to restart them.
10:30 The meeting with the CDN and HP is postponed until 13:00 CEST, so all parties can join the conference call.
10:35 The CDN engineer is speaking directly with an engineer from the HP. They will get back to TOPdesk SaaS for a call with more detailed information.
11:00 SaaS Operations team switches one UK1 container over to use a re-route through the NL3 data center to try the re-routing workaround again. This time it is successful.
11:20 Another UK1 container is switched to reroute through the NL3 data center.
11:35 A third container is switched to reroute through the NL3 data center.
12:50 A drop is seen in the reverse proxy monitoring.
13:00 The SaaS Operations team starts a conference call with engineers from both our CDN and HP.
13:30 The environments were reachable and no party could find any issue. Because all parties were now available, it was decided to revert the re-routing, so that in case the issue occurred again, all teams could cooperate and gather the required data to troubleshoot the problem.
14:00 Re-routing is reverted for all instances to the original configuration.
14:25 The issue is occurring again. The CDN analytics dashboard shows origin timeouts again.
14:45 Outcome of live investigation:
All MTRs from our reverse proxies to the CDN, and from the CDN to our proxies, show 100% packet loss at one of the hops inside the HP infrastructure.
The investigation now focuses on the HP infrastructure, specifically on the components in front of the TOPdesk reverse proxies.
14:50 The SaaS Operations team implements the re-routing workaround again for most containers on the UK1 data center.
15:05 The CDN analytics shows requests are being handled properly again.
17:55 Received an answer from the HP that we are reaching SYN-flood limits on the firewall of the UK1 data center.
Below is an image of the active connections to the UK1 reverse proxies on Tuesday:
https://www.topdesk.com/wp-content/media/saas.uk1activeconnections24.png
Wednesday 25-08-2021
09:00 The SaaS Operations team creates a script to simulate traffic and reproduce the issue.
10:40 The SaaS Operations team asked the HP for the edge firewall configuration in the NL3 data center for comparison.
13:01 The HP answered about NL3: there is no SYN-flood limit set. This is expected, as the CDN also provides SYN-flood protection.
19:00 Re-routed the last containers through the workaround.
19:10 Reproduced the issue by flooding the firewall with the previously created scripts, while all customers were using the workaround and remained unaffected.
Thursday 26-08-2021
09:00 During the day, the SaaS Operations team reproduced the issue again. We asked the HP to remove the flooding configuration from the firewall, after which we could no longer reproduce the issue with the same test.
11:58 HP answered that the SYN-flood configuration has been removed.
21:00 The SaaS Operations team reverted the re-routing for nearly all the containers on the UK1 data center.
Friday 27-08-2021
09:00 An increased load test was performed on one of the proxy servers to see if it could handle the potential extra load from the remaining re-routed environments. The reverse proxy could carry the load, and a secondary reverse proxy was available to step in if necessary.
Monday 30-08-2021
13:00 The SaaS Operations team reverted re-routing for the last environments on the UK1 data center.
Troubleshooting delays:
At several points during the investigation, engineers were available to research the problem, only to find that it was not occurring at that time. Data was captured during the disruptions so the issue could be investigated, but it proved insufficient to pinpoint the root cause.
The firewall managed by the hosting provider activated SYN-flood protection at a certain level of traffic. The traffic to the UK1 data center had gradually increased over time and started to reach this limit. On the 23rd this became more frequent, happening almost continuously. Neither the SaaS Operations team nor the hosting provider knew that this limit existed or that this configuration was present. Initial MTR tests (to check for connectivity issues) performed by our SaaS Operations team also did not pinpoint these machines, as the path beyond our CDN was obscured. When the MTR test was run from our CDN, it finally gave the investigation a clear focus.
Having to retry these tests from different locations during a disruption increased the time it took us to get to the root cause of this problem. Communication between the hosting provider and the CDN went back and forth, with the SaaS Operations team in the middle, because it was unclear at which party the traffic was being dropped. Only after rigorous troubleshooting and analysis of several test results could TOPdesk and the CDN prove that the traffic was being dropped at the hosting provider.
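As an illustration of the kind of path check involved, the sketch below wraps the mtr command in report mode and stores the per-hop output with a timestamp, so that packet-loss data is captured while a disruption is ongoing. The hostname is a placeholder, not an actual TOPdesk reverse proxy, and this is not the exact tooling used during the investigation.

#!/usr/bin/env python3
# Illustrative sketch: capture an mtr report towards an origin host so per-hop
# packet loss is recorded while a disruption is ongoing. The hostname below is
# a placeholder; this is not the exact tooling used during the investigation.
import subprocess
from datetime import datetime, timezone

ORIGIN = "uk1-proxy.example.net"   # placeholder for a reverse proxy hostname

def capture_mtr(target: str) -> str:
    # Run mtr with 10 probe cycles and numeric (no DNS) output, return the report.
    result = subprocess.run(
        ["mtr", "--report", "--report-cycles", "10", "--no-dns", target],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    report = capture_mtr(ORIGIN)
    with open(f"mtr-{ORIGIN}-{stamp}.txt", "w") as log:
        log.write(report)
    print(report)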
Temporary solution:
During the outage the SaaS Operations team attempted multiple workarounds to restore working TOPdesk environments. The temporary solution which was put in place re-routed the traffic through our NL3 hosting location. Since this data center has a larger bandwidth capacity and data throughput, the hardware at this location was able to take the additional load from UK1. While this made the environments reachable, re-routing the traffic was not a suitable mid- or long-term solution because of the additional latency, the additional load on the NL3 hosting location, and the site-to-site VPN being a single point of failure in this temporary set-up.
Below is an image of the temporary solution:
https://www.topdesk.com/wp-content/media/saas.uk1rerout.png
Permanent solution:
Once the erroneous configuration was identified, we asked our hosting provider to remove the SYN-flood protection. The SYN-flood protection was not necessary at that point in our infrastructure, as flood protection is already handled at our CDN. We tested the configuration before and after the adjustment; afterwards it was no longer possible to reproduce the issue.
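For illustration, a traffic-simulation test of this kind could look like the sketch below, which opens bursts of new TCP connections against a target so that a SYN-rate limit, if present, is triggered. The hostname, port, rate and duration are placeholders, and this is not the actual script used by the SaaS Operations team.

#!/usr/bin/env python3
# Illustrative sketch of a traffic-simulation test: open many new TCP connections
# per second so that a SYN-rate limit, if present, is triggered. The target,
# port, rate and duration are placeholders, not the values from the real test.
import socket
import time

TARGET_HOST = "proxy.example.net"   # placeholder, not a real TOPdesk host
TARGET_PORT = 443
RATE = 1500                         # attempted new connections per second
DURATION = 30                       # seconds; a real test would open connections concurrently

failures = 0
for second in range(DURATION):
    start = time.time()
    for _ in range(RATE):
        try:
            conn = socket.create_connection((TARGET_HOST, TARGET_PORT), timeout=1)
            conn.close()
        except OSError:
            failures += 1           # refused or timed-out connection: limit likely hit
    time.sleep(max(0.0, 1.0 - (time.time() - start)))
    print(f"second {second + 1}: {failures} failed connections so far")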
Confident about the root cause, we removed the re-routing solution within our maintenance window. Furthermore, we verified that no similar limits were configured in other data centers hosting TOPdesk SaaS services.
FAQ:
Were the outages on Friday the 20th of August the same problem?
Yes. On the Friday before the disruption we noticed 20 minutes of similar unreliability of the UK1 data center. At the time it looked like the CDN provider had experienced a hiccup in the routing of the traffic. The cause seemed to be on the CDN side, and the CDN confirmed that an incident had occurred on their side at that time. We had already planned to evaluate this issue in the following week when the new disruptions emerged. After double-checking the timing, we noticed that the firewall issue at our hosting provider could also be found in the logs of Friday the 20th, confirming that it was the same issue.
Why did the investigation take so long?
The main reason for the lengthy resolution time was the uncertainty about where exactly the traffic was being dropped. The intermittent nature of the issue complicated the investigation further, as there was no fixed pattern to when traffic flowed successfully and when it was blocked.
Because the issue did not occur outside office hours, and the hosting provider executed a change at the end of the workday, we were not sure if the issue would reoccur and were unable to continue our investigation at night.
With no traffic reaching our systems, we first contacted our CDN partner, because that is where traffic enters the route to the TOPdesk SaaS servers. While the CDN did have an ongoing issue at the time, it turned out not to be related to the issues we experienced. Since there was no confirmed disruption at the CDN, but previous disruptions did point to their infrastructure, we asked them to investigate. Direct contact proved difficult to establish, as our known escalation contacts had left the company or were unavailable. We have evaluated these communication issues with the CDN, and new contact points and communication procedures have been established.
Another cause for delay was that neither the hosting provider nor the SaaS Operations team was aware that the firewall was configured with SYN-flood protection. It took some time for the hosting provider to identify the component that caused this disruption. This limit was not known to the SaaS Operations team because it had been part of the infrastructure since before the CDN was taken into use; with the CDN in place, this SYN-flood protection had become superfluous.
An evaluation with the hosting provider is scheduled, in which we will evaluate our communication procedures and make sure that a complete infrastructure overview, with monitoring on all relevant components and limits, is available to quickly troubleshoot future disruptions.
Why was the faulty component not redundant?
The problematic component was the edge firewall within the hosting provider's network. This component has a failover partner and is redundant, but the component did not fail. It blocked traffic after a certain threshold was reached, as it was set up to do.
What is TOPdesk doing to prevent this from happening again?
While the root cause has been resolved, multiple points of improvement have been identified and addressed. These include:
- Checking all locations hosting TOPdesk instances for similar SYN-flood limits.
- Re-establishing direct communication lines with technicians and clear escalation paths for communicating with the CDN and hosting provider.
- Improving options for communicating with our suppliers directly from our internal incident investigations, to aid investigation speed and knowledge sharing.
- Updating procedures for updates on our Status page, both for incident communication and for announcing (emergency) maintenance.
Next to this, several follow-up actions have been identified to further improve our reliability and troubleshooting speed. These actions are listed below.
Follow up actions
Internally:
- Evaluate the communication channels and resources used for a disruption of this size.
- Create a system to automatically periodically log the right network traffic logs (MTR) to ensure this information is directly available when a similar disruption occurs.
Externally:
Plan a meeting with our Hosting Provider to establish:
- Why the configuration of this piece of hardware was not clear.
- Why there was SYN-flood protection in place while the CDN already protects the network from these attacks.
- Why it took so long to identify the problem, and whether we differ from the standard setup.
- Who will monitor the incoming connections and the activation of flood protection.
Plan a meeting with our CDN to establish:
- Whether we can create a tracing system to quickly determine which component is dropping traffic.
The impact
We fully recognize the impact this disruption has had on our UK1 customers and apologize for this incident. The amount of downtime experienced on August 23rd in particular is not acceptable and, as detailed above, we have multiple points of improvement to action.
If you have further questions regarding this issue, please contact your Account Manager or our Customer Success team. In the meantime, further comment from our senior management team on the recent disruptions in UK1 will be published in due course.
Once again, we apologize for the disruption this issue has caused.