[K8S] CoreDNS timeouts
Incident Report for Scaleway
Resolved
The DNS resolution failure has been resolved. To fix the issue, please proceed as follows:
Replace or reboot your nodes to get the fix.
Use an FQDN to resolve your addresses inside the Private Network. For instance, if you have a foo service in the mypn Private Network, foo alone will not resolve and you must use foo.mypn (see the example below).
Important notes:
Please avoid using an existing TLD (https://en.wikipedia.org/wiki/Top-level_domain) as a Private Network name, as it can cause resolution problems. For instance, if you name your Private Network fr, you will not be able to resolve google.fr externally: if there is no service named google in your Private Network, google.fr will not resolve at all, and if there is one, it will resolve to that service instead of the public address.
Please note that prod and dev are TLDs (https://data.iana.org/TLD/tlds-alpha-by-domain.txt).
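As an illustration, here is a minimal sketch of the FQDN behaviour, assuming a Private Network named mypn with a service named foo attached to it (both names are hypothetical), tested from a temporary pod:

    # Short name: expected not to resolve inside the Private Network
    kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never -- nslookup foo

    # FQDN with the Private Network name as suffix: expected to resolve
    kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never -- nslookup foo.mypn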
Posted Feb 20, 2024 - 16:49 CET
Update
We are still working on a permanent fix for this issue.
The issue can be temporarily mitigated by editing the ConfigMap kube-system/coredns and replacing forward . /etc/resolv.conf with forward . 169.254.169.254 for clusters using Private Networks, or with forward . [public resolver of choice] for multicloud and legacy public clusters.
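For reference, a minimal sketch of the mitigation, assuming a cluster attached to a Private Network and a default-style Corefile (only the forward line changes; the surrounding plugins are shown for context and may differ on your cluster):

    # kubectl -n kube-system edit configmap coredns
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        # was: forward . /etc/resolv.conf
        forward . 169.254.169.254
        cache 30
        loop
        reload
        loadbalance
    }

For multicloud and legacy public clusters, put your public resolver of choice after forward . instead. If the reload plugin is not enabled in your Corefile, restart CoreDNS for the change to take effect, e.g. kubectl -n kube-system rollout restart deployment coredns.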
Posted Feb 16, 2024 - 09:30 CET
Update
We are continuing to investigate this issue.
Posted Feb 15, 2024 - 11:19 CET
Investigating
Some nodes experience up to 10% DNS resolution failures with I/O timeouts to 169.254.53.53:53.

The issue has been identified and we are working on a fix.
Posted Feb 12, 2024 - 16:00 CET
This incident affected: Elements - AZ (fr-par-1, fr-par-2, fr-par-3, nl-ams-1, pl-waw-1, nl-ams-2, pl-waw-2, nl-ams-3, pl-waw-3) and Elements - Products (Kapsule).