Node behind NAT - Flannel setup #6941

koxu1996 · 2024-10-07T05:26:11Z

ℹ️ I am sharing this because I had a networking problem that had been bugging me for over 6 months, and today I finally found the solution 😌. Hope it helps someone!

TL;DR: If you are behind NAT, then enable persistent keepalive for WireGuard.

My story

I basically wanted to offload some traffic to my home machine (behind NAT), as it was idle most of the time. My cluster already had serviceLB enabled and the wireguard backend for Flannel, so I set up an RKE2 agent and hoped it would work right away with Traefik ingress.

However, from the beginning I had weird, intermittent connection issues. For example, cert-manager was failing with context deadline exceeded, which was caused by a request timeout. The workaround was ridiculous but always worked: create some pod, exec into it, and ping the target URL. At first I thought it was a DNS issue, but later I discovered that inter-node connectivity as a whole was somehow broken, at least for nodes trying to reach my home machine.

I have a dynamic public IP, so I configured my Mikrotik router to forward all required ports according to the docs. The most important was port 51820 UDP - Canal CNI with WireGuard IPv4. I spent hours setting up firewall rules (I even added a rule accepting all traffic), but my home node was still sometimes unreachable.

NAT workaround

I did a lot of things to track down the issue: analyzed iptables, observed traffic with tcpdump, logged packets on the router, but nothing helped. While playing around with network interfaces, mostly trying to understand the WireGuard connection, I discovered that a ping from my home machine to one of the nodes unblocked the connection between those two nodes for the next 3 minutes 🤔. This led me to spend even more time on the Mikrotik configuration, as this was clearly a NAT issue.

Final solution

The breakthrough came from running the wg show command, where I discovered that the latest handshake had happened over an hour ago. I thought it should happen much more frequently... some quick research and I discovered that WireGuard has persistent keepalives ✨:
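For anyone who wants to repeat this check, here is a minimal sketch that turns the timestamps from wg show into handshake ages. It relies on the `wg show <interface> latest-handshakes` subcommand, which prints one `<public-key> <unix-timestamp>` line per peer (0 meaning no handshake yet). The function name handshake_ages is made up for illustration, and the interface name flannel-wg in the usage comment is an assumption, so check `wg show interfaces` on your node first.

```shell
#!/bin/sh
# Sketch: print the age (in seconds) of each peer's latest WireGuard handshake.
# Input on stdin: output of `wg show <iface> latest-handshakes`,
# i.e. one "<public-key><TAB><unix-timestamp>" line per peer.
# Optional first argument: the "now" timestamp (defaults to the current time).
handshake_ages() {
    now="${1:-$(date +%s)}"
    while read -r peer ts; do
        if [ "$ts" -eq 0 ]; then
            # A timestamp of 0 means no handshake has completed yet.
            echo "$peer never"
        else
            echo "$peer $((now - ts))"
        fi
    done
}

# Usage (interface name is an assumption, adjust to your node):
#   wg show flannel-wg latest-handshakes | handshake_ages
```

An age in the thousands of seconds for a peer that should be reachable is the same symptom described above. With keepalives flowing, handshake ages should normally stay within a few minutes.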

Because NAT and stateful firewalls keep track of "connections", if a peer behind NAT or a firewall wishes to receive incoming packets, he must keep the NAT/firewall mapping valid, by periodically sending keepalive packets. This is called persistent keepalives. When this option is enabled, a keepalive packet is sent to the server endpoint once every interval seconds.

So I adjusted /var/lib/rancher/rke2/server/manifests/rke2-canal-config.yaml:

 apiVersion: helm.cattle.io/v1
 kind: HelmChartConfig
 metadata:
   name: rke2-canal
   namespace: kube-system
 spec:
   valuesContent: |-
     flannel:
       backend: "wireguard"
+      keepaliveInterval: 25

I restarted the Canal daemon set with kubectl rollout restart ds rke2-canal -n kube-system and boom: inter-node traffic is now working as expected 🚀!
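One way to confirm the setting actually reached WireGuard after the restart, as a rough sketch: `wg show <interface> persistent-keepalive` prints one line per peer with either the interval in seconds or off. The helper name peers_without_keepalive is hypothetical, and flannel-wg is again an assumed interface name.

```shell
#!/bin/sh
# Sketch: list peers that still have persistent keepalive disabled.
# Input on stdin: output of `wg show <iface> persistent-keepalive`,
# i.e. one "<public-key><TAB><seconds|off>" line per peer.
peers_without_keepalive() {
    while read -r peer interval; do
        if [ "$interval" = "off" ]; then
            echo "$peer"
        fi
    done
}

# Usage (interface name is an assumption, adjust to your node):
#   wg show flannel-wg persistent-keepalive | peers_without_keepalive
```

Empty output means every peer has a keepalive interval set. Any public key printed is a peer that would still depend on other traffic to keep the NAT mapping open.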

manuelbuil (Maintainer) · 2024-10-07T11:14:33Z

Thanks a lot for sharing this!
