
netbird 0.32.0 breaks K3s 1.32.2+k3s1 with flannel due to iptables conflicts #2926

Open
christian-schlichtherle opened this issue Nov 21, 2024 · 3 comments


@christian-schlichtherle

Describe the problem

We operate an IoT project where some K3s nodes are placed at customer premises. We install Netbird 0.32.0 on each node first, then install K3s v1.32.2+k3s1 with flannel. When installing K3s, we provide flannel-iface=wt0 to tell it to use the Netbird interface for node-to-node communication.
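For reference, a sketch of such a K3s install (the exact invocation on our nodes may differ):

curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.32.2+k3s1 sh -s - --flannel-iface=wt0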

This works to some extent, but there is a problem: when the Netbird service starts, it sets up its iptables rules, and flannel sets up its own iptables rules as well. There seem to be conflicts between those rules, so communication breaks after every restart of the Netbird service, e.g. when installing an upgrade. As a workaround, I have to restart the k3s(-agent) service after every restart of the Netbird service.

Summing it up, to restart all Netbird services in the cluster, I have to do something like this:

ansible k3s_server -b -m shell --forks 1 -a 'systemctl restart netbird && sleep 3 && systemctl restart k3s'
ansible k3s_agent -b -m shell -a 'systemctl restart netbird && sleep 3 && systemctl restart k3s-agent'

As you can imagine, this is not a sustainable solution, just a hacky workaround.

Is this a known issue? What are my options: wait for a fix, or try another CNI such as Cilium?

To Reproduce

Steps to reproduce the behavior:

  • Install Netbird on a bunch of nodes
  • Install K3s on the nodes with flannel-iface=wt0
  • Restart only the netbird service and watch in-cluster communication break, e.g. kubectl logs <any-pod> no longer works (see the check below).
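A minimal way to observe the breakage (the pod and target names below are illustrative, not from our cluster):

kubectl run probe --image=busybox --restart=Never -- sleep 3600
kubectl get pods -o wide                         # note the IP of a pod on another node
kubectl exec probe -- ping -c 3 <other-pod-ip>   # fails after 'systemctl restart netbird'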

Expected behavior

Restarting the NetBird service should not break in-cluster communication; flannel's iptables rules should be left alone.

Are you using NetBird Cloud?

Yes

NetBird version

0.32.0

NetBird status -dA output:

n/a

Do you face any (non-mobile) client issues?

Yes.

Screenshots

n/a

Additional context

See above.

@christian-schlichtherle (Author)

BTW: This is a long-standing problem; I just had no time to report it earlier.

@lixmal (Contributor) commented Nov 22, 2024

Hi @christian-schlichtherle, can you post your iptables/nftables before and after your workaround?

iptables-save
nft list ruleset

You might need to install nftables for the nft tool to be available
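For example, on Debian/Ubuntu (the package name may differ on other distros):

sudo apt-get install nftables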

@christian-schlichtherle (Author)

@lixmal I have run these commands. Unfortunately, the output reveals too much sensitive information to share here, but in order to get a meaningful diff, I processed the output as follows:

ansible my-worker-node -b -a 'iptables-save' > 10_iptables-save_before
ansible my-worker-node -b -a 'nft list ruleset' > 10_nft_list_ruleset_before
ansible my-worker-node -b -m service -a 'name=netbird state=restarted'
ansible my-worker-node -b -a 'iptables-save' > 20_iptables-save_after_netbird_restart
ansible my-worker-node -b -a 'nft list ruleset' > 20_nft_list_ruleset_after_netbird_restart
ansible my-worker-node -b -m service -a 'name=k3s-agent state=restarted'
ansible my-worker-node -b -a 'iptables-save' > 30_iptables-save_after_k3s_agent_restart
ansible my-worker-node -b -a 'nft list ruleset' > 30_nft_list_ruleset_after_k3s_agent_restart
for file in ??_iptables-save_*; do grep -v -e '^#' -e '^\*' -e 'COMMIT' < "$file" | sort > "$file.sorted"; done
for file in ??_nft_list_ruleset_*; do grep -v -e '^#' -e '^\s*$' -e '^\s*table' < "$file" | sort > "$file.sorted"; done
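This yields *.sorted files that can be compared pairwise with a plain text diff, e.g.:

diff 10_iptables-save_before.sorted 20_iptables-save_after_netbird_restart.sorted
diff 10_nft_list_ruleset_before.sorted 20_nft_list_ruleset_after_netbird_restart.sorted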

The only difference between the 10_*.sorted and 20_*.sorted files was in the packet counters. Yet, pod-to-pod communication is definitely broken after restarting the netbird service. So now we know that it has nothing to do with iptables/nftables rules. I'm sorry for the misleading title of this issue.

Another mistake I made in my original posting was saying that a restart of the netbird service breaks kubectl logs: that's not correct. This command still works (it doesn't require flannel). However, pod-to-pod communication is definitely broken; in our case, a client could not connect to another service anymore. After a final restart of the k3s-agent service, it worked again.

Summing it up, a restart of the netbird service does break flannel, even though the details in my original posting were not exactly correct. I hope this information helps to reproduce the issue.
