Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

route: properly handle multiple kernel prefix routes with same dst #385

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

KanjiMonster
Copy link
Contributor

@KanjiMonster KanjiMonster commented May 14, 2024

When configuring IP(v6) addresses on interfaces, the kernel will install appropriate prefix routes (except when using noprefixroute).

When configuring addresses from the same subnet on multiple interfaces, the kernel will happily configure multiple prefix routes. Most prominent example are per interface IPv6 link local fe80::/64 routes.

Currently libnl treats these routes as identical, and thus will only keep the newest one of the routes in the cache (as each additional one is treated as an update to the previous route).

Therefore we need to extend route_id_attrs_get() to properly treat these routes as different:

  • For IPv4, these routes will have a preferred source set.
  • For IPv6, these routes will have a single nexthop pointing to the interface (OIF).

Also, only kernel may create these routes, attempts from userspace will be rejected when trying to add a second route for the same prefix.

ip route output:

$ ip r show table all
...
10.0.0.0/24 dev tun0 proto kernel scope link src 10.0.0.1
10.0.0.0/24 dev enp0s31f6 proto kernel scope link src 10.0.0.2
...
broadcast 10.0.0.255 dev tun0 table local proto kernel scope link src 10.0.0.1
broadcast 10.0.0.255 dev enp0s31f6 table local proto kernel scope link src 10.0.0.2
...
fe80::/64 dev tun0 proto kernel metric 256 pref medium
fe80::/64 dev enp0s31f6 proto kernel metric 1024 pref medium
...
multicast ff00::/8 dev enp0s31f6 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev tun0 table local proto kernel metric 256 pref medium

$ ip r show table all | wc -l
37

Before:

$ ./src/nl-route-list
...
inet 10.0.0.0/24 table main type unicast via dev tun0
...
inet 10.0.0.255 table local type broadcast via dev tun0
...
inet6 fe80::/64 table main type unicast via dev tun0
inet6 fe80::/64 table main type unicast via dev enp0s31f6
...
inet6 ff00::/8 table local type multicast via dev enp0s31f6

$ ./src/nl-route-list | grep -v cache | wc -l
34

After:

$ ./src/nl-route-list
...
inet 10.0.0.0/24 table main type unicast via dev tun0
inet 10.0.0.0/24 table main type unicast via dev enp0s31f6
...
inet6 fe80::/64 table main type unicast via dev tun0
inet6 fe80::/64 table main type unicast via dev enp0s31f6
...
inet6 ff00::/8 table local type multicast via dev enp0s31f6
inet6 ff00::/8 table local type multicast via dev tun0

$ ./src/nl-route-list | grep -v cache | wc -l
37

(ip route doesn't show cache routes)

When configuring IP(v6) addresses on interfaces, the kernel will install
appropriate prefix routes (except when using noprefixroute).

When configuring addresses from the same subnet on multiple interfaces,
the kernel will happily configure multiple prefix routes. Most prominent
example are per interface IPv6 link local fe80::/64 routes.

Currently libnl treats these routes as identical, and thus will only
keep the newest one of the routes in the cache (as each additional one
is treated as an update to the previous route).

Therefore we need to extend route_id_attrs_get() to properly treat these
routes as different:

* For IPv4, these routes will have a preferred source set.
* For IPv6, these routes will have a single nexthop pointing to the
  interface (OIF).

Also, only kernel may create these routes, attempts from userspace will
be rejected when trying to add a second route for the same prefix.

ip route output:
$ ip r show table all
...
10.0.0.0/24 dev tun0 proto kernel scope link src 10.0.0.1
10.0.0.0/24 dev enp0s31f6 proto kernel scope link src 10.0.0.2
...
broadcast 10.0.0.255 dev tun0 table local proto kernel scope link src 10.0.0.1
broadcast 10.0.0.255 dev enp0s31f6 table local proto kernel scope link src 10.0.0.2
...
fe80::/64 dev tun0 proto kernel metric 256 pref medium
fe80::/64 dev enp0s31f6 proto kernel metric 1024 pref medium
...
multicast ff00::/8 dev enp0s31f6 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev tun0 table local proto kernel metric 256 pref medium

$ ip r show table all | wc -l
37

Before:
$ ./src/nl-route-list
...
inet 10.0.0.0/24 table main type unicast via dev tun0
...
inet 10.0.0.255 table local type broadcast via dev tun0
...
inet6 fe80::/64 table main type unicast via dev tun0
inet6 fe80::/64 table main type unicast via dev enp0s31f6
...
inet6 ff00::/8 table local type multicast via dev enp0s31f6

$ ./src/nl-route-list | grep -v cache | wc -l
34

After:
$ ./src/nl-route-list
...
inet 10.0.0.0/24 table main type unicast via dev tun0
inet 10.0.0.0/24 table main type unicast via dev enp0s31f6
...
inet6 fe80::/64 table main type unicast via dev tun0
inet6 fe80::/64 table main type unicast via dev enp0s31f6
...
inet6 ff00::/8 table local type multicast via dev enp0s31f6
inet6 ff00::/8 table local type multicast via dev tun0

$ ./src/nl-route-list | grep -v cache | wc -l
37

(ip route doesn't show cache routes)

Signed-off-by: Jonas Gorski <[email protected]>
@@ -387,6 +387,39 @@ static uint32_t route_id_attrs_get(struct nl_object *obj)
if (route->rt_family == AF_MPLS)
rv &= ~ROUTE_ATTR_PRIO;

if (route->rt_protocol == RTPROT_KERNEL) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not special to proto kernel routes.

I agree, that "what is the ID of a route" needs to be adjusted. I think it should just follow what NetworkManager does (here and here).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's special in the case that only kernel can install "duplicate" routes for the same destination which have the same priority. Anybody else can only do that by setting different priorities (which is already part of the id attrs, so will be treated as unique routes).

At least in my limited testing.

Copy link
Owner

@thom311 thom311 May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you use ip route append or ip route prepend?

(EDIT: prepend doesn't exist)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#!/bin/bash

set -x

ip netns del x 2>/dev/null || :
ip netns add x
ip -n x link add v0 type veth peer v1
ip -n x link set v0 up

ip -n x addr add 1:2:3::6/64 dev v0

ip -n x route append 1:2:3::99/128 via 1:2:3::1
ip -n x route append 1:2:3::99/128 via 1:2:3::2

ip -n x -6 addr
ip -n x -6 route

Copy link
Contributor Author

@KanjiMonster KanjiMonster May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't try append, that indeed did work:

$ip address add 10.0.0.1/24 dev tun0 noprefixroute
$ip address add 10.0.0.2/24 dev enp0s31f6 noprefixroute
$ip route add 10.0.0.0/24 dev tun0 src 10.0.0.1
$ip route append 10.0.0.0/24 dev enp0s31f6 src 10.0.0.2

Still presented to user space as multiple routes though:

$ ip r
10.0.0.0/24 dev tun0 scope link src 10.0.0.1
10.0.0.0/24 dev enp0s31f6 scope link src 10.0.0.2 

Likewise for IPv6:

ip netns del x >/dev/null
ip netns add x
ip -n x link add v0 type veth peer v1
ip -n x link set v0 up
ip -n x link add v3 type veth peer v2
ip -n x link set v3 up

ip -n x addr add 2001:db8:1::1/64 dev v0
ip -n x addr add 2001:db8:1::2/64 dev v3

ip -n x -6 addr
ip -n x -6 route

=>

10: v0@v1: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 state LOWERLAYERDOWN qlen 1000
    inet6 2001:db8:1::1/64 scope global tentative 
       valid_lft forever preferred_lft forever
12: v3@v2: <BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 state UNKNOWN qlen 1000
    inet6 2001:db8:1::2/64 scope global tentative 
       valid_lft forever preferred_lft forever
2001:db8:1::/64 dev v0 proto kernel metric 256 linkdown pref medium
2001:db8:1::/64 dev v3 proto kernel metric 256 linkdown pref medium

(I have no idea what the v* mean so I'm just cargo culting there).

So I would think the (stop-gap?) solution is to do what I currently do for IPv6 routes also for IPv4 routes and then always, regardless of source/protocol?

Also I'm trying to fix the case where the same prefix route is present on different interfaces, not multiple nexthops on a single interface. Where the kernel sends each prefix route as its own RTM_NEWROUTE, not as a single route with multiple nexthops (or a single nexthop group).

Maybe I missed a explanatory comment there also, adding ROUTE_ATTR_MULTIPATH to the ID attributes is done to trigger a nexthop comparison by route_compare(), without it ignores the difference in the (single) nexthop and treats the routes as identical.

So this is only as a "please also check that the nexthop (= inferface) is identical". Not meant that these routes are actually MULTIPATH routes.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this special treatment for IPv6 ECMP/multhop does not apply to IPv4 routes. In IPv4, a route can have 1 or more next-hops, and those next-hops are just part of the identity of the route. If you add a second IPv4 route, that differs from the first only by the list of nexthops, then simply a new route is added (and you have two similar IPv4 routes with different nexthop lists). Quite easy. On RTM_NEWROUTE message corresponds to one route object.

ip -n x addr add 2001:db8:1::2/64 dev v3

here you added onlink-routes. I guess, this merging of next-hops does not happen then (as you see). Compare my example, where the nexthops are gateways, then the merging happens.

What maybe would be the right way to model the IPv6 behaviors, is that IPv6 route objects are always only singlehop. Except, that one RTM_NEWROUTE message then can notify about multiple(!) single-hop routes (instead of one multi-hop routes). I did that in NetworkManager. Anyway, that is not what libnl currently does. It would be quite a change in behavior. What it does instead, it tries to search the cache for "similar" routes, and mangle them according to nexthop updates. That seems a questionable approach. It also seems hard to do efficiently (we cannot do a dictionary lookup, if the object's identity contains the next-hop list, but we don't know the object we search for).

So I would think the (stop-gap?) solution is to do what I currently do for IPv6 routes also for IPv4 routes and then always, regardless of source/protocol?

I think the first step is to extend the set of ID attributes. I would just follow what NetworkManager does. We can also add one ID attribute at a time, if you don't want to bother to investigate them all at once. What I think was wrong, is to tie that to if (route->rt_protocol == RTPROT_KERNEL).

Also I'm trying to fix the case where the same prefix route is present on different interfaces, not multiple nexthops on a single interface. Where the kernel sends each prefix route as its own RTM_NEWROUTE, not as a single route with multiple nexthops (or a single nexthop group).

Yes. Because they are on-link, right?

Maybe I missed a explanatory comment there also, adding ROUTE_ATTR_MULTIPATH to the ID attributes is done to trigger a nexthop comparison by route_compare(), without it ignores the difference in the (single) nexthop and treats the routes as identical.

that sounds right. Especially for IPv4.

And for IPv6, I think the proper solution would be that a route object always only has one next hop. In that case, honoring ROUTE_ATTR_MULTIPATH would also be correct (albeit there is nothing "multi"). I think in step 1, we can ignore that problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I guess there will be three types of routes:

  • no nexthop / synthetic nexthop based on RTA_OIF, RTF_VIA etc
  • multipath with 1 or more nexthops
  • nh id referencing a nexthop (group)

And via ip route append you can have multiple routes out of these. E.g. I managed to setup:

$ ip -n x -6 route show 2001:db8:3::/64
2001:db8:3::/64 nhid 20 metric 1024 pref medium
	nexthop via 2001:db8:1::2 dev v0 weight 1 
	nexthop via 2001:db8:2::2 dev v3 weight 1 
2001:db8:3::/64 nhid 11 via 2001:db8:2::2 dev v3 metric 1024 pref medium
2001:db8:3::/64 via 2001:db8:2::3 dev v3 metric 1024 pref medium

where I maybe even found some weird kernel behavior With this setup, if I do

ip -n x -6 route append 2001:db8:3::/64 nexthop via 2001:db8:1::2 dev v0 nexthop via 2001:db8:2::3 dev v3
RTNETLINK answers: File exists

So the kernel says no, but it didn't do nothing:

$ ip -n x -6 route show 2001:db8:3::/64
2001:db8:3::/64 nhid 11 via 2001:db8:2::2 dev v3 metric 1024 pref medium
2001:db8:3::/64 metric 1024 pref medium
	nexthop via 2001:db8:2::3 dev v3 weight 1 
	nexthop via 2001:db8:1::2 dev v0 weight 1 

It added the route regardless, and deleted the one referencing nh_id 20 and the "single path" route with direct via.

Feels like it shouldn't modify the configuration and return an error. Will need to investigate if this behavior is intended/expected.


/*
* For IPv6 addresses, the prefix routes will have
* a single dev nexthop.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think IPv6 multihop routes need a special treatment to make sense with a concept of "identity" for route objects.

It's clear that the next-hop is part of the identity. If you add a route A with a certain next hop, and then add another route B that only differs by their nexthop, then kernels RTM_NEWROUTE message will pretend this is one route C with two next hops. So what happend, was that adding route B caused route A to be replaced by route C. That makes only sense, if we think that IPv6 routes can only have one nexthop. With the minor exception, that one RTM_NEWROUTE/RTM_DELROUTE can affect multiple singlehop routes at once.

The idea that one RTM_DELROUTE modifies an existing route's nexthop list (thereby creating another route altogether) makes IMO no sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at least the IPv6LL routes are treated as unique routes and not a single ECMP route:

$ strace ip -6 r show 
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=
...
/* inet6 fe80::/64 table main type unicast via dev tun0 */
[{nlmsg_len=116, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=1715693113, nlmsg_pid=682500}, {rtm_family=AF_INET6, rtm_dst_len=64, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=20, nla_type=RTA_DST}, inet_pton(AF_INET6, "fe80::")], [{nla_len=8, nla_type=RTA_PRIORITY}, 256], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("tun0")], [{nla_len=36, nla_type=RTA_CACHEINFO}, {rta_clntref=0, rta_lastuse=0, rta_expires=0, rta_error=0, rta_used=0, rta_id=0, rta_ts=0, rta_tsage=0}], [{nla_len=5, nla_type=RTA_PREF}, 0]]],

/* inet6 fe80::/64 table main type unicast via dev enp0s31f6 */
[{nlmsg_len=116, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=1715693113, nlmsg_pid=682500}, {rtm_family=AF_INET6, rtm_dst_len=64, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=20, nla_type=RTA_DST}, inet_pton(AF_INET6, "fe80::")], [{nla_len=8, nla_type=RTA_PRIORITY}, 1024], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("enp0s31f6")], [{nla_len=36, nla_type=RTA_CACHEINFO}, {rta_clntref=0, rta_lastuse=0, rta_expires=0, rta_error=0, rta_used=0, rta_id=0, rta_ts=0, rta_tsage=0}], [{nla_len=5, nla_type=RTA_PREF}, 0]]]
...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, and for some reason the priorities are different on my system (likely because one if on a VPN tun device), which is why before even shows them as different.

@KanjiMonster KanjiMonster marked this pull request as draft May 17, 2024 14:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants