route: properly handle multiple kernel prefix routes with same dst #385

KanjiMonster · 2024-05-14T11:54:26Z

When configuring IP(v6) addresses on interfaces, the kernel will install appropriate prefix routes (except when using noprefixroute).

When configuring addresses from the same subnet on multiple interfaces, the kernel will happily configure multiple prefix routes. Most prominent example are per interface IPv6 link local fe80::/64 routes.

Currently libnl treats these routes as identical, and thus will only keep the newest one of the routes in the cache (as each additional one is treated as an update to the previous route).

Therefore we need to extend route_id_attrs_get() to properly treat these routes as different:

For IPv4, these routes will have a preferred source set.
For IPv6, these routes will have a single nexthop pointing to the interface (OIF).

Also, only kernel may create these routes, attempts from userspace will be rejected when trying to add a second route for the same prefix.

ip route output:

$ ip r show table all
...
10.0.0.0/24 dev tun0 proto kernel scope link src 10.0.0.1
10.0.0.0/24 dev enp0s31f6 proto kernel scope link src 10.0.0.2
...
broadcast 10.0.0.255 dev tun0 table local proto kernel scope link src 10.0.0.1
broadcast 10.0.0.255 dev enp0s31f6 table local proto kernel scope link src 10.0.0.2
...
fe80::/64 dev tun0 proto kernel metric 256 pref medium
fe80::/64 dev enp0s31f6 proto kernel metric 1024 pref medium
...
multicast ff00::/8 dev enp0s31f6 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev tun0 table local proto kernel metric 256 pref medium

$ ip r show table all | wc -l
37

Before:

$ ./src/nl-route-list
...
inet 10.0.0.0/24 table main type unicast via dev tun0
...
inet 10.0.0.255 table local type broadcast via dev tun0
...
inet6 fe80::/64 table main type unicast via dev tun0
inet6 fe80::/64 table main type unicast via dev enp0s31f6
...
inet6 ff00::/8 table local type multicast via dev enp0s31f6

$ ./src/nl-route-list | grep -v cache | wc -l
34

After:

$ ./src/nl-route-list
...
inet 10.0.0.0/24 table main type unicast via dev tun0
inet 10.0.0.0/24 table main type unicast via dev enp0s31f6
...
inet6 fe80::/64 table main type unicast via dev tun0
inet6 fe80::/64 table main type unicast via dev enp0s31f6
...
inet6 ff00::/8 table local type multicast via dev enp0s31f6
inet6 ff00::/8 table local type multicast via dev tun0

$ ./src/nl-route-list | grep -v cache | wc -l
37

(ip route doesn't show cache routes)

When configuring IP(v6) addresses on interfaces, the kernel will install appropriate prefix routes (except when using noprefixroute). When configuring addresses from the same subnet on multiple interfaces, the kernel will happily configure multiple prefix routes. Most prominent example are per interface IPv6 link local fe80::/64 routes. Currently libnl treats these routes as identical, and thus will only keep the newest one of the routes in the cache (as each additional one is treated as an update to the previous route). Therefore we need to extend route_id_attrs_get() to properly treat these routes as different: * For IPv4, these routes will have a preferred source set. * For IPv6, these routes will have a single nexthop pointing to the interface (OIF). Also, only kernel may create these routes, attempts from userspace will be rejected when trying to add a second route for the same prefix. ip route output: $ ip r show table all ... 10.0.0.0/24 dev tun0 proto kernel scope link src 10.0.0.1 10.0.0.0/24 dev enp0s31f6 proto kernel scope link src 10.0.0.2 ... broadcast 10.0.0.255 dev tun0 table local proto kernel scope link src 10.0.0.1 broadcast 10.0.0.255 dev enp0s31f6 table local proto kernel scope link src 10.0.0.2 ... fe80::/64 dev tun0 proto kernel metric 256 pref medium fe80::/64 dev enp0s31f6 proto kernel metric 1024 pref medium ... multicast ff00::/8 dev enp0s31f6 table local proto kernel metric 256 pref medium multicast ff00::/8 dev tun0 table local proto kernel metric 256 pref medium $ ip r show table all | wc -l 37 Before: $ ./src/nl-route-list ... inet 10.0.0.0/24 table main type unicast via dev tun0 ... inet 10.0.0.255 table local type broadcast via dev tun0 ... inet6 fe80::/64 table main type unicast via dev tun0 inet6 fe80::/64 table main type unicast via dev enp0s31f6 ... inet6 ff00::/8 table local type multicast via dev enp0s31f6 $ ./src/nl-route-list | grep -v cache | wc -l 34 After: $ ./src/nl-route-list ... inet 10.0.0.0/24 table main type unicast via dev tun0 inet 10.0.0.0/24 table main type unicast via dev enp0s31f6 ... inet6 fe80::/64 table main type unicast via dev tun0 inet6 fe80::/64 table main type unicast via dev enp0s31f6 ... inet6 ff00::/8 table local type multicast via dev enp0s31f6 inet6 ff00::/8 table local type multicast via dev tun0 $ ./src/nl-route-list | grep -v cache | wc -l 37 (ip route doesn't show cache routes) Signed-off-by: Jonas Gorski <[email protected]>

thom311 · 2024-05-14T13:15:15Z

lib/route/route_obj.c

@@ -387,6 +387,39 @@ static uint32_t route_id_attrs_get(struct nl_object *obj)
 	if (route->rt_family == AF_MPLS)
 		rv &= ~ROUTE_ATTR_PRIO;

+	if (route->rt_protocol == RTPROT_KERNEL) {


this is not special to proto kernel routes.

I agree, that "what is the ID of a route" needs to be adjusted. I think it should just follow what NetworkManager does (here and here).

It's special in the case that only kernel can install "duplicate" routes for the same destination which have the same priority. Anybody else can only do that by setting different priorities (which is already part of the id attrs, so will be treated as unique routes).

At least in my limited testing.

did you use ip route append ~~or ip route prepend~~?

(EDIT: prepend doesn't exist)

#!/bin/bash set -x ip netns del x 2>/dev/null || : ip netns add x ip -n x link add v0 type veth peer v1 ip -n x link set v0 up ip -n x addr add 1:2:3::6/64 dev v0 ip -n x route append 1:2:3::99/128 via 1:2:3::1 ip -n x route append 1:2:3::99/128 via 1:2:3::2 ip -n x -6 addr ip -n x -6 route

Ah, I didn't try append, that indeed did work:

$ip address add 10.0.0.1/24 dev tun0 noprefixroute $ip address add 10.0.0.2/24 dev enp0s31f6 noprefixroute $ip route add 10.0.0.0/24 dev tun0 src 10.0.0.1 $ip route append 10.0.0.0/24 dev enp0s31f6 src 10.0.0.2

Still presented to user space as multiple routes though:

$ ip r 10.0.0.0/24 dev tun0 scope link src 10.0.0.1 10.0.0.0/24 dev enp0s31f6 scope link src 10.0.0.2

Likewise for IPv6:

ip netns del x >/dev/null ip netns add x ip -n x link add v0 type veth peer v1 ip -n x link set v0 up ip -n x link add v3 type veth peer v2 ip -n x link set v3 up ip -n x addr add 2001:db8:1::1/64 dev v0 ip -n x addr add 2001:db8:1::2/64 dev v3 ip -n x -6 addr ip -n x -6 route

=>

10: v0@v1: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 state LOWERLAYERDOWN qlen 1000 inet6 2001:db8:1::1/64 scope global tentative valid_lft forever preferred_lft forever 12: v3@v2: <BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 state UNKNOWN qlen 1000 inet6 2001:db8:1::2/64 scope global tentative valid_lft forever preferred_lft forever 2001:db8:1::/64 dev v0 proto kernel metric 256 linkdown pref medium 2001:db8:1::/64 dev v3 proto kernel metric 256 linkdown pref medium

(I have no idea what the v* mean so I'm just cargo culting there).

So I would think the (stop-gap?) solution is to do what I currently do for IPv6 routes also for IPv4 routes and then always, regardless of source/protocol?

Also I'm trying to fix the case where the same prefix route is present on different interfaces, not multiple nexthops on a single interface. Where the kernel sends each prefix route as its own RTM_NEWROUTE, not as a single route with multiple nexthops (or a single nexthop group).

Maybe I missed a explanatory comment there also, adding ROUTE_ATTR_MULTIPATH to the ID attributes is done to trigger a nexthop comparison by route_compare(), without it ignores the difference in the (single) nexthop and treats the routes as identical.

So this is only as a "please also check that the nexthop (= inferface) is identical". Not meant that these routes are actually MULTIPATH routes.

Note that this special treatment for IPv6 ECMP/multhop does not apply to IPv4 routes. In IPv4, a route can have 1 or more next-hops, and those next-hops are just part of the identity of the route. If you add a second IPv4 route, that differs from the first only by the list of nexthops, then simply a new route is added (and you have two similar IPv4 routes with different nexthop lists). Quite easy. On RTM_NEWROUTE message corresponds to one route object.

ip -n x addr add 2001:db8:1::2/64 dev v3

here you added onlink-routes. I guess, this merging of next-hops does not happen then (as you see). Compare my example, where the nexthops are gateways, then the merging happens.

What maybe would be the right way to model the IPv6 behaviors, is that IPv6 route objects are always only singlehop. Except, that one RTM_NEWROUTE message then can notify about multiple(!) single-hop routes (instead of one multi-hop routes). I did that in NetworkManager. Anyway, that is not what libnl currently does. It would be quite a change in behavior. What it does instead, it tries to search the cache for "similar" routes, and mangle them according to nexthop updates. That seems a questionable approach. It also seems hard to do efficiently (we cannot do a dictionary lookup, if the object's identity contains the next-hop list, but we don't know the object we search for).

So I would think the (stop-gap?) solution is to do what I currently do for IPv6 routes also for IPv4 routes and then always, regardless of source/protocol?

I think the first step is to extend the set of ID attributes. I would just follow what NetworkManager does. We can also add one ID attribute at a time, if you don't want to bother to investigate them all at once. What I think was wrong, is to tie that to if (route->rt_protocol == RTPROT_KERNEL).

Also I'm trying to fix the case where the same prefix route is present on different interfaces, not multiple nexthops on a single interface. Where the kernel sends each prefix route as its own RTM_NEWROUTE, not as a single route with multiple nexthops (or a single nexthop group).

Yes. Because they are on-link, right?

Maybe I missed a explanatory comment there also, adding ROUTE_ATTR_MULTIPATH to the ID attributes is done to trigger a nexthop comparison by route_compare(), without it ignores the difference in the (single) nexthop and treats the routes as identical.

that sounds right. Especially for IPv4.

And for IPv6, I think the proper solution would be that a route object always only has one next hop. In that case, honoring ROUTE_ATTR_MULTIPATH would also be correct (albeit there is nothing "multi"). I think in step 1, we can ignore that problem.

So I guess there will be three types of routes:

no nexthop / synthetic nexthop based on RTA_OIF, RTF_VIA etc

multipath with 1 or more nexthops

nh id referencing a nexthop (group)

And via ip route append you can have multiple routes out of these. E.g. I managed to setup:

$ ip -n x -6 route show 2001:db8:3::/64 2001:db8:3::/64 nhid 20 metric 1024 pref medium nexthop via 2001:db8:1::2 dev v0 weight 1 nexthop via 2001:db8:2::2 dev v3 weight 1 2001:db8:3::/64 nhid 11 via 2001:db8:2::2 dev v3 metric 1024 pref medium 2001:db8:3::/64 via 2001:db8:2::3 dev v3 metric 1024 pref medium

where I maybe even found some weird kernel behavior With this setup, if I do

ip -n x -6 route append 2001:db8:3::/64 nexthop via 2001:db8:1::2 dev v0 nexthop via 2001:db8:2::3 dev v3 RTNETLINK answers: File exists

So the kernel says no, but it didn't do nothing:

$ ip -n x -6 route show 2001:db8:3::/64 2001:db8:3::/64 nhid 11 via 2001:db8:2::2 dev v3 metric 1024 pref medium 2001:db8:3::/64 metric 1024 pref medium nexthop via 2001:db8:2::3 dev v3 weight 1 nexthop via 2001:db8:1::2 dev v0 weight 1

It added the route regardless, and deleted the one referencing nh_id 20 and the "single path" route with direct via.

Feels like it shouldn't modify the configuration and return an error. Will need to investigate if this behavior is intended/expected.

thom311 · 2024-05-14T13:22:12Z

lib/route/route_obj.c

+
+		/*
+		 * For IPv6 addresses, the prefix routes will have
+		 * a single dev nexthop.


I think IPv6 multihop routes need a special treatment to make sense with a concept of "identity" for route objects.

It's clear that the next-hop is part of the identity. If you add a route A with a certain next hop, and then add another route B that only differs by their nexthop, then kernels RTM_NEWROUTE message will pretend this is one route C with two next hops. So what happend, was that adding route B caused route A to be replaced by route C. That makes only sense, if we think that IPv6 routes can only have one nexthop. With the minor exception, that one RTM_NEWROUTE/RTM_DELROUTE can affect multiple singlehop routes at once.

The idea that one RTM_DELROUTE modifies an existing route's nexthop list (thereby creating another route altogether) makes IMO no sense.

at least the IPv6LL routes are treated as unique routes and not a single ECMP route:

$ strace ip -6 r show recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base= ... /* inet6 fe80::/64 table main type unicast via dev tun0 */ [{nlmsg_len=116, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=1715693113, nlmsg_pid=682500}, {rtm_family=AF_INET6, rtm_dst_len=64, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=20, nla_type=RTA_DST}, inet_pton(AF_INET6, "fe80::")], [{nla_len=8, nla_type=RTA_PRIORITY}, 256], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("tun0")], [{nla_len=36, nla_type=RTA_CACHEINFO}, {rta_clntref=0, rta_lastuse=0, rta_expires=0, rta_error=0, rta_used=0, rta_id=0, rta_ts=0, rta_tsage=0}], [{nla_len=5, nla_type=RTA_PREF}, 0]]], /* inet6 fe80::/64 table main type unicast via dev enp0s31f6 */ [{nlmsg_len=116, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=1715693113, nlmsg_pid=682500}, {rtm_family=AF_INET6, rtm_dst_len=64, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=20, nla_type=RTA_DST}, inet_pton(AF_INET6, "fe80::")], [{nla_len=8, nla_type=RTA_PRIORITY}, 1024], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("enp0s31f6")], [{nla_len=36, nla_type=RTA_CACHEINFO}, {rta_clntref=0, rta_lastuse=0, rta_expires=0, rta_error=0, rta_used=0, rta_id=0, rta_ts=0, rta_tsage=0}], [{nla_len=5, nla_type=RTA_PREF}, 0]]] ...

Ah right, and for some reason the priorities are different on my system (likely because one if on a VPN tun device), which is why before even shows them as different.

thom311 reviewed May 14, 2024

View reviewed changes

KanjiMonster marked this pull request as draft May 17, 2024 14:01

thom311 added the change-requested label Aug 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

route: properly handle multiple kernel prefix routes with same dst #385

route: properly handle multiple kernel prefix routes with same dst #385

KanjiMonster commented May 14, 2024 •

edited

Loading

thom311 May 14, 2024

KanjiMonster May 14, 2024

thom311 May 14, 2024 •

edited

Loading

thom311 May 14, 2024

KanjiMonster May 14, 2024 •

edited

Loading

thom311 May 14, 2024

KanjiMonster May 15, 2024

thom311 May 14, 2024

KanjiMonster May 14, 2024

KanjiMonster May 14, 2024

route: properly handle multiple kernel prefix routes with same dst #385

Are you sure you want to change the base?

route: properly handle multiple kernel prefix routes with same dst #385

Conversation

KanjiMonster commented May 14, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

thom311 May 14, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

KanjiMonster May 14, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

KanjiMonster commented May 14, 2024 •

edited

Loading

thom311 May 14, 2024 •

edited

Loading

KanjiMonster May 14, 2024 •

edited

Loading