Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

route: properly handle multiple kernel prefix routes with same dst #385

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
33 changes: 33 additions & 0 deletions lib/route/route_obj.c
Original file line number Diff line number Diff line change
Expand Up @@ -387,6 +387,39 @@ static uint32_t route_id_attrs_get(struct nl_object *obj)
if (route->rt_family == AF_MPLS)
rv &= ~ROUTE_ATTR_PRIO;

if (route->rt_protocol == RTPROT_KERNEL) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not special to proto kernel routes.

I agree, that "what is the ID of a route" needs to be adjusted. I think it should just follow what NetworkManager does (here and here).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's special in the case that only kernel can install "duplicate" routes for the same destination which have the same priority. Anybody else can only do that by setting different priorities (which is already part of the id attrs, so will be treated as unique routes).

At least in my limited testing.

Copy link
Owner

@thom311 thom311 May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you use ip route append or ip route prepend?

(EDIT: prepend doesn't exist)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#!/bin/bash

set -x

ip netns del x 2>/dev/null || :
ip netns add x
ip -n x link add v0 type veth peer v1
ip -n x link set v0 up

ip -n x addr add 1:2:3::6/64 dev v0

ip -n x route append 1:2:3::99/128 via 1:2:3::1
ip -n x route append 1:2:3::99/128 via 1:2:3::2

ip -n x -6 addr
ip -n x -6 route

Copy link
Contributor Author

@KanjiMonster KanjiMonster May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't try append, that indeed did work:

$ip address add 10.0.0.1/24 dev tun0 noprefixroute
$ip address add 10.0.0.2/24 dev enp0s31f6 noprefixroute
$ip route add 10.0.0.0/24 dev tun0 src 10.0.0.1
$ip route append 10.0.0.0/24 dev enp0s31f6 src 10.0.0.2

Still presented to user space as multiple routes though:

$ ip r
10.0.0.0/24 dev tun0 scope link src 10.0.0.1
10.0.0.0/24 dev enp0s31f6 scope link src 10.0.0.2 

Likewise for IPv6:

ip netns del x >/dev/null
ip netns add x
ip -n x link add v0 type veth peer v1
ip -n x link set v0 up
ip -n x link add v3 type veth peer v2
ip -n x link set v3 up

ip -n x addr add 2001:db8:1::1/64 dev v0
ip -n x addr add 2001:db8:1::2/64 dev v3

ip -n x -6 addr
ip -n x -6 route

=>

10: v0@v1: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 state LOWERLAYERDOWN qlen 1000
    inet6 2001:db8:1::1/64 scope global tentative 
       valid_lft forever preferred_lft forever
12: v3@v2: <BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 state UNKNOWN qlen 1000
    inet6 2001:db8:1::2/64 scope global tentative 
       valid_lft forever preferred_lft forever
2001:db8:1::/64 dev v0 proto kernel metric 256 linkdown pref medium
2001:db8:1::/64 dev v3 proto kernel metric 256 linkdown pref medium

(I have no idea what the v* mean so I'm just cargo culting there).

So I would think the (stop-gap?) solution is to do what I currently do for IPv6 routes also for IPv4 routes and then always, regardless of source/protocol?

Also I'm trying to fix the case where the same prefix route is present on different interfaces, not multiple nexthops on a single interface. Where the kernel sends each prefix route as its own RTM_NEWROUTE, not as a single route with multiple nexthops (or a single nexthop group).

Maybe I missed a explanatory comment there also, adding ROUTE_ATTR_MULTIPATH to the ID attributes is done to trigger a nexthop comparison by route_compare(), without it ignores the difference in the (single) nexthop and treats the routes as identical.

So this is only as a "please also check that the nexthop (= inferface) is identical". Not meant that these routes are actually MULTIPATH routes.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this special treatment for IPv6 ECMP/multhop does not apply to IPv4 routes. In IPv4, a route can have 1 or more next-hops, and those next-hops are just part of the identity of the route. If you add a second IPv4 route, that differs from the first only by the list of nexthops, then simply a new route is added (and you have two similar IPv4 routes with different nexthop lists). Quite easy. On RTM_NEWROUTE message corresponds to one route object.

ip -n x addr add 2001:db8:1::2/64 dev v3

here you added onlink-routes. I guess, this merging of next-hops does not happen then (as you see). Compare my example, where the nexthops are gateways, then the merging happens.

What maybe would be the right way to model the IPv6 behaviors, is that IPv6 route objects are always only singlehop. Except, that one RTM_NEWROUTE message then can notify about multiple(!) single-hop routes (instead of one multi-hop routes). I did that in NetworkManager. Anyway, that is not what libnl currently does. It would be quite a change in behavior. What it does instead, it tries to search the cache for "similar" routes, and mangle them according to nexthop updates. That seems a questionable approach. It also seems hard to do efficiently (we cannot do a dictionary lookup, if the object's identity contains the next-hop list, but we don't know the object we search for).

So I would think the (stop-gap?) solution is to do what I currently do for IPv6 routes also for IPv4 routes and then always, regardless of source/protocol?

I think the first step is to extend the set of ID attributes. I would just follow what NetworkManager does. We can also add one ID attribute at a time, if you don't want to bother to investigate them all at once. What I think was wrong, is to tie that to if (route->rt_protocol == RTPROT_KERNEL).

Also I'm trying to fix the case where the same prefix route is present on different interfaces, not multiple nexthops on a single interface. Where the kernel sends each prefix route as its own RTM_NEWROUTE, not as a single route with multiple nexthops (or a single nexthop group).

Yes. Because they are on-link, right?

Maybe I missed a explanatory comment there also, adding ROUTE_ATTR_MULTIPATH to the ID attributes is done to trigger a nexthop comparison by route_compare(), without it ignores the difference in the (single) nexthop and treats the routes as identical.

that sounds right. Especially for IPv4.

And for IPv6, I think the proper solution would be that a route object always only has one next hop. In that case, honoring ROUTE_ATTR_MULTIPATH would also be correct (albeit there is nothing "multi"). I think in step 1, we can ignore that problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I guess there will be three types of routes:

  • no nexthop / synthetic nexthop based on RTA_OIF, RTF_VIA etc
  • multipath with 1 or more nexthops
  • nh id referencing a nexthop (group)

And via ip route append you can have multiple routes out of these. E.g. I managed to setup:

$ ip -n x -6 route show 2001:db8:3::/64
2001:db8:3::/64 nhid 20 metric 1024 pref medium
	nexthop via 2001:db8:1::2 dev v0 weight 1 
	nexthop via 2001:db8:2::2 dev v3 weight 1 
2001:db8:3::/64 nhid 11 via 2001:db8:2::2 dev v3 metric 1024 pref medium
2001:db8:3::/64 via 2001:db8:2::3 dev v3 metric 1024 pref medium

where I maybe even found some weird kernel behavior With this setup, if I do

ip -n x -6 route append 2001:db8:3::/64 nexthop via 2001:db8:1::2 dev v0 nexthop via 2001:db8:2::3 dev v3
RTNETLINK answers: File exists

So the kernel says no, but it didn't do nothing:

$ ip -n x -6 route show 2001:db8:3::/64
2001:db8:3::/64 nhid 11 via 2001:db8:2::2 dev v3 metric 1024 pref medium
2001:db8:3::/64 metric 1024 pref medium
	nexthop via 2001:db8:2::3 dev v3 weight 1 
	nexthop via 2001:db8:1::2 dev v0 weight 1 

It added the route regardless, and deleted the one referencing nh_id 20 and the "single path" route with direct via.

Feels like it shouldn't modify the configuration and return an error. Will need to investigate if this behavior is intended/expected.

/*
* If configuring Ip(v4) addresses for the same prefix on
* different interfaces, the kernel will install a
* prefix route for each interface with the ip address
* as preferred source.
*/
if (route->rt_family == AF_INET &&
route->rt_scope == RT_SCOPE_LINK)
rv |= ROUTE_ATTR_PREF_SRC;

/*
* For IPv6 addresses, the prefix routes will have
* a single dev nexthop.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think IPv6 multihop routes need a special treatment to make sense with a concept of "identity" for route objects.

It's clear that the next-hop is part of the identity. If you add a route A with a certain next hop, and then add another route B that only differs by their nexthop, then kernels RTM_NEWROUTE message will pretend this is one route C with two next hops. So what happend, was that adding route B caused route A to be replaced by route C. That makes only sense, if we think that IPv6 routes can only have one nexthop. With the minor exception, that one RTM_NEWROUTE/RTM_DELROUTE can affect multiple singlehop routes at once.

The idea that one RTM_DELROUTE modifies an existing route's nexthop list (thereby creating another route altogether) makes IMO no sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at least the IPv6LL routes are treated as unique routes and not a single ECMP route:

$ strace ip -6 r show 
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=
...
/* inet6 fe80::/64 table main type unicast via dev tun0 */
[{nlmsg_len=116, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=1715693113, nlmsg_pid=682500}, {rtm_family=AF_INET6, rtm_dst_len=64, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=20, nla_type=RTA_DST}, inet_pton(AF_INET6, "fe80::")], [{nla_len=8, nla_type=RTA_PRIORITY}, 256], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("tun0")], [{nla_len=36, nla_type=RTA_CACHEINFO}, {rta_clntref=0, rta_lastuse=0, rta_expires=0, rta_error=0, rta_used=0, rta_id=0, rta_ts=0, rta_tsage=0}], [{nla_len=5, nla_type=RTA_PREF}, 0]]],

/* inet6 fe80::/64 table main type unicast via dev enp0s31f6 */
[{nlmsg_len=116, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI|NLM_F_DUMP_FILTERED, nlmsg_seq=1715693113, nlmsg_pid=682500}, {rtm_family=AF_INET6, rtm_dst_len=64, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN, rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0}, [[{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN], [{nla_len=20, nla_type=RTA_DST}, inet_pton(AF_INET6, "fe80::")], [{nla_len=8, nla_type=RTA_PRIORITY}, 1024], [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("enp0s31f6")], [{nla_len=36, nla_type=RTA_CACHEINFO}, {rta_clntref=0, rta_lastuse=0, rta_expires=0, rta_error=0, rta_used=0, rta_id=0, rta_ts=0, rta_tsage=0}], [{nla_len=5, nla_type=RTA_PREF}, 0]]]
...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, and for some reason the priorities are different on my system (likely because one if on a VPN tun device), which is why before even shows them as different.

*/
if (route->rt_family == AF_INET6 && route->rt_nr_nh == 1) {
struct rtnl_nexthop *first;

first = nl_list_first_entry(&route->rt_nexthops,
struct rtnl_nexthop,
rtnh_list);

/*
* Only interface, no other values, so force
* nexthop comparison.
*/
if (!rtnl_route_nh_get_gateway(first) &&
!rtnl_route_nh_get_via(first) &&
!rtnl_route_nh_get_newdst(first))
rv |= ROUTE_ATTR_MULTIPATH;
}
}

return rv;
}

Expand Down
Loading