Skip to content

Commit

Permalink
Revert "coreos-teardown-initramfs: use NetworkManager to teardown net…
Browse files Browse the repository at this point in the history
…working"

This reverts commit b8607da.

We're seeing around 75% failure rate with this patch when provisioning
RHCOS on AWS with m4 instance types. The error message is:

    Error: failed to set networking: Timeout was reached

I confirmed that D-Bus and NM are definitely still running when we issue
the `nmcli` call. Still, I think there's a subtle race somewhere (likely
combined with the specific network adapter of m4 instance types), that
we need to plug. We need more visibility into why NM isn't responding.
Anyway, for now let's revert this to not regress.

Fixes: https://issues.redhat.com/browse/OCPBUGS-17154
Probably fixes: https://issues.redhat.com/browse/OCPBUGS-19870
  • Loading branch information
jlebon committed Oct 5, 2023
1 parent 7bde417 commit 38bdb42
Show file tree
Hide file tree
Showing 2 changed files with 56 additions and 19 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -18,13 +18,13 @@ DefaultDependencies=false
Before=ignition-mount.service
Before=ignition-complete.target

# Since we are tearing down networking using NetworkManager itself
# we need to make sure NetworkManager is running. Similarly
# dbus-broker since nmcli talks to the daemon via DBUS. Since the
# ordering for ExecStop is the opposite of ExecStart we need to use
# `After=` here.
After=nm-initrd.service
After=dbus-broker.service
# Since we are tearing down networking we need to make sure
# NetworkManager has been stopped, otherwise it'll be trying
# to react to our delete/down operations. Since the ordering
# for ExecStop is the opposite of ExecStart we need to use
# `Before=nm-initrd.service`.
# https://issues.redhat.com/browse/OCPBUGS-11052
Before=nm-initrd.service

# Make sure ExecStop= runs before we switch root
Conflicts=initrd-switch-root.target umount.target
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -182,15 +182,55 @@ propagate_ifname_udev_rules() {
fi
}

down_interface() {
echo "info: taking down network device: $1"
# On recommendation from the NM team let's try to delete the device
# first and if that doesn't work then set it to down and flush any
# associated addresses. Deleting virtual devices (bonds, teams, bridges,
# ip-tunnels, etc) will clean up any associated kernel resources. A real
# device can't be deleted so that will fail and we'll fallback to setting
# it down and flushing addresses.
if ! ip link delete $1; then
ip link set $1 down
ip addr flush dev $1
fi
}

# Iterate through the interfaces in the machine and take them down.
# Note that in the future we would like to possibly use `nmcli` networking off`
# for this. See the following two comments for details:
# https://github.com/coreos/fedora-coreos-tracker/issues/394#issuecomment-599721763
# https://github.com/coreos/fedora-coreos-tracker/issues/394#issuecomment-599746049
down_interfaces() {
if ! [ -z "$(ls /sys/class/net)" ]; then
for f in /sys/class/net/*; do
interface=$(basename "$f")
# The `bonding_masters` entry is not a true interface and thus
# cannot be taken down. Also skip local loopback
case "$interface" in
"lo" | "bonding_masters")
continue
;;
esac
# When we start taking down devices some other devices can
# start to disappear (for example vlan on top of interface).
# If the device we're about to take down has disappeared
# since the start of this loop then skip taking it down.
if [ ! -e "$f" ]; then
echo "info: skipping teardown of ${interface}; no longer exists."
continue
fi
down_interface $interface
done
fi
}

main() {
# Load libraries from dracut
load_dracut_libs

# Take down all networking set up in the initramfs
if systemctl is-active --quiet nm-initrd.service; then
echo "info: taking down initramfs networking"
nmcli networking off
fi
# Take down all interfaces set up in the initramfs
down_interfaces

# Clean up all routing
echo "info: flushing all routing"
Expand All @@ -209,13 +249,10 @@ main() {
propagate_ifname_udev_rules
fi

# Configuration has been propagated, but we can't clean up
# /run/NetworkManager because NM is still running. Let's drop
# down a tmpfiles.d snippet so that it's cleaned up first thing
# in the real root. Doing it this way prevents us having to write
# another unit to do the cleanup after this service has finished.
echo "R! /run/NetworkManager - - - - -" > \
/run/tmpfiles.d/15-teardown-initramfs-networkmanager.conf
# Now that the configuration has been propagated (or not)
# clean it up so that no information from outside of the
# real root is passed on to NetworkManager in the real root
rm -rf /run/NetworkManager/

rm -f /run/udev/rules.d/80-coreos-boot-disk.rules
rm -f /dev/disk/by-id/coreos-boot-disk
Expand Down

0 comments on commit 38bdb42

Please sign in to comment.