test: apply mitigations in Linux 6.1 to make boot times fast
Linux 6.1 brought a guest boot-time performance regression.

* Cause

There are two factors that cause this issue:

1. In the implementation of the mitigation for the iTLB multihit
vulnerability, KVM creates a worker thread called kvm-nx-lpage-recovery.
This thread is responsible for recovering huge pages that were split
when the mitigation kicks in. In the process of creating this thread,
KVM calls `cgroup_attach_task_all()` to move it to the same cgroup used
by the hypervisor thread.

2. In kernel v4.4, upstream converted a per-process cgroup read-write
semaphore into a per-cpu read-write semaphore so that operations can be
performed across multiple processes (commit
1ed1328792ff46e4bb86a3d7f7be2971f4549f6c). This conversion was found to
introduce high latency on write paths, which mainly consist of moving
tasks between cgroups. This was fixed in kernel v4.9 by
commit 3942a9bd7b5842a924e99ee6ec1350b8006c94ec, which chose to favor
writers over readers since moving tasks between cgroups is a common
operation on Android. However, in kernel v6.0, upstream reverted to
favoring readers over writers, reintroducing the original behavior of
the rw semaphore (commit
6a010a49b63ac8465851a79185d8deff966f8e1a). At the same time, this commit
provided an option called favordynmods to favor writers over readers.

Since the creation of the kvm-nx-lpage-recovery thread and its cgroup
change are done in the KVM_CREATE_VM call, the high latency we observe
in 6.1 is due to the upstream decision to favor readers over writers for
this per-cpu rw semaphore. The 4.14 and 5.10 kernels, by contrast, favor
writers over readers.
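
Because the extra latency lands inside the `KVM_CREATE_VM` call, one way
to observe it is to time that ioctl directly. The following is a minimal
sketch, not part of this commit: the helper names `time_call_us` and
`create_vm_latency_us` are hypothetical, and it assumes the caller can
open `/dev/kvm` read-write.

```python
import fcntl
import os
import time

# ioctl request number for KVM_CREATE_VM: _IO(KVMIO, 0x01) with KVMIO = 0xAE.
KVM_CREATE_VM = 0xAE01


def time_call_us(fn):
    """Run fn() and return (result, elapsed time in microseconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed_us = (time.perf_counter() - start) * 1_000_000
    return result, elapsed_us


def create_vm_latency_us(kvm_path="/dev/kvm"):
    """Time a single KVM_CREATE_VM ioctl (hypothetical helper)."""
    kvm_fd = os.open(kvm_path, os.O_RDWR | os.O_CLOEXEC)
    try:
        # Machine type 0 is the default VM type.
        vm_fd, elapsed_us = time_call_us(
            lambda: fcntl.ioctl(kvm_fd, KVM_CREATE_VM, 0)
        )
        os.close(vm_fd)  # the ioctl returns a new VM file descriptor
        return elapsed_us
    finally:
        os.close(kvm_fd)
```

Comparing `create_vm_latency_us()` before and after applying a
mitigation gives a rough view of the effect without booting a full
guest.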

* Solution

There are two solutions for this issue:

1. If the CPU is not vulnerable to the iTLB multihit vulnerability, the
best solution is to disable the mitigation with the newly added KVM
option `nx_huge_pages=never`. This avoids the situation entirely and may
also save some additional nanoseconds in `KVM_CREATE_VM` since no
threads will be created. Note that this is also the solution recommended
by KVM upstream ([here](https://lore.kernel.org/kvm/[email protected]/)).

2. If the CPU is vulnerable to iTLB multihit, then the mitigation can't
be disabled. In this case, we have to use the `favordynmods` option.
There are two cases:

  - AL2023 (cgroup v2): Just remount the cgroup mount point with:
    `sudo mount -oremount,favordynmods /sys/fs/cgroup`
    - **IMPORTANT**: The 6.1 kernel has an issue where `favordynmods`
      won't work when the cpuset cgroup is enabled. This is the case
      in our CI because we install Docker (which enables the cpuset
      cgroup by default). This is now fixed in 6.1.50.

  - AL2 (cgroup v1): cgroup v1 doesn't support changing mount flags
    during remount. Use a new option to enable favordynmods during
    boot (which works for cgroup v1 and v2).
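
The choice between the two solutions can be summarized as a small
decision function. This is only an illustrative sketch: the function
name and the returned descriptions are made up for this example, and the
boot option name `cgroup_favordynmods=true` is taken from the
documentation change in this commit.

```python
def choose_mitigation(itlb_multihit_status: str, cgroup_version: int) -> str:
    """Describe the mitigation suggested for a host (illustrative only).

    itlb_multihit_status is the content of
    /sys/devices/system/cpu/vulnerabilities/itlb_multihit.
    """
    if "Not affected" in itlb_multihit_status:
        # Solution 1: the CPU is not vulnerable, so disable the mitigation.
        return "reload kvm with nx_huge_pages=never"
    if cgroup_version == 2:
        # Solution 2, cgroup v2: mount flags can be changed on remount.
        return "remount /sys/fs/cgroup with favordynmods"
    if cgroup_version == 1:
        # Solution 2, cgroup v1: the option has to be set at boot time.
        return "boot with cgroup_favordynmods=true"
    raise ValueError(f"unknown cgroup version: {cgroup_version}")
```

For example, `choose_mitigation("Not affected\n", 2)` selects the
`nx_huge_pages=never` path regardless of the cgroup version.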

Signed-off-by: Pablo Barbáchano <[email protected]>
Co-authored-by: Luiz Capitulino <[email protected]>
pb8o and luizcap committed Oct 5, 2023
1 parent b3b3cab commit dbd7ad4
Showing 5 changed files with 104 additions and 12 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,10 @@

### Added

- [#3837](https://github.com/firecracker-microvm/firecracker/issues/3837): Added
official support for Linux 6.1. See
[prod-host-setup](./docs/prod-host-setup.md) for some security and performance
considerations.
- [#4045](https://github.com/firecracker-microvm/firecracker/pull/4045)
and [#4075](https://github.com/firecracker-microvm/firecracker/pull/4075):
Added `snapshot-editor` tool for modifications of snapshot files.
4 changes: 2 additions & 2 deletions README.md
@@ -132,9 +132,9 @@ We test all combinations of:

| Instance | Host OS & Kernel | Guest Rootfs | Guest Kernel |
| :--------- | :----------------- | :------------- | :------------- |
| m5d.metal | al2 linux_4.1 | ubuntu 18.04 | linux_4.14 |
| m5d.metal | al2 linux_4.1 | ubuntu 22.04 | linux_4.14 |
| m6i.metal | al2 linux_5.10 | | linux_5.10 |
| m6a.metal | | | |
| m6a.metal | al2023 linux_6.1 | | |
| m6g.metal | | | |
| c7g.metal | | | |

48 changes: 48 additions & 0 deletions docs/prod-host-setup.md
@@ -328,3 +328,51 @@ wget -O - https://meltdown.ovh | bash

[1]: https://elixir.free-electrons.com/linux/v4.14.203/source/virt/kvm/arm/hyp/timer-sr.c#L63
[2]: https://lists.cs.columbia.edu/pipermail/kvmarm/2017-January/023323.html

### Linux 6.1 boot time regressions

Linux 6.1 introduced some regressions[^1] in the time it takes to boot a VM on
the x86_64 architecture. They can be mitigated; which mitigation applies depends
on the CPU and the version of cgroups in use.

[^1]: See
explanation in commit 8d2745383b0e4e0196bab8492f796fbf2e402a98

#### If the host is vulnerable to iTLB multihit

To check if the host is vulnerable, look at the value of
`/sys/devices/system/cpu/vulnerabilities/itlb_multihit`. If it does not say `Not
affected`, the host is vulnerable.
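
The same check can be scripted. Below is a minimal sketch under the rule
stated above (anything other than `Not affected` counts as vulnerable);
the function names are hypothetical and not part of Firecracker.

```python
from pathlib import Path

ITLB_MULTIHIT = "/sys/devices/system/cpu/vulnerabilities/itlb_multihit"


def status_means_vulnerable(status: str) -> bool:
    """Return True unless the sysfs status reports 'Not affected'."""
    return "Not affected" not in status


def host_is_vulnerable(path: str = ITLB_MULTIHIT) -> bool:
    """Read the sysfs file and apply the rule above (hypothetical helper)."""
    return status_means_vulnerable(Path(path).read_text())
```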

The mitigation in this case is to enable `favordynmods` in cgroups v1 or cgroups v2.

For cgroups v2, run this command:

```sh
sudo mount -o remount,favordynmods /sys/fs/cgroup
```
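
To confirm that the remount took effect, the mount options can be
inspected. The sketch below assumes the option shows up as
`favordynmods` in the options column of `/proc/mounts`; the function
names are hypothetical.

```python
def cgroup_mounts_favoring_dynmods(mounts_text: str) -> list:
    """Return mount points of cgroup filesystems mounted with favordynmods.

    mounts_text uses the /proc/mounts layout:
    device, mount point, filesystem type, options, dump, pass.
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        if fs_type in ("cgroup", "cgroup2") and "favordynmods" in options.split(","):
            found.append(mount_point)
    return found


def favordynmods_enabled(mounts_path: str = "/proc/mounts") -> bool:
    """Check the live mount table (hypothetical helper)."""
    with open(mounts_path) as mounts:
        return bool(cgroup_mounts_favoring_dynmods(mounts.read()))
```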

For cgroups v1, remounting with `favordynmods` is not supported, so it has to be
done at boot time, through a kernel command line option[^2]. Add
`cgroup_favordynmods=true` to the kernel command line in your GRUB
configuration. Refer to your distribution documentation for where to make this
change.

[^2]: this command line option is still unreleased at the moment of writing, but
will be part of 6.7 and may be backported to 6.1:
https://lore.kernel.org/lkml/[email protected]/

#### If the host is not vulnerable to iTLB multihit

The mitigation in this case is to reload the KVM modules with
`nx_huge_pages=never`. It is preferred over the cgroups mitigation since it is
less invasive (it doesn't affect other cgroups), but the two can also be
combined.

```sh
KVM_VENDOR_MOD=$(lsmod |grep -P "^kvm_(amd|intel)" | awk '{print $1}')
sudo modprobe -r $KVM_VENDOR_MOD kvm
sudo modprobe kvm nx_huge_pages=never
sudo modprobe $KVM_VENDOR_MOD
```

To validate that the change took effect, the file
`/sys/module/kvm/parameters/nx_huge_pages` should say `never`.
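
This validation can also be automated, for instance in a host setup
check. A minimal sketch with hypothetical function names:

```python
from pathlib import Path

NX_HUGE_PAGES = "/sys/module/kvm/parameters/nx_huge_pages"


def nx_huge_pages_setting(path: str = NX_HUGE_PAGES) -> str:
    """Return the current value of the kvm nx_huge_pages parameter."""
    return Path(path).read_text().strip()


def mitigation_applied(setting: str) -> bool:
    """True when the parameter reads 'never', as described above."""
    return setting.strip() == "never"
```

Calling `mitigation_applied(nx_huge_pages_setting())` after reloading
the modules should return `True` if the change took effect.
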
10 changes: 0 additions & 10 deletions tests/integration_tests/performance/test_boottime.py
@@ -56,11 +56,6 @@ def test_no_boottime(test_microvm_with_api):
    assert not timestamps


# temporarily disable this test in 6.1
@pytest.mark.xfail(
    global_props.host_linux_version == "6.1",
    reason="perf regression under investigation",
)
@pytest.mark.skipif(
    global_props.cpu_codename == "INTEL_SKYLAKE"
    and global_props.host_linux_version == "5.10",
@@ -84,11 +79,6 @@ def test_boottime_no_network(fast_microvm, record_property, metrics):
    ), f"boot time {boottime_us} cannot be greater than: {MAX_BOOT_TIME_US} us"


# temporarily disable this test in 6.1
@pytest.mark.xfail(
    global_props.host_linux_version == "6.1",
    reason="perf regression under investigation",
)
@pytest.mark.skipif(
    global_props.cpu_codename == "INTEL_SKYLAKE"
    and global_props.host_linux_version == "5.10",
50 changes: 50 additions & 0 deletions tools/devtool
@@ -531,6 +531,53 @@ ensure_ci_artifacts() {
fi
}

apply_linux_61_tweaks() {
    KV=$(uname -r)
    if [[ $KV != 6.1.* ]] || [ $(uname -m) != x86_64 ]; then
        return
    fi
    say "Applying Linux 6.1 boot-time regression mitigations"

    KVM_VENDOR_MOD=$(lsmod |grep -P "^kvm_(amd|intel)" | awk '{print $1}')
    ITLB_MULTIHIT=/sys/devices/system/cpu/vulnerabilities/itlb_multihit
    NX_HUGEPAGES=/sys/module/kvm/parameters/nx_huge_pages

    # If m6a/m6i
    if grep -q "Not affected" $ITLB_MULTIHIT; then
        echo -e "CPU not vulnerable to iTLB multihit, using kvm.nx_huge_pages=never mitigation"
        # we need a lock so another process is not running the same thing and to
        # avoid race conditions.
        lockfile="/tmp/.linux61_tweaks.lock"
        set -C # noclobber
        while true; do
            if echo "$$" > "$lockfile"; then
                echo "Successfully acquired lock"
                if ! grep -q "never" $NX_HUGEPAGES; then
                    echo "Reloading KVM modules with nx_huge_pages=never"
                    sudo modprobe -r $KVM_VENDOR_MOD kvm
                    sudo modprobe kvm nx_huge_pages=never
                    sudo modprobe $KVM_VENDOR_MOD
                fi
                rm "$lockfile"
                break
            else
                sleep 5s
            fi
        done
        tail -v $ITLB_MULTIHIT $NX_HUGEPAGES
    # else (m5d Skylake and CascadeLake)
    else
        echo "CPU vulnerable to iTLB_multihit, checking if favordynmods is enabled"
        mount |grep cgroup |grep -q favordynmods
        if [ $? -ne 0 ]; then
            say_warn "cgroups' favordynmods option not enabled; VM creation performance may be impacted"
        else
            echo "favordynmods is enabled"
        fi
    fi
}


# `$0 test` - run integration tests
# Please see `$0 help` for more information.
#
@@ -565,9 +612,12 @@ cmd_test() {
    ensure_build_dir
    ensure_ci_artifacts

    apply_linux_61_tweaks

    # If we got to here, we've got all we need to continue.
    say "Kernel version: $(uname -r)"
    say "$(sed '/^processor.*: 0$/,/^processor.*: 1$/!d; /^processor.*: 1$/d' /proc/cpuinfo)"
    say "RPM microcode_ctl version: $(rpm -q microcode_ctl)"
    say "Starting test run ..."

    # Testing (running Firecracker via the jailer) needs root access,