Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kpatch/kpatch: Optimize patch transition timing with a while loop #1408

Closed
wants to merge 1 commit into from

Conversation

oceansue
Copy link

Optimize the timing of patch transitions by using a while loop to check the transition status, replacing the for loop with a 1-second sleep. This approach reduces the likelihood of timeouts due to system load, potentially speeding up patch transitions.

@oceansue
Copy link
Author

issues: #1409

@oceansue
Copy link
Author

@joe-lawrence @jpoimboe PTAL. Thanks.
Fix SC2004 warnings in kpatch script.

@joe-lawrence
Copy link
Contributor

Hi @oceansue , a few questions and comments about the PR:

  • It removes the 1 second sleep from wait_for_patch_transition's loops around sysfs. As long as the transition isn't moving back and forth, I guess it might return the result slightly sooner (sub-second) at the expense of hammering on sysfs. Is that really a net win here? Or ...
  • I don't understand how the recoding of the for (( i=0; i<POST_ENABLE_WAIT; i++ )); do loop iteration is supposed to work. Is it intended to lengthen the polling time? I could see that helping to detect a "long" transition instance.
  • Code: I'm pretty sure that SECONDS isn't defined in the kpatch script. Missing code hunks?
  • Minor nit about shellcheck: if the commit introduces new code, just squash it into the introducing commit (no need for a separate commit). If the commit is correcting existing code, it may make sense to spin out to its own commit.

Optimize the timing of patch transitions by using a `while` loop to
check the transition status, replacing the `for` loop with a 1-second
sleep. This approach reduces the likelihood of timeouts due to system
load, potentially speeding up patch transitions.

Signed-off-by: oceansue <[email protected]>
Signed-off-by: Kernel Group <[email protected]>
@oceansue oceansue force-pushed the kpatch-check-transition branch from 9453dba to 9dde564 Compare August 28, 2024 03:38
@oceansue
Copy link
Author

Hi @joe-lawrence ,Thank you for your response:

  • The reason for removing the 1 second sleep was mentioned in the issues: kpatch load/unload patch transition has stalled #1409
  • In Bash, SECONDS is a built-in special variable that tracks the number of seconds that have elapsed since the shell started.This variable increments automatically. So, your understanding of the for (( i=0; i<POST_ENABLE_WAIT; i++ )); do is correct. Changing for (( i=0; i<POST_ENABLE_WAIT; i++ )); do to while [ $i -lt $POST_SIGNAL_WAIT ]; do will not compromise functionality. It will exit the while loop only after the timeout is reached(after 15 seconds).
  • I have already squashed the "Fix SC2004 warnings in kpatch script" commit into the previous commit.

I‘ve been considering whether there might be a better solution for this issue (#1409) on my system. But so far, This patch seems to be working well. However, It does result in a short-term increase in CPU load.

@joe-lawrence
Copy link
Contributor

Hi @oceansue :

  • Thanks for the explanation of the bash SECONDS variable -- that feature is new to me and looks like it could be handy to use!
  • Now I understand now why the patch changed the loop construct as it needs to iterate many times and not just from 1 to N in whole number increments.
  • That said, I still don't understand how the patch helps the case of a long patch transition (as noted in kpatch load/unload patch transition has stalled #1409, "the module transition time is over 1500 seconds"). As I understand the new code, both POST_ENABLE_WAIT and POST_SIGNAL_WAIT remain fixed at 15 and 60 seconds respectively. Only the polling frequency has increased (to as fast as possible). In that case, what is the difference if the code polls (and fails) once every 14 out of 15 seconds, or polls a million times for 14 seconds and then succeeds on the 15th second?

Given the long transition time reported in #1409, I might have expected a change like:

diff --git a/kpatch/kpatch b/kpatch/kpatch
index edfccfead917..0ca6aaf78fd1 100755
--- a/kpatch/kpatch
+++ b/kpatch/kpatch
@@ -26,9 +26,9 @@
 INSTALLDIR=/var/lib/kpatch
 SCRIPTDIR="$(readlink -f "$(dirname "$(type -p "$0")")")"
 VERSION="0.9.9"
-POST_ENABLE_WAIT=15    # seconds
-POST_SIGNAL_WAIT=60    # seconds
-MODULE_REF_WAIT=15     # seconds
+POST_ENABLE_WAIT="${POST_ENABLE_WAIT:-15}"     # seconds
+POST_SIGNAL_WAIT="${POST_SIGNAL_WAIT:-60}"     # seconds
+MODULE_REF_WAIT="${MODULE_REF_WAIT:-15}"       # seconds
 
 # How many times to try loading the patch if activeness safety check fails.
 MAX_LOAD_ATTEMPTS=5

which would let the user set POST_ENABLE_WAIT, POST_SIGNAL_WAIT, MODULE_REF_WAIT environment variables without the script overriding them (ie, $ POST_ENABLE_WAIT=1500 ./kpatch/kpatch).

@oceansue oceansue changed the title kpatch/kpatch: Optimize patch transition timing with a while loop WIP:kpatch/kpatch: Optimize patch transition timing with a while loop Sep 4, 2024
@oceansue oceansue changed the title WIP:kpatch/kpatch: Optimize patch transition timing with a while loop kpatch/kpatch: Optimize patch transition timing with a while loop Sep 27, 2024
@oceansue
Copy link
Author

oceansue commented Sep 28, 2024

I've been trying to fix this recently, but unfortunately, I haven't made any progress. It seems that this commit is the only one that can solve the problem I'm encountering.
I want to find a way to minimize the modification of code functionality and meet my needs.

This is the patch I'm considering.

diff --git a/kpatch/kpatch b/kpatch/kpatch
index edfccfe..815dacb 100755
--- a/kpatch/kpatch
+++ b/kpatch/kpatch
@@ -245,29 +245,32 @@ signal_stalled_processes() {

 wait_for_patch_transition() {
        local module="$1"
-       local i
+       local i=0

        in_transition "$module" || return 0

        echo "waiting (up to $POST_ENABLE_WAIT seconds) for patch transition to complete..."
-       for (( i=0; i<POST_ENABLE_WAIT; i++ )); do
+       start_time=$SECONDS
+       while [ $i -lt $POST_ENABLE_WAIT ]; do
                if ! in_transition "$module" ; then
                        echo "transition complete ($i seconds)"
+                      sleep 1s
                        return 0
                fi
-               sleep 1s
+               i=$(( SECONDS - start_time ))
        done

        echo "patch transition has stalled!"
        signal_stalled_processes

        echo "waiting (up to $POST_SIGNAL_WAIT seconds) for patch transition to complete..."
-       for (( i=0; i<POST_SIGNAL_WAIT; i++ )); do
+       start_time=$SECONDS
+       while [ $i -lt $POST_SIGNAL_WAIT ]; do
                if ! in_transition "$module" ; then
                        echo "transition complete ($i seconds)"
                        return 0
                fi
-               sleep 1s
+               i=$(( SECONDS - start_time ))
        done

        return 1

@joe-lawrence
Copy link
Contributor

@oceansue : with the minimal patch in place, what is reported by "transition complete ($i seconds)" ? Also, can you attach/inline an example kpatch that requires the modification? Thanks.

@oceansue
Copy link
Author

oceansue commented Oct 1, 2024

Test environment:
kernel: 4.19.180
The cpu is the company's own 32 cores

use #1408 (comment) patch:

sudo ./kpatch load ../kpatch-build/livepatch-cmdline.ko 
loading patch module: ../kpatch-build/livepatch-cmdline.ko
waiting (up to 15 seconds) for patch transition to complete...
patch transition has stalled!
signaling stalled process(es):
waiting (up to 60 seconds) for patch transition to complete...
transition complete (11 seconds)
ocean@ocean-PC:~/kpatch/kpatch$ sudo ./kpatch unload ../kpatch-build/livepatch-cmdline.ko 
disabling patch module: livepatch_cmdline
waiting (up to 15 seconds) for patch transition to complete...
patch transition has stalled!
signaling stalled process(es):
waiting (up to 60 seconds) for patch transition to complete...
transition complete (8 seconds)
unloading patch module: livepatch_cmdline

revert #1408 (comment) patch:

ocean@ocean-PC:~/kpatch/kpatch$ sudo ./kpatch load ../kpatch-build/livepatch-cmdline.ko 
module already loaded, re-enabling
waiting (up to 15 seconds) for patch transition to complete...
patch transition has stalled!
signaling stalled process(es):
waiting (up to 60 seconds) for patch transition to complete...

Stalled processes:
module livepatch_cmdline did not complete its transition, disabling...
waiting (up to 15 seconds) for patch transition to complete...
transition complete (2 seconds)
kpatch: error: failed to re-enable module livepatch_cmdline (transition stalled), patch disabled

For other CPUs, if the transition can be completed in 15 seconds, the patch will have no effect. If it cannot be completed in more than 15 seconds, the program will enter a 60 second transition period during which the program will frequently access the sys interface to speed up the transition.

Also, I found that the bug seems to be caused by multiple cores, and when I limit the number of cores to less than 8, the transformation can be done successfully. When the number of cores is more than 8, the transformation doesn't go so well and requires this patch to fix it.I was unable to find a multi-core (>32) processor with x86 architecture to verify.

@joe-lawrence
Copy link
Contributor

@jpoimboe - can you have a look at this report/patch, I think I'm just missing something on how/why it works.

@jpoimboe
Copy link
Member

For other CPUs, if the transition can be completed in 15 seconds, the patch will have no effect.

Hm? Look closer, the patch actually removes the 1s sleep even for the first 15 seconds. Hammering the CPU with sysfs reads is not going to be an acceptable solution.

Is this with the standard cmdline-string.patch? If so, I suspect something is going wrong on the kernel side, possibly livepatch itself. Maybe the unwinder is encountering errors. Kernel 4.19 is ancient, who knows what it could be...

@oceansue
Copy link
Author

@joe-lawrence @jpoimboe Thanks for the replies, the problem has been solved, according to the new commit from kernel.org, after applying it to the 4.19.0 kernel. Close the issue. Thanks again.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel/livepatch/transition.c?h=v6.12-rc2&id=5de62ea84abd732ded7c5569426fd71c0420f83e

@oceansue oceansue closed this Oct 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants