-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test: use single SSH connection for lifetime of microvm #4955
Merged
+112
−431
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
My python version does not accept the precense of double quotes inside a double-quoted f-string. Use single quotes instead. Signed-off-by: Patrick Roy <[email protected]>
If a command times out in utils.run_cmd, we kill the subprocess and try to get whatever partial output it wrote so far to generate an error message. We do this using `communicate`, which waits for stdout/stderr to close. Some processes pass their stdout and stderr to their children. In this case, killing the parent won't close the stdout/stderr, and so the proc.communicate() call in run_cmd will wait indefinitely (or until whatever is holding on the the other end of the pipe exits). Avoid this by explicitly closing our end of the stdout/stderr pipes. Signed-off-by: Patrick Roy <[email protected]>
We asserted twice that the guest is not responsive anymore after pausing, and zero times that it is responsive again after resuming. Fix this by doing each once. Signed-off-by: Patrick Roy <[email protected]>
roypat
requested review from
zulinx86 and
pb8o
and removed request for
zulinx86
December 11, 2024 15:54
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #4955 +/- ##
=======================================
Coverage 83.98% 83.98%
=======================================
Files 251 251
Lines 27889 27889
=======================================
Hits 23422 23422
Misses 4467 4467
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
roypat
added
the
Status: Awaiting review
Indicates that a pull request is ready to be reviewed
label
Dec 11, 2024
ShadowCurse
reviewed
Dec 11, 2024
It's a fairly tight timeout, and I was hitting it quite frequently while running functional tests with high parallelism. I don't really see any reason why we should be concerned about this specific command simply hitting the pytest timeout when things go wrong, so let's just remove the timeout. Signed-off-by: Patrick Roy <[email protected]>
Opportunistically set a 100s timeout. We don't have anything that should run this long, and by manually setting a timeout we avoid hitting the pytest timeout of 300s in case something _does_ go wrong (and hitting the pytest timeout gives significantly less debug information, so a manual timeout is preferred). Signed-off-by: Patrick Roy <[email protected]>
Remove the logging from this function, because SSHConnection._exec already does this. Signed-off-by: Patrick Roy <[email protected]>
Instead of creating new SSH connections every time we want to run a command inside the microvm, open a single daemonized ssh connection in the constructor of `SSHConnection`, and reuse it until we kill the microvm. Realize this using openssh's ControlMaster/ControlPersist functionality, which allows us to persist the first connection opened to a microvm and multiplex all subsequent commands over this one connection. We need some slight changes in test_pause_restore.py and test_net.py. In the former, since we no longer reconnect to the VM on every ssh command, we will now observe a timeout instead of a connection failure. For the latter, it turns out that @lru_cache treats .ssh_iface() and .ssh_iface(0) as distinct invokations, despire the iface_idx argument of ssh_iface defaulting to 0 (meaning they are actually the same function call). This caused a re-intialization of the SSHConnection object, which then triggered the assertion in _init_connection about the socket file not existing. Signed-off-by: Patrick Roy <[email protected]>
With the ControlMaster/ControlPersist approach, these are no needed anymore. I'm not reverting the commit that added them because that also contained an update of poetry.lock. I'm also not rebuilding the devctr for this because the dependencies will simply be dropped the next time we're rebuilding it. Signed-off-by: Patrick Roy <[email protected]>
Building the A/B binaries happens in a tmpdir, which we try to delete at the end. But we create files in the tmpdir from a privileged docker container, so to delete this tmpdir we also need elevated privileges Signed-off-by: Patrick Roy <[email protected]>
pb8o
approved these changes
Dec 11, 2024
zulinx86
approved these changes
Dec 12, 2024
Manciukic
added a commit
to Manciukic/firecracker
that referenced
this pull request
Dec 16, 2024
In firecracker-microvm#4955 the executable to check the ssh connection liveliness was changed from `true` to `/usr/bin/true`, but that is not its path in all rootfs, causing failures in the `test-populat-containers` suite. Also, since the error is retried but the control socket is not cleaned up, subsequent retries would fail for the assertion. This change fixes both issues by using the binary name `true` and cleaning up the control socket on error before the next retry. Fixes: 3b2c2d4 ("test: use single SSH connection for lifetime of microvm") Signed-off-by: Riccardo Mancini <[email protected]>
4 tasks
Manciukic
added a commit
to Manciukic/firecracker
that referenced
this pull request
Dec 16, 2024
In firecracker-microvm#4955 the executable to check the ssh connection liveliness was changed from `true` to `/usr/bin/true`, but that is not its path in all rootfs, causing failures in the `test-populat-containers` suite. Also, since the error is retried but the daemon is not cleaned up, subsequent retries would fail for the assertion. This change fixes both issues by using the binary name `true` and stopping the daemon on error before the next retry. Fixes: 3b2c2d4 ("test: use single SSH connection for lifetime of microvm") Signed-off-by: Riccardo Mancini <[email protected]>
Manciukic
added a commit
to Manciukic/firecracker
that referenced
this pull request
Dec 16, 2024
In firecracker-microvm#4955 the executable to check the ssh connection liveliness was changed from `true` to `/usr/bin/true`, but that is not its path in all rootfs, causing failures in the `test-populat-containers` suite. Also, since the error is retried but the daemon is not cleaned up, subsequent retries would fail for the assertion. This change fixes both issues by using the binary name `true` and stopping the daemon on error before the next retry. Fixes: 3b2c2d4 ("test: use single SSH connection for lifetime of microvm") Signed-off-by: Riccardo Mancini <[email protected]>
roypat
pushed a commit
that referenced
this pull request
Dec 16, 2024
In #4955 the executable to check the ssh connection liveliness was changed from `true` to `/usr/bin/true`, but that is not its path in all rootfs, causing failures in the `test-populat-containers` suite. Also, since the error is retried but the daemon is not cleaned up, subsequent retries would fail for the assertion. This change fixes both issues by using the binary name `true` and stopping the daemon on error before the next retry. Fixes: 3b2c2d4 ("test: use single SSH connection for lifetime of microvm") Signed-off-by: Riccardo Mancini <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Instead of creating new SSH connections every time we want to run a
command inside the microvm, open a single daemonized ssh connection in
the constructor of
SSHConnection
, and reuse it until we kill themicrovm.
Realize this using openssh's ControlMaster/ControlPersist functionality,
which allows us to persist the first connection opened to a microvm and
multiplex all subsequent commands over this one connection.
We need some slight changes in test_pause_restore.py and test_net.py. In
the former, since we no longer reconnect to the VM on every ssh command,
we will now observe a timeout instead of a connection failure. For the
latter, it turns out that @lru_cache treats .ssh_iface() and
.ssh_iface(0) as distinct invokations, despire the iface_idx argument of
ssh_iface defaulting to 0 (meaning they are actually the same function
call). This caused a re-intialization of the SSHConnection object, which
then triggered the assertion in _init_connection about the socket file
not existing.
Lastly, fix up some other small issues that were causing intermittent failures I observed while testing this.
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md
.PR Checklist
tools/devtool checkstyle
to verify that the PR passes theautomated style checks.
how they are solving the problem in a clear and encompassing way.
in the PR.
CHANGELOG.md
.Runbook for Firecracker API changes.
integration tests.
TODO
.rust-vmm
.