Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: use single SSH connection for lifetime of microvm #4955

Merged
merged 10 commits into from
Dec 12, 2024

Conversation

roypat
Copy link
Contributor

@roypat roypat commented Dec 11, 2024

Instead of creating new SSH connections every time we want to run a
command inside the microvm, open a single daemonized ssh connection in
the constructor of SSHConnection, and reuse it until we kill the
microvm.

Realize this using openssh's ControlMaster/ControlPersist functionality,
which allows us to persist the first connection opened to a microvm and
multiplex all subsequent commands over this one connection.

We need some slight changes in test_pause_restore.py and test_net.py. In
the former, since we no longer reconnect to the VM on every ssh command,
we will now observe a timeout instead of a connection failure. For the
latter, it turns out that @lru_cache treats .ssh_iface() and
.ssh_iface(0) as distinct invokations, despire the iface_idx argument of
ssh_iface defaulting to 0 (meaning they are actually the same function
call). This caused a re-intialization of the SSHConnection object, which
then triggered the assertion in _init_connection about the socket file
not existing.

Lastly, fix up some other small issues that were causing intermittent failures I observed while testing this.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

My python version does not accept the precense of double quotes inside a
double-quoted f-string. Use single quotes instead.

Signed-off-by: Patrick Roy <[email protected]>
If a command times out in utils.run_cmd, we kill the subprocess and try
to get whatever partial output it wrote so far to generate an error
message. We do this using `communicate`, which waits for stdout/stderr
to close.

Some processes pass their stdout and stderr to their children. In this
case, killing the parent won't close the stdout/stderr, and so the
proc.communicate() call in run_cmd will wait indefinitely (or until
whatever is holding on the the other end of the pipe exits). Avoid this
by explicitly closing our end of the stdout/stderr pipes.

Signed-off-by: Patrick Roy <[email protected]>
We asserted twice that the guest is not responsive anymore after
pausing, and zero times that it is responsive again after resuming. Fix
this by doing each once.

Signed-off-by: Patrick Roy <[email protected]>
@roypat roypat requested review from zulinx86 and pb8o and removed request for zulinx86 December 11, 2024 15:54
Copy link

codecov bot commented Dec 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.98%. Comparing base (979cf1b) to head (cb3acd8).
Report is 9 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4955   +/-   ##
=======================================
  Coverage   83.98%   83.98%           
=======================================
  Files         251      251           
  Lines       27889    27889           
=======================================
  Hits        23422    23422           
  Misses       4467     4467           
Flag Coverage Δ
5.10-c5n.metal 84.55% <ø> (+<0.01%) ⬆️
5.10-m5n.metal 84.53% <ø> (ø)
5.10-m6a.metal 83.82% <ø> (-0.01%) ⬇️
5.10-m6g.metal 80.68% <ø> (ø)
5.10-m6i.metal 84.52% <ø> (-0.01%) ⬇️
5.10-m7g.metal 80.68% <ø> (ø)
6.1-c5n.metal 84.55% <ø> (+<0.01%) ⬆️
6.1-m5n.metal 84.53% <ø> (+<0.01%) ⬆️
6.1-m6a.metal 83.82% <ø> (ø)
6.1-m6g.metal 80.67% <ø> (-0.01%) ⬇️
6.1-m6i.metal 84.52% <ø> (-0.01%) ⬇️
6.1-m7g.metal 80.68% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@roypat roypat added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Dec 11, 2024
It's a fairly tight timeout, and I was hitting it quite frequently while
running functional tests with high parallelism. I don't really see any
reason why we should be concerned about this specific command simply
hitting the pytest timeout when things go wrong, so let's just remove
the timeout.

Signed-off-by: Patrick Roy <[email protected]>
Opportunistically set a 100s timeout. We don't have anything that should
run this long, and by manually setting a timeout we avoid hitting the
pytest timeout of 300s in case something _does_ go wrong (and hitting
the pytest timeout gives significantly less debug information, so a
manual timeout is preferred).

Signed-off-by: Patrick Roy <[email protected]>
Remove the logging from this function, because SSHConnection._exec
already does this.

Signed-off-by: Patrick Roy <[email protected]>
Instead of creating new SSH connections every time we want to run a
command inside the microvm, open a single daemonized ssh connection in
the constructor of `SSHConnection`, and reuse it until we kill the
microvm.

Realize this using openssh's ControlMaster/ControlPersist functionality,
which allows us to persist the first connection opened to a microvm and
multiplex all subsequent commands over this one connection.

We need some slight changes in test_pause_restore.py and test_net.py. In
the former, since we no longer reconnect to the VM on every ssh command,
we will now observe a timeout instead of a connection failure. For the
latter, it turns out that @lru_cache treats .ssh_iface() and
.ssh_iface(0) as distinct invokations, despire the iface_idx argument of
ssh_iface defaulting to 0 (meaning they are actually the same function
call). This caused a re-intialization of the SSHConnection object, which
then triggered the assertion in _init_connection about the socket file
not existing.

Signed-off-by: Patrick Roy <[email protected]>
With the ControlMaster/ControlPersist approach, these are no needed
anymore. I'm not reverting the commit that added them because that also
contained an update of poetry.lock. I'm also not rebuilding the devctr
for this because the dependencies will simply be dropped the next time
we're rebuilding it.

Signed-off-by: Patrick Roy <[email protected]>
Building the A/B binaries happens in a tmpdir, which we try to delete at
the end. But we create files in the tmpdir from a privileged docker
container, so to delete this tmpdir we also need elevated privileges

Signed-off-by: Patrick Roy <[email protected]>
tests/framework/microvm.py Show resolved Hide resolved
tests/host_tools/network.py Show resolved Hide resolved
@roypat roypat merged commit 8c7ee82 into firecracker-microvm:main Dec 12, 2024
6 of 7 checks passed
Manciukic added a commit to Manciukic/firecracker that referenced this pull request Dec 16, 2024
In firecracker-microvm#4955 the executable to check the ssh connection liveliness was
changed from `true` to `/usr/bin/true`, but that is not its path in all
rootfs, causing failures in the `test-populat-containers` suite.

Also, since the error is retried but the control socket is not cleaned
up, subsequent retries would fail for the assertion.

This change fixes both issues by using the binary name `true` and
cleaning up the control socket on error before the next retry.

Fixes: 3b2c2d4 ("test: use single SSH connection for lifetime of microvm")
Signed-off-by: Riccardo Mancini <[email protected]>
Manciukic added a commit to Manciukic/firecracker that referenced this pull request Dec 16, 2024
In firecracker-microvm#4955 the executable to check the ssh connection liveliness was
changed from `true` to `/usr/bin/true`, but that is not its path in all
rootfs, causing failures in the `test-populat-containers` suite.

Also, since the error is retried but the daemon is not cleaned
up, subsequent retries would fail for the assertion.

This change fixes both issues by using the binary name `true` and
stopping the daemon on error before the next retry.

Fixes: 3b2c2d4 ("test: use single SSH connection for lifetime of microvm")
Signed-off-by: Riccardo Mancini <[email protected]>
Manciukic added a commit to Manciukic/firecracker that referenced this pull request Dec 16, 2024
In firecracker-microvm#4955 the executable to check the ssh connection liveliness was
changed from `true` to `/usr/bin/true`, but that is not its path in all
rootfs, causing failures in the `test-populat-containers` suite.

Also, since the error is retried but the daemon is not cleaned
up, subsequent retries would fail for the assertion.

This change fixes both issues by using the binary name `true` and
stopping the daemon on error before the next retry.

Fixes: 3b2c2d4 ("test: use single SSH connection for lifetime of microvm")
Signed-off-by: Riccardo Mancini <[email protected]>
roypat pushed a commit that referenced this pull request Dec 16, 2024
In #4955 the executable to check the ssh connection liveliness was
changed from `true` to `/usr/bin/true`, but that is not its path in all
rootfs, causing failures in the `test-populat-containers` suite.

Also, since the error is retried but the daemon is not cleaned
up, subsequent retries would fail for the assertion.

This change fixes both issues by using the binary name `true` and
stopping the daemon on error before the next retry.

Fixes: 3b2c2d4 ("test: use single SSH connection for lifetime of microvm")
Signed-off-by: Riccardo Mancini <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Awaiting review Indicates that a pull request is ready to be reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants