Use balloon statistics in the test that checks balloon deflates on OOM #4150

Merged
merged 4 commits into from
Oct 6, 2023

Conversation

bchalios
Contributor

@bchalios bchalios commented Oct 5, 2023

Reason

Balloon devices can start deflating when the guest runs out of memory (OOM). We have a test that ensures this functionality works as expected. The test creates a microVM with a balloon device enabled, inflates the balloon, and then starts a process in the microVM that exhausts the remaining guest memory. The expectation is that the OOM killer will kick in and reap that process. To succeed, the test relies on observing that the memory-filling process gets killed.

However, in low-memory situations we have no real control over which process the OOM killer decides to kill. This makes the test fail intermittently.

Changes

This PR changes the test to look at balloon statistics instead. Conceptually this makes sense: we don't want to test the OOM killer, we want to ensure that the balloon device gives memory back to the VM in low-memory situations. The balloon statistics give us this information.
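As a rough illustration of the idea (not the code from this PR), the check can be phrased as polling the balloon statistics until the reported balloon size drops below the inflated size. The sketch assumes the statistics arrive as a dict with an `actual_mib` field; `wait_for_deflation` and the `get_stats` callable are hypothetical names introduced here.

```python
import time


def wait_for_deflation(get_stats, inflated_mib, timeout_s=30.0, poll_s=0.5):
    """Return True once the balloon reports fewer MiB than `inflated_mib`.

    `get_stats` is a hypothetical callable returning a dict of balloon
    statistics; the `actual_mib` key is an assumption of this sketch.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # A smaller balloon means memory was handed back to the guest.
        if get_stats()["actual_mib"] < inflated_mib:
            return True
        time.sleep(poll_s)
    # The balloon never shrank within the allotted time.
    return False
```

With deflate-on-OOM enabled, a test built this way would expect `True`; with the option disabled, it would expect `False`, regardless of which process the OOM killer picks.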

With this change, the test always passes when we configure the balloon device to deflate when the guest is in OOM conditions. However, the test is still flaky when we run it with the deflate-on-OOM option disabled. The reason is that the SSH command we run to spawn the guest process that drains memory sometimes hangs, probably because the OOM killer chooses the process that handles the SSH connection. So, the PR also adds a timeout option to the function we use to run commands over SSH inside the guest.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following
Developer Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • If a specific issue led to this PR, this PR closes the issue.
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this PR.
  • API changes follow the Runbook for Firecracker API changes.
  • User-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.
  • New TODOs link to an issue.
  • Commits meet contribution quality standards.

  • This functionality cannot be added in rust-vmm.

@codecov

codecov bot commented Oct 5, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (63395be) 83.10% compared to head (57ed015) 83.10%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4150   +/-   ##
=======================================
  Coverage   83.10%   83.10%           
=======================================
  Files         225      225           
  Lines       28605    28605           
=======================================
  Hits        23772    23772           
  Misses       4833     4833           
Flag Coverage Δ
4.14-c7g.metal 78.67% <ø> (+<0.01%) ⬆️
4.14-m5d.metal 80.47% <ø> (ø)
4.14-m6a.metal 79.61% <ø> (ø)
4.14-m6g.metal 78.67% <ø> (ø)
4.14-m6i.metal 80.45% <ø> (+<0.01%) ⬆️
5.10-c7g.metal 81.57% <ø> (ø)
5.10-m5d.metal 83.13% <ø> (ø)
5.10-m6a.metal 82.37% <ø> (ø)
5.10-m6g.metal 81.57% <ø> (ø)
5.10-m6i.metal 83.11% <ø> (+<0.01%) ⬆️
6.1-c7g.metal 81.57% <ø> (ø)
6.1-m5d.metal 83.13% <ø> (ø)
6.1-m6a.metal 82.37% <ø> (ø)
6.1-m6g.metal 81.57% <ø> (ø)
6.1-m6i.metal 83.11% <ø> (+<0.01%) ⬆️


@bchalios bchalios self-assigned this Oct 5, 2023
@bchalios bchalios added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Oct 5, 2023
@bchalios bchalios changed the title Use ballon statistics in the test that checks balloon deflates on OOM Use balloon statistics in the test that checks balloon deflates on OOM Oct 5, 2023
zulinx86
zulinx86 previously approved these changes Oct 5, 2023
@bchalios bchalios marked this pull request as draft October 5, 2023 11:11
@bchalios bchalios force-pushed the fix_test_balloon branch 5 times, most recently from 9568a20 to 893e2e1 Compare October 5, 2023 15:31
@bchalios bchalios marked this pull request as ready for review October 5, 2023 15:31
@bchalios bchalios requested a review from zulinx86 October 5, 2023 15:49
zulinx86
zulinx86 previously approved these changes Oct 5, 2023
roypat
roypat previously approved these changes Oct 6, 2023
pb8o
pb8o previously approved these changes Oct 6, 2023
Contributor

@pb8o pb8o left a comment

Looks great, just a couple minor comments.

tests/integration_tests/functional/test_balloon.py (outdated)
tests/framework/utils.py (outdated)
Balloon devices can start deflating when the guest runs out of memory
(OOM). We have a test that ensures this functionality works as
expected. The test creates a microVM with a balloon device enabled,
inflates the balloon, and then starts a process in the microVM that
exhausts the remaining guest memory. The expectation is that the OOM
killer will kick in and reap that process. To succeed, the test relies
on observing that the memory-filling process gets killed.

However, in low-memory situations we have no real control over which
process the OOM killer decides to kill. This makes the test fail
intermittently.

This commit changes the test to look at balloon statistics instead.
Conceptually this makes sense: we don't want to test the OOM killer,
we want to ensure that the balloon device gives memory back to the VM
in low-memory situations. The balloon statistics give us this
information.

Signed-off-by: Babis Chalios <[email protected]>
In test_balloon.py we look up the PID of the SSH daemon so that we can
later change its OOM score and keep it from being killed when we
(deliberately) exhaust the available memory in a microVM. We use
"pidof sshd" to do that. However, pidof returns the PIDs of all sshd
processes, and sshd spawns a new one for every new SSH connection. By
the time we iterate over these PIDs to change their OOM score, some of
them may already be gone, so choom fails for those. This is not really
a problem, but it sometimes leads to misleading error messages.

This commit drops the use of "pidof" in favour of reading the daemon's
PID from "/run/sshd.pid".

Signed-off-by: Babis Chalios <[email protected]>
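
A minimal sketch of the pidfile approach, for illustration only. The helper names `read_daemon_pid` and `protect_from_oom_cmd` are hypothetical, and the score adjustment of -1000 (which exempts a process from the OOM killer) is an assumption rather than the value the test necessarily uses. In the real test the resulting command would run inside the guest over SSH.

```python
from pathlib import Path


def read_daemon_pid(pid_file="/run/sshd.pid"):
    """Return the single PID recorded in `pid_file` as an int."""
    return int(Path(pid_file).read_text().strip())


def protect_from_oom_cmd(pid, score_adj=-1000):
    """Build a shell command that adjusts the OOM score of `pid`.

    Uses util-linux `choom`; the default of -1000 is an assumption of
    this sketch, not a value taken from the PR.
    """
    return f"choom -n {score_adj} -p {pid}"
```

Because the pidfile holds exactly one PID, there is no list of short-lived per-connection PIDs to iterate over, which is what produced the misleading choom errors.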
In test_balloon.py::test_deflate_on_oom we exhaust the memory of the
microVM to trigger the OOM killer. This commit removes the SSH
commands that ran after launching the memory hog inside the microVM,
to avoid hung connections caused by the OOM killer killing sshd.

Signed-off-by: Babis Chalios <[email protected]>
We have a mechanism that allows us to run a command inside a microVM.
It ends up using Popen.communicate to retrieve the output of the SSH
command. Popen.communicate accepts a timeout argument that we were not
using. It is useful in cases where we don't want to wait indefinitely
for the result of the command we execute in the microVM.

This commit extends our SSH mechanism to accept the timeout argument
and uses it in test_balloon.py::test_deflate_on_oom when launching the
fillmem process. fillmem drains the memory of the VM, which sometimes
results in the SSH connection hanging.

Signed-off-by: Babis Chalios <[email protected]>
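
For readers unfamiliar with the timeout behaviour of `Popen.communicate`, here is a small standalone sketch. It is not the framework's SSH helper: `run_with_timeout` is a hypothetical name, and it runs a local command rather than an SSH session. It shows the usual pattern of killing and reaping the child when the timeout expires.

```python
import subprocess


def run_with_timeout(cmd, timeout=None):
    """Run `cmd` (an argv list) and return (returncode, stdout, stderr).

    Raises subprocess.TimeoutExpired if the command does not finish
    within `timeout` seconds. The child is killed before re-raising so
    that it is not left running in the background.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    try:
        stdout, stderr = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.communicate()  # reap the killed child and drain its pipes
        raise
    return proc.returncode, stdout, stderr
```

A caller that expects the command to hang, as can happen with fillmem, can catch `subprocess.TimeoutExpired` and carry on with its checks instead of blocking forever.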
@bchalios bchalios dismissed stale reviews from pb8o, roypat, and zulinx86 via 57ed015 October 6, 2023 11:11
@bchalios bchalios requested review from pb8o, roypat and zulinx86 October 6, 2023 11:12
@pb8o pb8o merged commit b4871b1 into firecracker-microvm:main Oct 6, 2023
5 checks passed
@bchalios bchalios deleted the fix_test_balloon branch October 6, 2023 14:13
4 participants