Use balloon statistics in the test that checks balloon deflates on OOM #4150

bchalios · 2023-10-05T10:46:36Z

Reason

Balloon devices can start deflating when the guest is in an OOM situation. We have a test that ensures this functionality works as expected. The test creates a microVM with a balloon device enabled, inflates the balloon, and then invokes a process in the microVM that exhausts the remaining microVM memory. The expectation is that the OOM killer will kick in and reap that process. To succeed, the test relies on observing that the process that fills up the memory gets killed.

However, we do not really have control over which process the OOM killer decides to kill in low-memory situations. This makes the test fail intermittently.

Changes

This PR changes the test to look at balloon statistics instead. Conceptually this makes sense: we don't want to test the OOM killer's functionality; we want to ensure that the balloon device gives memory back to the VM in low-memory situations. The balloon statistics can give us this information.
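As a rough illustration of the idea (not the code in this PR), a statistics-based check could compare two snapshots of the balloon statistics taken before and after the memory pressure. The helper name and the sample values are hypothetical, and the sketch assumes each snapshot is a dict carrying an `actual_mib` entry for the memory currently held by the balloon.

```python
def balloon_gave_back_memory(stats_before: dict, stats_after: dict) -> bool:
    """Return True if the balloon shrank between two statistics snapshots.

    Assumption: each snapshot has an "actual_mib" entry holding the amount
    of memory (in MiB) that the balloon currently holds.
    """
    return stats_after["actual_mib"] < stats_before["actual_mib"]


# Hypothetical snapshots, for illustration only.
before = {"target_mib": 180, "actual_mib": 180}
after = {"target_mib": 180, "actual_mib": 96}
print(balloon_gave_back_memory(before, after))
```

With deflate on OOM enabled the check is expected to hold after the guest runs out of memory; with the option disabled the balloon size should stay the same.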

With this change, the test always passes when we configure the balloon device to deflate when the guest is in OOM conditions. However, the test is still flaky when we run it with the deflate-on-OOM option disabled. The reason is that the SSH command we run to spawn the guest process that drains memory sometimes hangs, probably because the OOM killer chooses the process that handles the SSH connection. So the PR also adds a timeout option to the function we use to run commands over SSH inside the guest.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following
Developer Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

If a specific issue led to this PR, this PR closes the issue.
The description of changes is clear and encompassing.
Any required documentation changes (code and docs) are included in this PR.
API changes follow the Runbook for Firecracker API changes.
User-facing changes are mentioned in CHANGELOG.md.
All added/changed functionality is tested.
New TODOs link to an issue.
Commits meet contribution quality standards.

This functionality cannot be added in rust-vmm.

codecov · 2023-10-05T10:55:25Z

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (63395be) 83.10% compared to head (57ed015) 83.10%.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #4150   +/-   ##
=======================================
  Coverage   83.10%   83.10%           
=======================================
  Files         225      225           
  Lines       28605    28605           
=======================================
  Hits        23772    23772           
  Misses       4833     4833

Flag	Coverage Δ
4.14-c7g.metal	`78.67% <ø> (+<0.01%)`	⬆️
4.14-m5d.metal	`80.47% <ø> (ø)`
4.14-m6a.metal	`79.61% <ø> (ø)`
4.14-m6g.metal	`78.67% <ø> (ø)`
4.14-m6i.metal	`80.45% <ø> (+<0.01%)`	⬆️
5.10-c7g.metal	`81.57% <ø> (ø)`
5.10-m5d.metal	`83.13% <ø> (ø)`
5.10-m6a.metal	`82.37% <ø> (ø)`
5.10-m6g.metal	`81.57% <ø> (ø)`
5.10-m6i.metal	`83.11% <ø> (+<0.01%)`	⬆️
6.1-c7g.metal	`81.57% <ø> (ø)`
6.1-m5d.metal	`83.13% <ø> (ø)`
6.1-m6a.metal	`82.37% <ø> (ø)`
6.1-m6g.metal	`81.57% <ø> (ø)`
6.1-m6i.metal	`83.11% <ø> (+<0.01%)`	⬆️

tests/integration_tests/functional/test_balloon.py

pb8o

Looks great, just a couple minor comments.

tests/integration_tests/functional/test_balloon.py

tests/framework/utils.py

Balloon devices can start deflating when the guest is in an OOM situation. We have a test that ensures this functionality works as expected. The test creates a microVM with a balloon device enabled, inflates the balloon, and then invokes a process in the microVM that exhausts the remaining microVM memory. The expectation is that the OOM killer will kick in and reap that process. To succeed, the test relies on observing that the process that fills up the memory gets killed. However, we do not really have control over which process the OOM killer decides to kill in low-memory situations. This makes the test fail intermittently. This commit changes the test to look at balloon statistics instead. Conceptually this makes sense: we don't want to test the OOM killer's functionality; we want to ensure that the balloon device gives memory back to the VM in low-memory situations. The balloon statistics can give us this information. Signed-off-by: Babis Chalios <[email protected]>

In test_balloon.py we look up the PID of the SSH daemon process so that we can later change its OOM score and keep it from being killed when we (deliberately) exhaust the available memory in a microVM. We use "pidof sshd" to do that. However, pidof returns the PIDs of all processes named sshd, and sshd forks a new process for every new SSH connection. Later, when we iterate over these PIDs to change their OOM score, some of those processes may no longer exist, so choom fails for them. This is not really a problem, but it sometimes leads to misleading error messages. This commit drops the use of "pidof" in favour of reading the daemon's PID from "/run/sshd.pid". Signed-off-by: Babis Chalios <[email protected]>
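The pidfile approach can be sketched roughly as follows. This is a simplified, hypothetical illustration: the helper names are made up, the pidfile is read from the local filesystem (in the test the equivalent steps run inside the guest over SSH), and the score adjustment of -1000 is an assumed value that exempts a process from the OOM killer.

```python
import tempfile
from pathlib import Path


def read_daemon_pid(pidfile: str = "/run/sshd.pid") -> int:
    """Read the single daemon PID stored in a pidfile."""
    return int(Path(pidfile).read_text().strip())


def oom_protect_command(pid: int, score_adj: int = -1000) -> str:
    """Build a choom command line that sets the OOM score adjustment of pid."""
    return f"choom -n {score_adj} -p {pid}"


# Demonstration with a temporary pidfile standing in for /run/sshd.pid.
with tempfile.TemporaryDirectory() as tmp:
    fake_pidfile = Path(tmp) / "sshd.pid"
    fake_pidfile.write_text("1234\n")
    print(oom_protect_command(read_daemon_pid(str(fake_pidfile))))
```

Because the pidfile holds exactly one PID, there is no loop over short-lived per-connection processes and therefore no spurious choom failures.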

In test_balloon.py::test_deflate_on_oom we exhaust the memory of the microVM to trigger the OOM killer. This commit removes the SSH commands issued after launching the memory hogger inside the microVM, to avoid hanging connections caused by the OOM killer killing sshd. Signed-off-by: Babis Chalios <[email protected]>

We have a mechanism that allows us to run a command inside a microVM. This mechanism ends up using Popen.communicate to retrieve the output of the SSH command. Popen.communicate accepts a timeout argument that we were not using. It is useful in cases where we don't want to wait indefinitely for the result of the command we execute in the microVM. This commit extends our SSH mechanism to accept the timeout argument and uses it in test_balloon.py::test_deflate_on_oom when launching the fillmem process. fillmem drains the memory of the VM, which sometimes results in the SSH connection hanging. Signed-off-by: Babis Chalios <[email protected]>
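For readers unfamiliar with the timeout argument of Popen.communicate, here is a minimal sketch of how a command runner might use it. The function name `run_with_timeout` and its return convention are assumptions made for this example and are not the framework's actual helper; only the `subprocess` calls are standard library API.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout=None):
    """Run cmd and return (returncode, stdout, stderr), or None on timeout."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    try:
        stdout, stderr = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        # communicate() does not stop the child on timeout, so kill it
        # and collect whatever output is left to avoid a zombie process.
        proc.kill()
        proc.communicate()
        return None
    return proc.returncode, stdout, stderr


# A quick command finishes well within the timeout.
print(run_with_timeout([sys.executable, "-c", "print('ok')"], timeout=30))
# A command that sleeps longer than the timeout is abandoned.
print(run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], timeout=0.5))
```

Returning None on timeout is one possible design; a caller that launches a memory hog and does not care about its output can simply ignore the result.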

bchalios self-assigned this Oct 5, 2023

bchalios added the Status: Awaiting review label Oct 5, 2023

bchalios changed the title ~~Use ballon statistics in the test that checks balloon deflates on OOM~~ Use balloon statistics in the test that checks balloon deflates on OOM Oct 5, 2023

bchalios force-pushed the fix_test_balloon branch from 2f52798 to 2b70376 Compare October 5, 2023 11:07

zulinx86 previously approved these changes Oct 5, 2023

bchalios marked this pull request as draft October 5, 2023 11:11

bchalios dismissed zulinx86’s stale review via 8231db2 October 5, 2023 11:16

bchalios force-pushed the fix_test_balloon branch 5 times, most recently from 9568a20 to 893e2e1 Compare October 5, 2023 15:31

bchalios marked this pull request as ready for review October 5, 2023 15:31

bchalios force-pushed the fix_test_balloon branch from 893e2e1 to 48534eb Compare October 5, 2023 15:44

bchalios requested a review from zulinx86 October 5, 2023 15:49

zulinx86 previously approved these changes Oct 5, 2023

roypat previously approved these changes Oct 6, 2023

pb8o previously approved these changes Oct 6, 2023

bchalios added 4 commits October 6, 2023 11:00

bchalios dismissed stale reviews from pb8o, roypat, and zulinx86 via 57ed015 October 6, 2023 11:11

bchalios force-pushed the fix_test_balloon branch from 0095e0e to 57ed015 Compare October 6, 2023 11:11

bchalios requested review from pb8o, roypat and zulinx86 October 6, 2023 11:12

roypat approved these changes Oct 6, 2023

pb8o approved these changes Oct 6, 2023

pb8o merged commit b4871b1 into firecracker-microvm:main Oct 6, 2023
5 checks passed

bchalios deleted the fix_test_balloon branch October 6, 2023 14:13
