-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use balloon statistics in the test that checks balloon deflates on OOM #4150
Conversation
Codecov ReportAll modified lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #4150 +/- ##
=======================================
Coverage 83.10% 83.10%
=======================================
Files 225 225
Lines 28605 28605
=======================================
Hits 23772 23772
Misses 4833 4833
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
2f52798
to
2b70376
Compare
9568a20
to
893e2e1
Compare
893e2e1
to
48534eb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, just a couple minor comments.
Balloon devices have a feature where they can start deflating when the guest is in an OOM situation. We have a test that ensures this functionality works as expected. The test creates a microVM with a balloon device enabled, it inflates the balloon and then invokes a process in the microVM that exhausts the remaining microVM memory. The expectation is that the OOM killer will kick in and reap that process. The test relies on observing the process that fills up the memory to be killed in order to succeed. However, we do not really have control on what process the OOM will decide to kill, in low memory situations. This makes the test failing intermittently. This commit, changes the test to instead look into balloon statistics. Conceptually this makes sense; we don't want to test the OOM killer functionality, we want to ensure that the balloon device gives back memory to the VM in low memory situations. The balloon statistics can give us this information. Signed-off-by: Babis Chalios <[email protected]>
In test_balloon.py we are trying to find the PID of the SSH daemon process, so that we later change its OOM score, so that it does not get killed when we (deliberately) exhaust the available memory in a microVM. We use "pidof sshd" to do that. However, pidof returns the PIDs of all threads of the process. Now sshd launches a new thread for every new SSH connection. Later when we iterate over these PIDs to change their OOM score not all of the threads might be there, so the choom will fail for these. This is not really a problem, but it some times leads to misleading error messages. This commit drops the use of "pidof" in favour of reading the daemon's PID from "/run/sshd.pid". Signed-off-by: Babis Chalios <[email protected]>
In test_balloon.py::test_deflate_on_oom we are exhausting the memory of the microVM trying to trigger the OOM killer. This commit removes SSH commands after launching the memory hogger inside the microVM, to avoid hang connections due to the OOM killer killing sshd. Signed-off-by: Babis Chalios <[email protected]>
We have a mechanism that allows us to run a command inside a microVM. This mechanism ends up using Popen.communicate to retrieve the output of the SSH command. Popen.communicate comes with a timeout variable that we were not using. However, it is useful in cases where we don't want to wait for the result of the command we execute in the microVM. This commit extends our SSH mechanism to accept the timeout argument. Then, it uses the timeout in the test_balloon.py::test_deflate_on_oom when launching the fillmem process. fillmem drains the memory of the VM, which sometimes results in the SSH connection hanging. Signed-off-by: Babis Chalios <[email protected]>
0095e0e
to
57ed015
Compare
Reason
Balloon devices have a feature where they can start deflating when the guest is in an OOM situation. We have a test that ensures this functionality works as expected. The test creates a microVM with a balloon device enabled, it inflates the balloon and then invokes a process in the microVM that exhausts the remaining microVM memory. The expectation is that the OOM killer will kick in and reap that process. The test relies on observing the process that fills up the memory to be killed in order to succeed.
However, we do not really have control on what process the OOM will decide to kill, in low memory situations. This makes the test failing intermittently.
Changes
This PR, changes the test to instead look into balloon statistics. Conceptually this makes sense; we don't want to test the OOM killer functionality, we want to ensure that the balloon device gives back memory to the VM in low memory situations. The balloon statistics can give us this information.
By doing that, the test always passes when we configure the balloon device to deflate when the guest is in OOM conditions. However, the test is still flaky when we run it with the deflate on OOM option disabled. The reason is that some times the SSH command we run to spawn the guest process that drains memory hangs. Probably the OOM killer chooses the thread that handles the SSH connection. So, the PR also adds a timeout option in the function that we have to run commands over SSH inside the guest.
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following
Developer Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md
.PR Checklist
CHANGELOG.md
.TODO
s link to an issue.rust-vmm
.