releasemanager (with EDP, JIRA): memory used/consumed during run is not reclaimed / freed up after run #857
Comments
@clemensutschig we looked into this. What we detected is that the available memory is consumed immediately upon start, beginning at 10 GB and quickly going up to 14 GB. We also see regular garbage collector activity and that Jenkins' underlying Hudson implementation keeps up to 5 M references in its extension lists. We do see regular suspension times induced by the GC, related to promoting objects from the Eden Space into the Old Gen. I assume that our configuration has something to do with this, but it came as a recommendation from CloudBees, see opendevstack/ods-core#412. |
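For readers who want to reproduce this kind of observation: a live class histogram and a quick GC-pressure sample can be taken with the standard JDK tools. This is only a sketch under the assumption that jcmd/jstat are available in the Jenkins image and that the Java process is PID 1; the pod name is a placeholder.

```sh
# Top 30 entries of the class histogram (instance counts and heap bytes per class)
oc exec <jenkins-pod> -- jcmd 1 GC.class_histogram | head -n 30

# Per-space utilisation and GC counts, sampled every 2 s, 10 samples
oc exec <jenkins-pod> -- jstat -gcutil 1 2000 10
```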
@s2oBCN @victorpablosceruelo @angelmp01 will the ongoing refactorings of the LeVA DocGen service affect memory consumption? |
hey @metmajer this is interesting - do we have a view of what is stored in these extension lists ...? I took a look at what the docs say about this .. maybe here is our culprit ...?! |
@clemensutschig yes, might make a difference, but could depend on the GC. Let's not take for granted what CloudBees recommended some time ago. The following article might be interesting: https://developers.redhat.com/articles/2021/11/02/how-choose-best-java-garbage-collector |
Hi there. We are requesting access to the monitoring tools you show in this issue. The more details we have, the more precise our answer will be. |
Here is an overview on the Jenkins configuration we are currently using: opendevstack/ods-core#412 (comment) |
hurray :)
|
@clemensutschig you're the first person that's happy about an OOM :) Are you able to achieve better insights? Thinking the GC options might need a re-evaluation. |
I am trying to get better insights - and clean up all sorts of things now in my branch - to see if that impacts anything .. what's interesting - I have turned on cleanup (through the build config) of failed builds .. and take a look below:
|
ok - more tries (only keeping 2 failed runs, which are the ones below):
- restart: 3246m mem used
- 17:43: start SECOND no JIRA, NO DOCS rm run (agent pods were removed)
- 17:59: start THIRD no JIRA, NO DOCS rm run (agent pods were removed)
- 18:15: start FOURTH no JIRA, NO DOCS rm run (agent pods were removed)
I am trying to understand why this is steadily growing - and I really CANNOT .. |
ps - in case someone is curious - I added
to the odsOrchestrationStage - almost at the very root ... :( and still nothing changes |
ok - now I used max 3g (limits 3Gi) - with 0.5 heap percentage and -Xmx1g. Top: 1.3g Java, and overall pod memory at 1600m and still growing ... from the logs:
|
VM flags (based on xms / xmx = 1g and metaspace 1g):
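Purely as an illustration of what such a flag set looks like (the concrete values below are assumptions, not the actual flag dump from this run):

```sh
# Hypothetical JVM options matching "xms / xmx = 1g and metaspace 1g";
# NativeMemoryTracking is added because the NMT diffs later in this
# thread require it to be set at JVM startup.
JAVA_OPTS="-Xms1g -Xmx1g \
  -XX:MetaspaceSize=1g -XX:MaxMetaspaceSize=1g \
  -XX:NativeMemoryTracking=summary"
```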
more diagnosis .. (I have one testcomponent Ods job that runs over and over again) vs a couple minutes later vs a couple mins later |
@clemensutschig I'd like to do an analysis to identify memory leaks, how can I get the heap dump file? Did you generate a heap dump? |
native memory tracking: 12:21 UTC / 12:30 UTC. I am now forcing a max-of-RAM of 90% .. let's see where that brings us .. see below - class is going up, reserved and also committed - although it's the exact same job running, from the exact same commit!
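A minimal sketch of how such an NMT diff can be captured, assuming the JVM was started with -XX:NativeMemoryTracking=summary and runs as PID 1 in the pod:

```sh
# Record the baseline once, run the test job a few times, then diff against it
oc exec <jenkins-pod> -- jcmd 1 VM.native_memory baseline
# ... run the component job repeatedly ...
oc exec <jenkins-pod> -- jcmd 1 VM.native_memory summary.diff
```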
the same 10 mins later:
10 mins later:
and now 20 mins later - something starts to collect
take a look at the class increase ... !!!! what's interesting: |
but unfortunately - we are using [Jenkins 2.289.1] :( |
running classloader stats:
with now another run (with shared lib)
and now withOUT shared lib import (just echo ...)
and now with shared lib import (and echo .... )
sooo this is a classloader reference leak - something hangs on to classloaders - and it reproduces W/O any shared lib - it just gets worse with shared libs ... :/ |
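The classloader numbers quoted here can be gathered with jcmd; a sketch, assuming a JDK 9+ runtime and PID 1 for the Jenkins JVM (on JDK 8 the closest equivalent is GC.class_stats behind -XX:+UnlockDiagnosticVMOptions):

```sh
# One line per classloader with loaded class count and memory used
oc exec <jenkins-pod> -- jcmd 1 VM.classloader_stats

# Loaded / unloaded class totals over time (10 s interval, 6 samples)
oc exec <jenkins-pod> -- jstat -class 1 10000 6
```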
ok - some more testing with classloader stats after start (NO running jobs):
after running a naked job (NO shared lib - just echo)
wait a few mins ....
after running a job (with shared lib import and echo)
running the same job again
this is really scary ... :( |
I started to play with the heap dump - and classloaders .... and all big classloaders have a common ancestor: if I look into the classloader - this is definitely ours ... (with grapes being there ...) but there is a second guy holding on to it (the first is the finalizer of the VM that tries to clean it) - and that starts to spark my interest ... (via incoming references) this is an enum .. hmmmm |
Hi, |
Ran the tests again with the shared library, both with and without the changes for the next release, using standard Jenkins master and agent images, and compared the results. After the build ends, memory gets freed thanks to Clemens' cleanup code.
@oalyman - here are the two nice commands. Baseline:
and to track metrics:
and what's helpful is to trigger a GC as well - with
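A hedged sketch of what such a measure-and-collect loop can look like (pod name, PID 1 and the 5-minute interval are assumptions):

```sh
# Force a full GC, then sample JVM native memory and pod-level usage
while true; do
  oc exec <jenkins-pod> -- jcmd 1 GC.run
  oc exec <jenkins-pod> -- jcmd 1 VM.native_memory summary | grep -E 'Total|Class|Thread'
  oc adm top pod <jenkins-pod>
  sleep 300
done
```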
I stumbled across a few things though while looking through the heap dump (with the dangling loaders) - and that sort of brings me back to my original thoughts (that it must be something other than a green-path EDP RM execution)
So there are two HOT topics that are the smoking guns ... (a) the component pipeline and (b) the component pipeline ESPECIALLY in the context of a [ci skip] - which is triggered from the RM commit(s). Quickly verified - in neither of the two cases is the cleanup code called (and I would also guess that it's incomplete in any case)
I will provide a fix for this in a branch for people to try out - now that we can repro it :) (it's the recurring promote-to-dev builds - that create all sorts of commits, which cause downstream CI-skip runs) .. |
with the fix I see less class etc growth ... which is good .. interestingly (at full swing, and close to class space limit) - I see a
|
what's kind of interesting is how long pods stay alive (even after the Jenkins run is done and gone) - which may be the reason for the thread mem size growing .. and secondly - although JVM mem is pretty stable - I still see pod mem consumption growing .. @oalyman - maybe you can take a look into that .. |
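To look into that, something along these lines could serve as a starting point (a sketch; the agent label depends on how the Kubernetes plugin is configured, and PID 1 is assumed for the Jenkins JVM):

```sh
# Agent pods still hanging around after the builds finished
oc get pods -l jenkins=slave -o wide

# Pod-level memory (what OpenShift / the OOM killer see) vs. JVM-level numbers
oc adm top pod <jenkins-pod>
oc exec <jenkins-pod> -- jstat -gc 1
```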
Courtesy @oalyman - simple is beautiful. Rerunning my testsuite. One thing that people who debug / test with this should also add is
The settings @clemensutschig mentioned above originate from
Which refers to |
latest settings based on Jan's - I added
That's kind of interesting - it looks like most collectors don't give memory back to the OS (at least not a lot) .. :( The other piece is to figure out whether we leak somewhere else - as documented here: https://stackoverflow.com/questions/64593968/why-does-res-memory-keep-slowly-growing-for-java-processes-even-for-out-of-the-b I'll leave my testcase running overnight now - to see what's going on .. and whether native memory increases. |
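For context, a hedged illustration of the knobs that influence whether heap memory is given back to the OS (these are not necessarily the settings used here, and the G1 periodic option needs JDK 12+):

```sh
# Allow the collector to shrink the heap more aggressively when it is mostly free
GC_SHRINK_OPTS="-XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=30"

# On JDK 12+, G1 can additionally uncommit memory during idle periods:
# GC_SHRINK_OPTS="$GC_SHRINK_OPTS -XX:+UseG1GC -XX:G1PeriodicGCInterval=300000"
```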
|
so news from the night .. I let the testcase run overnight .. and the VM is super stable (with heap growth)
|
We deleted all the "pending" CI-skip builds that were triggered because of the commits (assemble to dev) .. immediately the heap is reduced by ~350 MB, which also shows up in the OC container mem metrics going down by 200m?! ... interesting |
Here is a pretty comprehensive source of information on the memory footprint of the JVM.
Here is a summary of the current state:
To investigate why OpenShift reports a larger memory usage than anything which can be observed in the container (+50-80%).
With xmx2g & malloc_arena=4 and various code fixes (shared lib and ods core one)
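For anyone wanting to reproduce this setup, roughly (a sketch: MALLOC_ARENA_MAX is a plain glibc env var, but the variable the Jenkins image actually reads for JVM options may differ, e.g. JENKINS_JAVA_OVERRIDES in the OpenShift image):

```sh
# Cap glibc malloc arenas and set the 2g heap limit;
# a config change on the DeploymentConfig normally triggers a new rollout automatically
oc set env dc/jenkins MALLOC_ARENA_MAX=4 JAVA_OPTS="-Xmx2g"
```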
|
@oalyman @albertpuente @clemensutschig this is huge, absolutely insane. fantastic job! |
I did a bit more digging into the difference between the memory used by Java and what OpenShift reports. The metric we see in OpenShift corresponds to container_memory_working_set_bytes, and it is what the OOM killer looks at (source). This metric is understood as a heuristic for the minimum amount of memory the app needs to work and corresponds to memory.usage_in_bytes minus total_inactive_file. This pretty much matches what I see in my example:
$ oc exec jenkins-mem-test-6-gg8tt -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes
2535804928
$ oc exec jenkins-mem-test-6-gg8tt -- cat /sys/fs/cgroup/memory/memory.stat | grep inactive_file
inactive_file 334565376
total_inactive_file 334565376
$ expr 2535804928 - 334565376
2201239552
# This is 2099.265625 Mi
$ kubectl top pod jenkins-mem-test-6-gg8tt # Same as seen in OpenShift metrics
NAME CPU(cores) MEMORY(bytes)
jenkins-mem-test-6-gg8tt 48m 2099Mi
Now, |
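The same working-set computation in one shot (using the pod from the example above):

```sh
POD=jenkins-mem-test-6-gg8tt
USAGE=$(oc exec $POD -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
INACTIVE=$(oc exec $POD -- cat /sys/fs/cgroup/memory/memory.stat | awk '/^total_inactive_file/ {print $2}')
# working_set = usage_in_bytes - total_inactive_file, i.e. what OpenShift reports
echo "$(( (USAGE - INACTIVE) / 1024 / 1024 ))Mi"
```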
the question is whether this will ever be reclaimed .. or does that indicate a file leak, or something? sorry for the stupid beginner question |
as we are using a LOT of jar and zip files (grape! and doc gen) - maybe https://stackoverflow.com/questions/44273737/how-to-use-ulimit-with-java-correctly and we have to go
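A quick check whether all those jars/zips actually pile up as open files or memory mappings (a sketch; PID 1 assumed for the Java process):

```sh
# Open file descriptors vs. the per-process limit inside the container
oc exec <jenkins-pod> -- sh -c 'ls /proc/1/fd | wc -l; ulimit -n'

# Number of memory mappings - every mmapped jar/zip adds entries here
oc exec <jenkins-pod> -- sh -c 'wc -l < /proc/1/maps'
```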
ok. But I found yet another Jenkins bug (changeset related) - for some reason the "dummy" commits in a test component don't show up - so these builds will dangle around for sure (one build following every RM build) :(
This is already set by default if one does not set |
if I read https://faun.pub/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d?gi=f77e14a09937 correctly, we may now really be on the safe side ... I will try to force it to the limits and see what happens ... :D |
We run Jenkins with EDP - for document generation - and can see that memory used during the run is not reclaimed / freed up afterwards.
The run started at 5:44pm and was done at 11pm CET - you can see: no GC // reclaim
@braisvq1996 @albertpuente