Resource optimisation #82
Conversation
Force-pushed from 78d70fa to 9ded53a
I'll update this branch to include the changes merged through #68 before marking the branch as ready.
Had to bump samtools to 1.17 as the option is not available in earlier versions.
…in the UR field of the @sq headers
Force-pushed from 9ded53a to 4094724
This looks good to me. I will start the full test to confirm everything works. Are you ready to merge this?
Yes, I'm ready. There's nothing else I'm planning to add to this PR.
The full test runs. Can we increase the resources for the following:
These failed and were restarted. This is a small dataset, so everything should run through in one go.
Can you share the trace file? I'd like to see how much of the resources was used. I've also noticed that sometimes the first attempt fails despite having the right requirements, for instance because the machine had slow access to the network.
Oki. The last 4 are all cases of "should have succeeded at the first attempt"
The first attempts above were given more than enough resources but failed. I think there must have been something wrong on the network / storage link of those nodes at that time. I would keep those rules as they are, but I also want to introduce some monitoring of job failures, so that the nodes / network issues can be appropriately reported to ISG. I'll discuss that with Cibin. ALIGN_HIFI:CONVERT_STATS:SAMTOOLS_FLAGSTAT is less clear:
Because LSF monitors memory usage by polling, there is a risk that very short memory bursts are not caught by LSF, or that short-ish jobs don't get their true memory usage reported.
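Transient failures like the node/network issues above are usually absorbed in Nextflow with a retry error strategy, so that the `* task.attempt` factors in the resource formulas grant more resources on each rerun. A minimal sketch, with illustrative exit codes and values rather than this pipeline's exact settings:

```groovy
process {
    // Retry when the scheduler kills the job (e.g. common LSF kill
    // exit codes for memory/runtime limits); otherwise stop cleanly.
    errorStrategy = { task.exitStatus in [130, 137, 140] ? 'retry' : 'finish' }
    maxRetries    = 2

    // Resource requests can then scale with the attempt number,
    // so a job that genuinely needs more memory gets it on retry.
    memory = { 1.GB * task.attempt }
}
```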
Closes #81 and solves the RUNLIMIT issue we found when running the full_test https://sangertreeoflife.slack.com/archives/C04DJMVBZGW/p1701900015503119
In this PR, I modify the resource requirements of all processes to match the actual usage. Where possible, I make those requirements a function of the input sizes, so that the pipeline can adapt to small and large inputs.
I'm using the same dataset as in sanger-tol/genomenote#92 : 10 genomes of increasing size, with 1 Hi-C library and 1 PacBio library each. Input sizes are not proportional so that I can observe how the resources depend on each of them.
At the same time, I decided to do a few logic optimisations:
To support tuning the resources, I added some steps to extract the genome size (using the size of the Fasta file as a proxy) and the read counts. These numbers are recorded in the `meta` map and used in `conf/base.config`. I don't use any of the `process_*` labels any more; every process now has optimised resources.
I also introduced a helper function to grow the number of CPUs needed for a process in a logarithmic fashion. This is to control the increase in the number of CPUs, especially as the multi-threading efficiency tends to decrease with a higher number of threads. The function can then also be used in the memory requirement if the memory usage increases with the thread count.
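Such a helper could look as follows. This is a hypothetical sketch, not the actual function from this PR: the name `log_increase_cpus` and its parameters are illustrative only.

```groovy
// Hypothetical helper: grow the CPU count with the logarithm of the
// input size (in GB), so doubling the input adds a fixed number of
// CPUs instead of doubling them. Never below min_cpus, and scaled
// by the attempt number to cover reruns.
def log_increase_cpus(min_cpus, cpus_per_doubling, size_gb, attempt) {
    def log2 = { x -> Math.log(x) / Math.log(2) }
    def cpus = min_cpus + (cpus_per_doubling * log2(1 + size_gb)) as int
    return Math.max(min_cpus, cpus) * attempt
}
```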
Most formulas are either constant values or linear functions of the genome size or read count (though I always add a `* task.attempt` factor somewhere to cover reruns). The two more complex formulas are:
Detailed charts showing the memory/CPU/time used/requested for every process: before (PDF), after (PDF)
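As a concrete but hypothetical example of such a formula, a linear function of the genome size with a retry multiplier could be written in `conf/base.config` like this, assuming the process exposes the `meta` map with a `genome_size` key as described above (the process name and constants are illustrative):

```groovy
process {
    // Illustrative only: a base cost plus a per-Gbp term, multiplied
    // by task.attempt so that each retry requests more memory.
    withName: 'SAMTOOLS_SORT' {
        memory = { (1.GB + 2.GB * (meta.genome_size / 1e9)) * task.attempt }
    }
}
```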
PR checklist

- Make sure your code lints (`nf-core lint`).
- Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- `docs/usage.md` is updated.
- `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).