Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resource optimisation #82

Merged
merged 50 commits into from
Dec 18, 2023
Merged

Resource optimisation #82

merged 50 commits into from
Dec 18, 2023

Conversation

muffato
Copy link
Member

@muffato muffato commented Nov 6, 2023

Closes #81 and solves the RUNLIMIT issue we found when running the full_test https://sangertreeoflife.slack.com/archives/C04DJMVBZGW/p1701900015503119

In this PR, I modify the resource requirements of all processes to match the actual usage. Where possible, I make those requirements a function of the input sizes, so that the pipeline can adapt to small and large inputs.

I'm using the same dataset as in sanger-tol/genomenote#92 : 10 genomes of increasing size, with 1 Hi-C library and 1 PacBio library each. Input sizes are not proportional so that I can observe how the resources depend on each of them.

  Fasta size (bytes) PacBio size (# reads) Hi-C (# reads)
GCA_939531405.1 13,824,461 1,546,435 955,654,834
GCA_937625935.1 26,683,271 189,202 980,890,138
GCA_951394315.1 58,010,196 1,965,084 704,258,466
GCA_947172415.1 118,858,594 799,796 87,833,110
GCA_910589235.2 232,212,321 1,586,931 727,465,652
GCA_949987625.1 417,566,504 2,211,570 705,705,280
GCA_946406115.1 810,357,340 1,872,695 842,629,084
GCA_963513935.1 1,803,897,959 7,338,871 3,305,634,916
GCA_951213105.1 3,609,437,155 1,121,856 3,127,898,040
GCA_946902985.2 9,152,113,672 1,537,548 886,707,886

At the same time, I decided to do a few logic optimisations:

  • SAMTOOLS_MERGE produces a sorted file, so no need to run SAMTOOLS_SORT afterwards
  • SAMTOOLS_MERGE can be skipped if there is a single file
  • To save IO, the whole MARKDUPLICATE sub-workflow can be implemented as a bash pipeline within a module

To support tuning the resources, I added some steps to extract the genome size (using the size of the Fasta file as a proxy) and the read counts. These numbers are recorded in the meta map and used in conf/base.config.

I don't use any of the process_* labels any more. Every process now has optimised resources.
I also introduced a helper function to grow the number of CPUs needed for a process in a logarithm fashion. This is to control the increase of the number of CPUs, especially as the multi-threading tends to decrease with a higher number of threads.
The function can then also be used in the memory requirement if the memory usage increases with the thread count.

Most formulas are either constant values or linear functions of the genome size or read count (though I always add a * task.attempt somewhere to cover reruns. The two more complex formulas are:

  • MINIMAP2_ALIGN: the memory seems to be the sum of a function of the genome size and a function of the read count
  • BWAMEM2_MEM: runtime seems to be a function of the logarithm of the genome size (and obviously the read count)
Metric Before After Improvement
Total memory requested (GB) 5,950.0 1,380.3 ÷4.3
Memory efficiency (used/requested, %) 13.4 64.2
Total memory allocated (GB-hours) 8,991.8 3,212.5 ÷2.8
Memory allocation efficiency (used/requested, %) 19.0 74.2
Total CPUs requested (n) 925.0 646.0 ÷1.4
CPU efficiency (used/requested, %) 70.3 80.2
Total CPU allocated (CPU-hours) 1,407.0 1,150.0 ÷1.2
CPU allocation efficiency (used/requested, %) 65.2 87.7
Job failures (%) 0.3 0.0

Detailed charts showing the memory/CPU/time used/requested for every process: before (PDF) after (PDF)

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@muffato muffato added the enhancement Improvement of the existing features label Nov 6, 2023
@muffato muffato added this to the 1.2.0 milestone Nov 6, 2023
@muffato muffato self-assigned this Nov 6, 2023
@muffato muffato linked an issue Nov 16, 2023 that may be closed by this pull request
@muffato muffato marked this pull request as ready for review November 24, 2023 21:08
@muffato muffato marked this pull request as draft December 5, 2023 09:15
@muffato
Copy link
Member Author

muffato commented Dec 6, 2023

I'll update this branch to include the changes merged through #68 before marking the branch as ready

@muffato muffato marked this pull request as ready for review December 9, 2023 09:18
@priyanka-surana
Copy link
Contributor

This looks good to me. I will start the full test to confirm everything works. Are you ready to merge this?

@muffato
Copy link
Member Author

muffato commented Dec 14, 2023

Yes I'm ready. There's nothing else I'm planning to add to this PR

@priyanka-surana
Copy link
Contributor

priyanka-surana commented Dec 16, 2023

The full test runs. Can we increase the resources for the following:

ALIGN_HIFI:CONVERT_STATS:SAMTOOLS_FLAGSTAT
ALIGN_HIC:SAMTOOLS_FASTQ
CRUMBLE
ALIGN_HIC:MARKDUP_STATS:CONVERT_STATS:SAMTOOLS_IDXSTATS
ALIGN_HIC:MARKDUP_STATS:CONVERT_STATS:SAMTOOLS_STATS

These failed and were restarted. This is a small dataset so everything should run through in one go.
Otherwise all good so I am approving this.

@muffato
Copy link
Member Author

muffato commented Dec 17, 2023

Can you share the trace file ? I'd like to see how much resources were used. I've also noticed that sometimes the first attempt fails despite having the right requirements, for instance because the machine had slow access to the network.

@muffato muffato mentioned this pull request Dec 18, 2023
9 tasks
@muffato
Copy link
Member Author

muffato commented Dec 18, 2023

Oki. The last 4 are all cases of "should have succeeded at the first attempt"

Process Resource given (1st attempt) Error raised (1st attempt) Resource used (2nd attempt)
ALIGN_HIC:SAMTOOLS_FASTQ 8 hours RUNLIMIT 11.5 min
CRUMBLE 3.5 hours RUNLIMIT 2 hours
ALIGN_HIC:MARKDUP_STATS:CONVERT_STATS:SAMTOOLS_IDXSTATS 30 min RUNLIMIT 1 min
ALIGN_HIC:MARKDUP_STATS:CONVERT_STATS:SAMTOOLS_STATS 4 hours RUNLIMIT 12.5 min

The first attempts above were given more than enough resources but failed. I think there must have been something wrong on the network / storage link of those nodes at that time.

I would keep those rules as they are, but I also want to introduce some monitoring of job failures, so that the nodes / network issues can be appropriately reported to ISG. I'll discuss that with Cibin.


ALIGN_HIFI:CONVERT_STATS:SAMTOOLS_FLAGSTAT is less clear:

  • The first attempt was given 250 MB and failed because of MEMLIMIT after 1 min 44 s. LSF reports it was using 279 MB at the time
  • The second attempt was given 500 MB and succeeded in 27.5 s, but only using 145 MB (according to Nextflow) or 186 MB (according to LSF).

Because LSF monitors the memory usage by polling, there's the risk that very short memory bursts are not always caught by LSF, or that short-ish jobs don't get their true memory usage reported.
In my tests, this process was occasionally killed for the same reason, but with no clear correlation to the input size (number of reads or genome length). I think I would leave the rule as it is for now. If it starts happening frequently in production, we can obviously bump it up.

@muffato muffato mentioned this pull request Dec 18, 2023
9 tasks
@muffato muffato merged commit 0f4e2a1 into dev Dec 18, 2023
6 checks passed
@muffato muffato deleted the resource_optimisation branch December 18, 2023 13:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improvement of the existing features
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Can't align PacBio/ONT reads on genomes > 4Gbp
2 participants