Resource optimisation #82

muffato · 2023-11-06T10:15:55Z

Closes #81 and solves the RUNLIMIT issue we found when running the full_test https://sangertreeoflife.slack.com/archives/C04DJMVBZGW/p1701900015503119

In this PR, I modify the resource requirements of all processes to match the actual usage. Where possible, I make those requirements a function of the input sizes, so that the pipeline can adapt to small and large inputs.

I'm using the same dataset as in sanger-tol/genomenote#92 : 10 genomes of increasing size, with 1 Hi-C library and 1 PacBio library each. Input sizes are not proportional so that I can observe how the resources depend on each of them.

Assembly	Fasta size (bytes)	PacBio (# reads)	Hi-C (# reads)
GCA_939531405.1	13,824,461	1,546,435	955,654,834
GCA_937625935.1	26,683,271	189,202	980,890,138
GCA_951394315.1	58,010,196	1,965,084	704,258,466
GCA_947172415.1	118,858,594	799,796	87,833,110
GCA_910589235.2	232,212,321	1,586,931	727,465,652
GCA_949987625.1	417,566,504	2,211,570	705,705,280
GCA_946406115.1	810,357,340	1,872,695	842,629,084
GCA_963513935.1	1,803,897,959	7,338,871	3,305,634,916
GCA_951213105.1	3,609,437,155	1,121,856	3,127,898,040
GCA_946902985.2	9,152,113,672	1,537,548	886,707,886

At the same time, I decided to do a few logic optimisations:

SAMTOOLS_MERGE produces a sorted file, so no need to run SAMTOOLS_SORT afterwards
SAMTOOLS_MERGE can be skipped if there is a single file
To save IO, the whole MARKDUPLICATE sub-workflow can be implemented as a bash pipeline within a module

To support tuning the resources, I added some steps to extract the genome size (using the size of the Fasta file as a proxy) and the read counts. These numbers are recorded in the meta map and used in conf/base.config.

I don't use any of the process_* labels any more. Every process now has optimised resources.
I also introduced a helper function that grows the number of CPUs requested by a process logarithmically. This keeps the CPU count under control, especially as multi-threading efficiency tends to decrease with a higher number of threads.
The function can then also be used in the memory requirement if the memory usage increases with the thread count.
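As an illustration of the idea, here is a minimal Python sketch of such a logarithmic helper. The pipeline itself is configured in Nextflow, so this is not the actual implementation: the function name `log_scaled_cpus`, its parameters and the default values are all assumptions made for the example.

```python
import math


def log_scaled_cpus(value, reference, start=2, step=2, base=2.0):
    """Hypothetical helper: request `start` CPUs up to `reference`,
    then `step` more CPUs every time `value` grows by a factor of `base`."""
    if value <= reference:
        return start
    # The logarithm keeps the CPU count growing slowly for large inputs.
    return start + round(step * math.log(value / reference, base))


# A 4x larger input only adds 4 CPUs, a 1024x larger input adds 20.
print(log_scaled_cpus(4e8, 1e8))
print(log_scaled_cpus(1.024e11, 1e8))
```

The same value could then be multiplied into a memory formula for tools whose memory usage grows with the thread count.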

Most formulas are either constant values or linear functions of the genome size or read count (though I always add a * task.attempt somewhere to cover reruns). The two more complex formulas are:

MINIMAP2_ALIGN: the memory seems to be the sum of a function of the genome size and a function of the read count
BWAMEM2_MEM: runtime seems to be a function of the logarithm of the genome size (and obviously the read count)
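The shape of these two formulas can be sketched in Python as follows. Every coefficient below is a placeholder chosen for readability, not a value from conf/base.config, and combining the logarithm and the read count as a product in the runtime formula is an assumption; only the overall structure (sum of two terms, logarithm of the genome size, multiplication by the attempt number) comes from the description above.

```python
import math


def minimap2_memory_gb(genome_bytes, n_reads, attempt):
    """Memory as the sum of a genome-size term and a read-count term.
    All coefficients are illustrative placeholders."""
    genome_term = 12.0 * genome_bytes / 1e9
    reads_term = 1.0 * n_reads / 1e6
    # Multiplying by the attempt number gives reruns more headroom.
    return (4.0 + genome_term + reads_term) * attempt


def bwamem2_runtime_hours(genome_bytes, n_reads, attempt):
    """Runtime driven by the read count and the logarithm of the genome size.
    The product form and the coefficients are assumptions."""
    return (1.0 + 0.5 * math.log10(genome_bytes) * n_reads / 1e9) * attempt


print(minimap2_memory_gb(1e9, 1e6, attempt=1))
print(bwamem2_runtime_hours(1e9, 2e9, attempt=2))
```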

Metric	Before	After	Improvement
Total memory requested (GB)	5,950.0	1,380.3	÷4.3
Memory efficiency (used/requested, %)	13.4	64.2
Total memory allocated (GB-hours)	8,991.8	3,212.5	÷2.8
Memory allocation efficiency (used/requested, %)	19.0	74.2
Total CPUs requested (n)	925.0	646.0	÷1.4
CPU efficiency (used/requested, %)	70.3	80.2
Total CPU allocated (CPU-hours)	1,407.0	1,150.0	÷1.2
CPU allocation efficiency (used/requested, %)	65.2	87.7
Job failures (%)	0.3	0.0
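
For reference, the Improvement column is simply the ratio before/after rounded to one decimal. A short Python check using the figures from the table (the dictionary layout and the `improvement` function are only for illustration):

```python
# (before, after) pairs copied from the table above.
totals = {
    "Total memory requested (GB)": (5950.0, 1380.3),
    "Total memory allocated (GB-hours)": (8991.8, 3212.5),
    "Total CPUs requested (n)": (925.0, 646.0),
    "Total CPU allocated (CPU-hours)": (1407.0, 1150.0),
}


def improvement(before, after):
    """Reduction factor, i.e. how many times smaller the new total is."""
    return round(before / after, 1)


for metric, (before, after) in totals.items():
    print(f"{metric}: ÷{improvement(before, after)}")
```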

Detailed charts showing the memory/CPU/time used/requested for every process: before (PDF) after (PDF)

PR checklist

This comment contains a description of changes (with reason).
If you've fixed a bug or added code that should be tested, add tests!
If you've added a new tool - have you followed the pipeline conventions in the contribution docs
Make sure your code lints (nf-core lint).
Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
Usage Documentation in docs/usage.md is updated.
Output Documentation in docs/output.md is updated.
CHANGELOG.md is updated.
README.md is updated (including new tool citations and authors/contributors).

muffato · 2023-12-06T13:45:00Z

I'll update this branch to include the changes merged through #68 before marking the branch as ready

priyanka-surana · 2023-12-14T16:24:50Z

This looks good to me. I will start the full test to confirm everything works. Are you ready to merge this?

muffato · 2023-12-14T16:45:29Z

Yes I'm ready. There's nothing else I'm planning to add to this PR

priyanka-surana · 2023-12-16T14:29:32Z

The full test runs. Can we increase the resources for the following:

ALIGN_HIFI:CONVERT_STATS:SAMTOOLS_FLAGSTAT
ALIGN_HIC:SAMTOOLS_FASTQ
CRUMBLE
ALIGN_HIC:MARKDUP_STATS:CONVERT_STATS:SAMTOOLS_IDXSTATS
ALIGN_HIC:MARKDUP_STATS:CONVERT_STATS:SAMTOOLS_STATS

These failed and were restarted. This is a small dataset so everything should run through in one go.
Otherwise all good so I am approving this.

muffato · 2023-12-17T14:39:01Z

Can you share the trace file? I'd like to see how many resources were actually used. I've also noticed that sometimes the first attempt fails despite having the right requirements, for instance because the machine had slow access to the network.

muffato · 2023-12-18T13:00:30Z

Oki. The last 4 are all cases of "should have succeeded at the first attempt"

Process	Resource given (1st attempt)	Error raised (1st attempt)	Resource used (2nd attempt)
ALIGN_HIC:SAMTOOLS_FASTQ	8 hours	RUNLIMIT	11.5 min
CRUMBLE	3.5 hours	RUNLIMIT	2 hours
ALIGN_HIC:MARKDUP_STATS:CONVERT_STATS:SAMTOOLS_IDXSTATS	30 min	RUNLIMIT	1 min
ALIGN_HIC:MARKDUP_STATS:CONVERT_STATS:SAMTOOLS_STATS	4 hours	RUNLIMIT	12.5 min

The first attempts above were given more than enough resources but failed. I think there must have been something wrong on the network / storage link of those nodes at that time.

I would keep those rules as they are, but I also want to introduce some monitoring of job failures, so that the nodes / network issues can be appropriately reported to ISG. I'll discuss that with Cibin.
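
A rough Python sketch of what such monitoring could flag, using the four rows of the table above: a RUNLIMIT failure is suspicious when the successful retry needed less time than the first attempt was given. The function name and the tuple layout are hypothetical, and a real check would read these values from the Nextflow trace file.

```python
# (process, minutes given on 1st attempt, minutes used on 2nd attempt)
runlimit_failures = [
    ("ALIGN_HIC:SAMTOOLS_FASTQ", 8 * 60, 11.5),
    ("CRUMBLE", 3.5 * 60, 2 * 60),
    ("ALIGN_HIC:MARKDUP_STATS:CONVERT_STATS:SAMTOOLS_IDXSTATS", 30, 1),
    ("ALIGN_HIC:MARKDUP_STATS:CONVERT_STATS:SAMTOOLS_STATS", 4 * 60, 12.5),
]


def suspicious_runlimits(rows):
    """Processes whose retry finished within the first attempt's time limit,
    which points at the node or the network rather than at the resource rule."""
    return [name for name, given, used in rows if used < given]


for name in suspicious_runlimits(runlimit_failures):
    print(name)
```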

ALIGN_HIFI:CONVERT_STATS:SAMTOOLS_FLAGSTAT is less clear:

The first attempt was given 250 MB and failed because of MEMLIMIT after 1 min 44 s. LSF reports it was using 279 MB at the time.
The second attempt was given 500 MB and succeeded in 27.5 s, but only using 145 MB (according to Nextflow) or 186 MB (according to LSF).

Because LSF monitors the memory usage by polling, there's the risk that very short memory bursts are not always caught by LSF, or that short-ish jobs don't get their true memory usage reported.
In my tests, this process was occasionally killed for the same reason, but with no clear correlation to the input size (number of reads or genome length). I think I would leave the rule as it is for now. If it starts happening frequently in production, we can obviously bump it up.
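
To illustrate the polling effect, here is a toy Python simulation. The memory trace and the 10-second polling interval are made up for the example and do not describe the real LSF configuration or the real job.

```python
def polled_peak(usage_by_second, poll_interval):
    """Peak memory seen by a monitor that samples every `poll_interval` seconds."""
    return max(usage_by_second[::poll_interval])


# Made-up trace: 30 seconds at 150 MB with a one-second burst to 300 MB.
usage = [150] * 30
usage[13] = 300

print(max(usage))              # true peak
print(polled_peak(usage, 10))  # the burst falls between two samples
```

Whether a burst is reported therefore depends on when it happens relative to the sampling times, which would be consistent with the lack of correlation with the input size.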

muffato added the enhancement Improvement of the existing features label Nov 6, 2023

muffato added this to the 1.2.0 milestone Nov 6, 2023

muffato self-assigned this Nov 6, 2023

This was referenced Nov 10, 2023

Resource optimisation sanger-tol/genomenote#90

Merged

Resource optimisation sanger-tol/genomenote#92

Merged

muffato linked an issue Nov 16, 2023 that may be closed by this pull request

Can't align PacBio/ONT reads on genomes > 4Gbp #81

Closed

muffato marked this pull request as ready for review November 24, 2023 21:08

muffato marked this pull request as draft December 5, 2023 09:15

muffato force-pushed the resource_optimisation branch from 78d70fa to 9ded53a Compare December 6, 2023 00:09

muffato added 20 commits December 7, 2023 15:10

Collect the read counts from the input files

2109bed

Collect the size of the genome (file size is a good proxy)

e2e8ead

Simply use the whole base name to name the channel

11678fc

Deal with genomes > 4 Gbp

84caa10

Closes #81

Update the read count for SAMTOOLS_MERGE

8dca0f3

Some values are large and need 64 bits

8c67daf

Wrong information

e432d89

Updated the error codes to the latest template versions. Covers all LSF exit codes

ae6942b

Estimate the resource requirements based on the size of the inputs

46e9910

bugfix: minimap2 uses the decimal system and understands floating point values !

5f2d786

The output of SAMTOOLS_MERGE is sorted

35a646d

Logically, SAMTOOLS_MERGE should happen in the calling sub-workflow

04b2660

Skip SAMTOOLS_MERGE if there is a single file

f9b2763

Explain why there is no SAMTOOLS_SORT

fb1cd6d

Replaced the markduplicate workflow by a single module / bash pipeline

c00eee6

The runtime of samtools sort depends on the number of reads

5c39de5

Updated requirements for SAMTOOLS_SORMADUP

ff9ff03

The MINIMAP2_ALIGN includes SAMTOOLS_SORT. Need some extra memory related to the number of reads

d54d46f

In my latest tests, it seems BWAMEM2_MEM memory usage is correlated with the logarithm of the genome size

582bedd

I don't need samtools cat

d884d47

muffato added 16 commits December 7, 2023 15:20

Tell it's a meta map

eae3261

Indentation should be a multiple of 4

f211fb4

SAMTOOLS_FLAGSTAT may take more time

5b70604

typo

0c5e4f9

Increased runtime, just in case

c27dd30

Updated runtime and memory requirements

a2924b0

Need these fields in the debug output

b2f22a2

Like in the genome note pipeline, use the work directory instead of /tmp

cfac7c8

Had to bump this samtools to 1.17 as the option is not available in earlier versions.

Updated the settings for SORMADUP

9da9f3f

Updated the BWA_MEM memory requirement

5e15589

Increased the number of retries

bdbcbb4

The default memory settings work just fine and make things easier to manipulate

3905219

Usage is very close to the trend line. Smaller bins work fine

c15d923

quay.io/ is now the default

4094724

Added optimised settings for crumble

3ef0477

Need to fake REF_PATH to force crumble to use the Fasta file defined in the UR field of the @SQ headers

f0c1ba4

muffato force-pushed the resource_optimisation branch from 9ded53a to 4094724 Compare December 8, 2023 22:31

muffato mentioned this pull request Dec 8, 2023

Combine steps with pipes to reduce disk footprint and make the pipeline faster #78

Closed

There is a difference for ONT, which I assume would be there for CLR too

79bcbbc

muffato marked this pull request as ready for review December 9, 2023 09:18

priyanka-surana approved these changes Dec 16, 2023

muffato mentioned this pull request Dec 18, 2023

Updated the docs before release #83

Merged

9 tasks

muffato mentioned this pull request Dec 18, 2023

Release 1.2 #84

Merged

9 tasks

muffato merged commit 0f4e2a1 into dev Dec 18, 2023
6 checks passed

muffato deleted the resource_optimisation branch December 18, 2023 13:19
