simplify TABIX generation + refactor #48

maxulysse · 2024-12-02T10:04:31Z

Simplifying handling of vcf files:

I think we only need a single TABIX_TABIX process, and can generate any necessary tbi from it.

While doing that, I refactored some code to properly handle the cases where the input contains only vcfs or only fastas, in which tools that need the missing assets were failing.

What the refactoring actually did, and what it implies:

As this is just for v1 of references (i.e. updating igenomes), I am working with a very rough version of the assets files we envision for v2 (aka igenomes reboot).
With that end product in mind, my plan is to be able to generate what is missing from assets files that describe what we currently have in igenomes.

We currently have a very permissive setup for the assets file, in which only the metadata (i.e. genome, species, build) is required.
This means we can index a single vcf without caring about the fasta (because we shouldn't have to).
Or we can build intervals from a fasta.fai without even caring about the fasta (just because we can).

And that was actually already possible before this code refactoring.

What I did in this refactoring was simplify vcf index generation by splitting the input, so instead of 4 different processes handling just 4 types of vcf, we now have a single process to index any vcf.
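
A minimal sketch of the single-process idea, not the code from this PR: every vcf, whatever its type, travels as one `[meta, vcf]` item through one indexing process. The process name `INDEX_ANY_VCF`, the `type` meta key, the type labels, and the file paths are all hypothetical; the real pipeline uses the nf-core TABIX_TABIX module, and `tabix` is assumed to be available on the PATH.

```nextflow
// Hypothetical stand-in for the nf-core TABIX_TABIX module.
process INDEX_ANY_VCF {
    input:
    tuple val(meta), path(vcf)

    output:
    tuple val(meta), path("*.tbi"), emit: tbi

    script:
    """
    tabix -p vcf ${vcf}
    """
}

workflow {
    // Assumed input shape: one [meta, vcf] item per vcf, regardless of type.
    // The paths are placeholders and must point to existing bgzipped vcf files.
    vcf = Channel.of(
        [[id: 'genomeA', type: 'known_snps'], file('known_snps.vcf.gz')],
        [[id: 'genomeA', type: 'known_indels'], file('known_indels.vcf.gz')]
    )

    // One process call covers every type; no per-type process is needed.
    INDEX_ANY_VCF(vcf)
    INDEX_ANY_VCF.out.tbi.view { item -> "${item[0].type}: ${item[1]}" }
}
```

Keeping the vcf type in `meta` rather than in the process name is what would let downstream steps tell the resulting indexes apart.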

I moved the input handling into its own subworkflow to make the workflow more readable.
And replaced the multiMap with multiple maps, as I noticed some issues with the multiMap operator.
The plan is to rework that down the line.

I grouped some modules into (badly named) subworkflows where it made sense to group them, based on what they do and which files they need, so that the workflow is nicer.
The plan for this is definitely open, and there will probably be some rewriting down the line too.

PR checklist

pinin4fjords

Not completely following the refactoring, but in general:

Could we have comments on inputs and outputs to indicate expected structures/formats? They seem to have got lost in the refactoring.
Can we name subworkflows in a more consistent/useful way?
Maybe add a comment to each subworkflow explaining its purpose?

subworkflows/local/samplesheet_to_channel/main.nf

subworkflows/local/index_vcf/main.nf

subworkflows/local/create_align_index/main.nf

pinin4fjords · 2024-12-03T12:50:32Z

subworkflows/local/samplesheet_to_channel/main.nf

+
+    main:
+
+    intervals_bed = reference.map { meta, input_intervals_bed, input_fasta, input_fasta_dict, input_fasta_fai, input_fasta_sizes, input_gff, input_gtf, input_splice_sites, input_transcript_fasta, input_vcf, input_readme, input_bed12, input_mito_name, input_macs_gsize ->


Wouldn't a multiMap save the repetition of the map operation?

Currently multiMap does not allow for outputting null.

Maybe output a string 'NULL' or something and then have a map to switch it out to a NULL in the emits?

Imagine having to fix all of these if the input channel changes in structure.

That's what I had before, and I had to remap every single channel, so I'd rather just use multiple maps for now than multiMap + multiple maps.
I'll make an MRE for Ben and others to solve.
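
For reference, the sentinel idea discussed above could look roughly like the following. This is a sketch on a reduced three-field input with made-up values, not the PR's code: the `'NULL'` marker comes from the suggestion in this thread, and `filter` is used here in place of the proposed follow-up `map` to drop the marker.

```nextflow
workflow {
    // Reduced, made-up input: [meta, intervals_bed, fasta]; an empty list marks a missing asset.
    reference = Channel.of(
        [[id: 'genomeA'], 'genomeA.bed', 'genomeA.fa'],
        [[id: 'genomeB'], [], 'genomeB.fa']
    )

    // One closure unpacks each row once; absent assets become the 'NULL' marker.
    reference
        .multiMap { meta, input_intervals_bed, input_fasta ->
            intervals_bed: input_intervals_bed ? [meta, input_intervals_bed] : 'NULL'
            fasta: input_fasta ? [meta, input_fasta] : 'NULL'
        }
        .set { assets }

    // Each output still needs its own clean-up step to remove the marker.
    intervals_bed = assets.intervals_bed.filter { item -> item != 'NULL' }
    fasta = assets.fasta.filter { item -> item != 'NULL' }

    intervals_bed.view { item -> "intervals_bed: ${item}" }
    fasta.view { item -> "fasta: ${item}" }
}
```

The parameter list is written only once, but every emitted channel needs a second pass, which is the trade-off raised in the reply above.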

subworkflows/local/create_align_index_with_gff/main.nf

adamrtalbot

The logic I do follow is sound, but there's a lot here and I don't quite follow what is going on.

adamrtalbot · 2024-12-04T09:08:14Z

subworkflows/local/samplesheet_to_channel/main.nf

@@ -0,0 +1,60 @@
+workflow SAMPLESHEET_TO_CHANNEL {
+    take:
+    reference // channel: [meta, intervals_bed, fasta, fasta_dict, fasta_fai, fasta_sizes, gff, gtf, splice_sites, transcript_fasta, vcf, readme, bed12, mito_name, macs_gsize]


You could use a channel of val(map) here:

[ meta: metamap, intervals_bed: file(intervals_bed), ... ]

Then convert it to a normal tuple before use with a map. This will be similar to Ben's typing implementation that is coming.

myMap.map { m -> [ m.meta, m.intervals_bed ] }

Happy to improve the input handling later on, but this is working for now, so I'd rather not change it yet.

adamrtalbot · 2024-12-04T09:09:58Z

subworkflows/local/samplesheet_to_channel/main.nf

+        return input_intervals_bed ? [meta, input_intervals_bed] : null
+    }
+
+    fasta = reference.map { meta, input_intervals_bed, input_fasta, input_fasta_dict, input_fasta_fai, input_fasta_sizes, input_gff, input_gtf, input_splice_sites, input_transcript_fasta, input_vcf, input_readme, input_bed12, input_mito_name, input_macs_gsize ->


Maybe we should use a custom class like Jason Fan demonstrated at the Boston summit...

adamrtalbot

I find it very hard to follow what's going on here, so I'm going to trust you here.

Could you update the PR description to describe what the refactoring is?

maxulysse added 9 commits December 2, 2024 11:04

simplify TABIX generation

19e3c94

update CHANGELOG

f7aa2c1

Fix input fasta non existing

537e2ab

fix typo

5761521

Merge branch 'dev' into simplify_vcf

7ccf815

Fix input fasta channel when non existing fasta

5f3d3d9

typo

e7f9c6f

fix typo

d0940b8

refactor

53fb552

maxulysse changed the title ~~simplify TABIX generation~~ simplify TABIX generation + refacroe Dec 3, 2024

maxulysse changed the title ~~simplify TABIX generation + refacroe~~ simplify TABIX generation + refactor Dec 3, 2024

maxulysse added 12 commits December 3, 2024 09:41

better gff / gtf handling

ac8d23b

restore output

3d65e66

fix vcf inputs

6d52079

better handling of fasta_transcript

61b06ac

forgot versions

d058e87

better handling of splice_sites

3e8c72c

code polish

0b1a20a

typo

2f82928

handle existing fai, sizes and intervals

afa29c1

update CHANGELOG

53cf72c

code polish

c8ffd85

fix handling of existing fai

a167a5c

pinin4fjords reviewed Dec 3, 2024


better comments

7108fda

pinin4fjords reviewed Dec 3, 2024


subworkflows/local/create_align_index_with_gff/main.nf (outdated)

maxulysse and others added 4 commits December 3, 2024 16:01

Merge branch 'nf-core:dev' into simplify_vcf

c164baa

Merge branch 'dev' into simplify_vcf

f1817ae

document output structure

50e6c43

badly merge conflicts

1b6bd6b

maxulysse added 2 commits December 3, 2024 19:12

code polish

4974124

fix typo

95949e4

adamrtalbot reviewed Dec 4, 2024


adamrtalbot approved these changes Dec 4, 2024


maxulysse merged commit ec829e0 into nf-core:dev Dec 4, 2024
18 checks passed

maxulysse deleted the simplify_vcf branch December 4, 2024 09:47
