Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

simplify TABIX generation + refactor #48

Merged
merged 28 commits into from
Dec 4, 2024
Merged

Conversation

maxulysse
Copy link
Member

@maxulysse maxulysse commented Dec 2, 2024

Simplifying handling of vcf files:

  • I think we only need to need 1 single TABIX_TABIX process and generate any necessary tbi from it

While doing that I did some code refactoring to properly handle case when input contains only vcfs or only fastas and tools that would need these assets were failing.

What the refactoring actually did, and what it implies:

So as this is just for v1 of references (ie updating igenomes), I work with a very bad versions of the assets files we envision for v2 (aka igenomes reboot).
So with that end-product in mind, my plan is to be able from assets that describe what we currently have in igenomes to generate what is missing.

We currently have a very permissive setup for the assets file, for which what is required is just the meta data (ie genome, species, build).
Which mean we can index a single vcf without caring about the fasta (because we shouldn't).
Or we can build intervals from a fasta.fai without even caring about the fasta (just because we can).

And that was actually already possible before this code refactoring.

What I did in this refactoring was simplify vcf index generation by splitting the input, so instead of 4 different process for handling just 4 types of vcf, we have now a single process to index any vcf.

I moved the input handling in its own subworkflow just to simplify the readability of the workflow.
And replaced the multiMap wih multiple maps as I noticed some issues with the multiMap operator.
Plan is to rework that done the line.

I grouped together into badly named subworkflows some modules that made send to be grouped together based on what they were doing and what files they need, so that the workflow is nicer.
Plan for this is definitively open and there will probably be some rewrite done the line too.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/references branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@maxulysse maxulysse changed the title simplify TABIX generation simplify TABIX generation + refacroe Dec 3, 2024
@maxulysse maxulysse changed the title simplify TABIX generation + refacroe simplify TABIX generation + refactor Dec 3, 2024
Copy link
Member

@pinin4fjords pinin4fjords left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not completely following the refactoring, but in general:

  • Could we have comments on inputs and outputs to indicate expected structures/ formats? Seem to have got lost in the refactoring.
  • Can we name subworkflows in a more consistent/ useful way?
  • Maybe add a comment to each subworkflow explaining its purpose?

subworkflows/local/samplesheet_to_channel/main.nf Outdated Show resolved Hide resolved
subworkflows/local/index_vcf/main.nf Outdated Show resolved Hide resolved
subworkflows/local/create_align_index/main.nf Show resolved Hide resolved

main:

intervals_bed = reference.map { meta, input_intervals_bed, input_fasta, input_fasta_dict, input_fasta_fai, input_fasta_sizes, input_gff, input_gtf, input_splice_sites, input_transcript_fasta, input_vcf, input_readme, input_bed12, input_mito_name, input_macs_gsize ->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't a multimap save the repetition of the map operation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently multiMap does not allow for outputing null

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe output a string 'NULL' or something and then have a map to switch it out to a NULL in the emits?

Imagine having to fix all of these if the input channel changes in structure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's what I had before, and I had to remap every single channel, so I'd rather just use multiple maps for now rather than multiMap + multiple maps.
I'll make an MRE for Ben and others to solve

Copy link

@adamrtalbot adamrtalbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic I do follow is sound, but there's a lot here and I don't quite follow what is going on.

@@ -0,0 +1,60 @@
workflow SAMPLESHEET_TO_CHANNEL {
take:
reference // channel: [meta, intervals_bed, fasta, fasta_dict, fasta_fai, fasta_sizes, gff, gtf, splice_sites, transcript_fasta, vcf, readme, bed12, mito_name, macs_gsize]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use a channel of val(map) here:

[
    meta: metamap
    intervals_bed: file(intervals_bed)
    ...
]

Then convert it to a normal tuple before use with a map. This will be similar to Ben's typing implementation that is coming.

myMap.map { myMap -> [ myMap.meta, myMap.intervals_bed ]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So happy to improve the input handling later on, but for now this is working, I'd rather not change it now

return input_intervals_bed ? [meta, input_intervals_bed] : null
}

fasta = reference.map { meta, input_intervals_bed, input_fasta, input_fasta_dict, input_fasta_fai, input_fasta_sizes, input_gff, input_gtf, input_splice_sites, input_transcript_fasta, input_vcf, input_readme, input_bed12, input_mito_name, input_macs_gsize ->

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should use a custom class like Jason Fan demonstrated at the Boston summit...

Copy link

@adamrtalbot adamrtalbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it very hard to follow what's going on here, so I'm going to trust you here.

Could you update the PR description to describe what the refactoring is?

@maxulysse maxulysse merged commit ec829e0 into nf-core:dev Dec 4, 2024
18 checks passed
@maxulysse maxulysse deleted the simplify_vcf branch December 4, 2024 09:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants