simplify TABIX generation + refactor #48
Conversation
Not completely following the refactoring, but in general:
- Could we have comments on inputs and outputs to indicate the expected structures/formats? They seem to have got lost in the refactoring.
- Can we name the subworkflows in a more consistent/useful way?
- Maybe add a comment to each subworkflow explaining its purpose?
main:

    intervals_bed = reference.map { meta, input_intervals_bed, input_fasta, input_fasta_dict, input_fasta_fai, input_fasta_sizes, input_gff, input_gtf, input_splice_sites, input_transcript_fasta, input_vcf, input_readme, input_bed12, input_mito_name, input_macs_gsize ->
Wouldn't a multiMap save the repetition of the map operation?
Currently multiMap does not allow for outputting null.
Maybe output a string 'NULL' or something, and then have a map to switch it back to null in the emits?
Imagine having to fix all of these if the input channel changes in structure.
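A minimal sketch of that sentinel idea, assuming a trimmed-down reference channel (the channel contents and names here are illustrative, not this PR's actual code):

```nextflow
// multiMap cannot emit null directly, so emit a placeholder string
// and swap it back to null in a follow-up map.
reference
    .multiMap { meta, input_intervals_bed, input_fasta ->
        intervals_bed: input_intervals_bed ? [ meta, input_intervals_bed ] : 'NULL'
        fasta:         input_fasta         ? [ meta, input_fasta ]         : 'NULL'
    }
    .set { assets }

intervals_bed = assets.intervals_bed.map { it == 'NULL' ? null : it }
fasta         = assets.fasta.map         { it == 'NULL' ? null : it }
```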
That's what I had before, and I had to remap every single channel, so I'd rather just use multiple maps for now rather than multiMap + multiple maps.
I'll make an MRE for Ben and others to solve.
The logic I do follow is sound, but there's a lot here and I don't quite follow what is going on.
@@ -0,0 +1,60 @@
workflow SAMPLESHEET_TO_CHANNEL {
    take:
    reference // channel: [meta, intervals_bed, fasta, fasta_dict, fasta_fai, fasta_sizes, gff, gtf, splice_sites, transcript_fasta, vcf, readme, bed12, mito_name, macs_gsize]
You could use a channel of val(map) here:
[
    meta:          metamap,
    intervals_bed: file(intervals_bed),
    ...
]
Then convert it to a normal tuple before use with a map. This will be similar to Ben's typing implementation that is coming.
myMap.map { myMap -> [ myMap.meta, myMap.intervals_bed ] }
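A small self-contained sketch of that approach, with illustrative channel contents (the asset names and files are assumptions, not this PR's samplesheet):

```nextflow
// Each channel item is a single Groovy map, so adding or removing an asset
// never forces a change to every destructuring closure downstream.
ch_reference = Channel.of(
    [
        meta:          [ id: 'GRCh38' ],
        intervals_bed: file('intervals.bed'),
        fasta:         file('genome.fasta')
    ]
)

// Convert to a plain [ meta, file ] tuple only where a module needs one
intervals_bed = ch_reference.map { assets -> [ assets.meta, assets.intervals_bed ] }
fasta         = ch_reference.map { assets -> [ assets.meta, assets.fasta ] }
```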
Happy to improve the input handling later on, but for now this is working, so I'd rather not change it.
        return input_intervals_bed ? [meta, input_intervals_bed] : null
    }

    fasta = reference.map { meta, input_intervals_bed, input_fasta, input_fasta_dict, input_fasta_fai, input_fasta_sizes, input_gff, input_gtf, input_splice_sites, input_transcript_fasta, input_vcf, input_readme, input_bed12, input_mito_name, input_macs_gsize ->
Maybe we should use a custom class like Jason Fan demonstrated at the Boston summit...
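One possible shape of that custom-class idea, purely as a sketch (the class and field names are assumptions, not what was shown at the summit):

```nextflow
import java.nio.file.Path

// A small value class gives named, typed fields instead of positional tuple slots.
class ReferenceAssets {
    Map  meta
    Path intervals_bed
    Path fasta
}

// Channel items then become single objects, and extraction reads as field access:
// intervals_bed = ch_reference.map { assets -> [ assets.meta, assets.intervals_bed ] }
```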
I find it very hard to follow what's going on here, so I'm going to trust you here.
Could you update the PR description to describe what the refactoring is?
Simplifying handling of VCF files:
While doing that, I did some code refactoring to properly handle the case where the input contains only VCFs or only FASTAs, since tools that needed the missing assets were failing.
What the refactoring actually did, and what it implies:
As this is just for v1 of references (i.e. updating igenomes), I am working with a very rough version of the assets files we envision for v2 (aka the igenomes reboot).
With that end product in mind, my plan is to be able to generate what is missing from assets that describe what we currently have in igenomes.
We currently have a very permissive setup for the assets file: the only required fields are the metadata (i.e. genome, species, build).
Which means we can index a single VCF without caring about the FASTA (because we shouldn't have to).
Or we can build intervals from a fasta.fai without even caring about the FASTA (just because we can).
And that was actually already possible before this code refactoring.
What I did in this refactoring was simplify VCF index generation by splitting the input, so instead of 4 different processes handling 4 types of VCF, we now have a single process to index any VCF.
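A minimal sketch of that single-process shape, assuming the nf-core tabix/tabix module and illustrative channel/subworkflow names (none of this is taken verbatim from the PR):

```nextflow
include { TABIX_TABIX } from '../modules/nf-core/tabix/tabix/main'

// Every VCF asset, whatever it is meant for, ends up in one [ meta, vcf ] channel
// and goes through the same indexing process.
workflow INDEX_ANY_VCF {
    take:
    vcf   // channel: [ val(meta), path(vcf) ]

    main:
    TABIX_TABIX ( vcf )

    emit:
    tbi      = TABIX_TABIX.out.tbi      // channel: [ val(meta), path(tbi) ]
    versions = TABIX_TABIX.out.versions // channel: [ path(versions.yml) ]
}
```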
I moved the input handling into its own subworkflow just to improve the readability of the workflow.
And I replaced the multiMap with multiple maps, as I noticed some issues with the multiMap operator.
The plan is to rework that down the line.
I grouped some modules together into (badly named) subworkflows based on what they do and what files they need, so that the workflow is nicer.
The plan for this is definitely open, and there will probably be some rewrites down the line too.
PR checklist
- Make sure your code lints (nf-core pipelines lint).
- Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
- Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
- docs/usage.md is updated.
- docs/output.md is updated.
- CHANGELOG.md is updated.
- README.md is updated (including new tool citations and authors/contributors).