-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial release review [DO NOT MERGE] #29
Conversation
* restrict target regions to canonical Ensembl transcripts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally made it through :D . There are a few things that could probably be solved by already existing plugins or functions. I'd be curious to get your input. For me it is not a blocker for the first release but definitely something that should be added soon to make the pipeline components similar to the rest of nf-core.
Similarly for the test data.
after reviewing the pipeline and from the docs I think I would still not be certain when to use this pipeline, rnadnavar and when to use sarek. I left a note in the readme. Maybe you could clarify this a bit
Apply the seventh set of reviewer recommendations
Allow RNA BAM from samplesheet as LILAC input
I've created issues to continue on-going activities that do not block the initial release:
I've also marked remaining corresponding threads as resolved just to cleanly close off - let's continue discussing in the relevant GH issues or over Slack. |
As discussed on the nf-core Slack, all reviewers have given their approval and so we can now conclude the community review. There are several topics additional of discussion / points of action spun out from the review that will progress independently to the initial release (see above comment). An immense thank you to all of our reviewers - it's been a significant undertaking and I greatly appreciate all the time and feedback you've kindly given for this community review to be successful. |
Now closing this PR so I can open the 'Release 1.0.0' PR |
Background
nf-core/oncoanalyser is a cancer DNA/RNA analysis and reporting pipeline that implements the hmftools workflow, which is developed by the Hartwig Medical Foundation.
The hmftools workflow provides comprehensive analysis of cancer DNA/RNA data where each component is fine-tuned and optimised to operate in an integrated manner to improve genomic characterisation.
There are certain expectations around reference and input data, and how the hmftools workflow should be used. As such, these expectations have guided certain design decisions in oncoanalyser with some described below to provide further context for the community review.
Overarching design choices
Separation of run modes into workflows. The hmftools workflow is flexible and can be used to analyse different types of data (e.g. WGTS, targeted sequencing), which require specific configuration and reference data. There are also additional modes planned for hmftools that will eventually be implemented in oncoanalyser.
Hence, I have structured oncoanalyser so that the majority of the logic corresponding to different run modes is lifted up into workflows rather than at the subworkflow or module level. While this introduces some code duplication, much of the mode-specific logic is simplified by dividing it into different workflows providing a better maintenance and development experience. Retaining this layout will ease the addition of new run modes in the future.
Arbitrarily run individual tools or enter at any point. A key feature of oncoanalyser is the ability to begin analysis from any point in the pipeline or run just a specific set of tools. The primary use-case for this is to skip variant calling and run new/supplementary/modified downstream analyses. For example, a user can run oncoanalyser from PURPLE so long as they provide the required inputs.
Accordingly, outputs from each tool can be provided in the samplesheet and are recognised by oncoanalyser. When a user provides input for a given tool, it is retrieved in the respective subworkflow and inserted into the appropriate input channel. The feature to arbitrarily select tools to run is implemented by wrapping each tool subworkflow call in a conditional statement that checks whether the corresponding tool should be run.
Strong reference genome recommendations. Hartwig support only the reference genomes they distribute for analysis with the hmftools workflows. Hence, oncoanalyser is configured to provide users with a choice of the Hartwig GRCh37 or GRCh38 genome. While not recommended, custom reference genomes can be used (including those from iGenomes) and oncoanalyser will build all necessary indexes for the target analysis. There is additionally an option to write/publish prepared reference data to the output directory.
No egress fee reference data hosting. Given the use of Hartwig-distributed genomes and size of reference data required to run oncoanalyser, a cloud hosting provider is needed to deliver this data to users. I chose to host all default reference data (genomes, hmftools/panel resource files, and other databases) on Cloudflare R2, which is an object storage service that doesn't charge for egress. I've used Cloudflare R2 for this purpose since Jan 2023 with the nominal hosting costs being covered by our organisation (UMCCR).
Other notes