Annotation parallelization #84

nictru · 2023-11-19T08:59:16Z

The annotation currently works by going through a potentially very large file of circRNA locations and investigating each of them separately. This change introduces parallelization to that process.

rreggiar

IMO this is critical for more complex data, glad we both arrived at the solution separately. Everything makes sense -- I left some notes on the parallel implementation but that is an enhancement we can think about later as long as this works.

rreggiar · 2023-11-27T18:24:43Z

modules/local/annotation/full_annotation/main.nf

-    annotate_outputs.sh $exon_boundary &> ${prefix}.log
+    mkdir -p bed12
+
+    parallel -j $task.cpus -a circs.bed annotate_outputs.sh $exon_boundary {}


Since order doesn't matter, you could use parallel -u to speed up output. I went with a split > parallel > pool approach in my fork and feel like the implementation in this commit can be optimized, but as long as this works for now we can worry about that later.
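For readers unfamiliar with the "split > parallel > pool" pattern mentioned above, here is a minimal Python sketch of the idea. It is not the implementation from either fork: `annotate_chunk` is a hypothetical stand-in for `annotate_outputs.sh`, and a thread pool is used only to keep the example self-contained (a CPU-bound annotation would more likely use separate processes).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, n_chunks):
    """Split a list of BED lines into at most n_chunks roughly equal parts."""
    n_chunks = max(1, min(n_chunks, len(lines)))
    size, rest = divmod(len(lines), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `rest` chunks receive one extra line each.
        end = start + size + (1 if i < rest else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks


def annotate_chunk(chunk):
    """Hypothetical placeholder for the real per-circRNA annotation."""
    return [line + "\tannotated" for line in chunk]


def annotate_parallel(lines, cpus):
    """Split the input, annotate the chunks concurrently, pool the results."""
    chunks = split_into_chunks(lines, cpus)
    with ThreadPoolExecutor(max_workers=cpus) as pool:
        # Executor.map returns results in the order of the submitted chunks.
        results = list(pool.map(annotate_chunk, chunks))
    return [row for chunk in results for row in chunk]
```

Compared with launching one job per input line, each worker here handles a whole chunk, so the per-job startup cost is paid once per chunk rather than once per circRNA.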

nictru · 2023-11-29T09:16:08Z

Answer to the comments from #85

I switched to only annotating the combined circRNAs because previously this was done for each detection tool for each sample. This led to a large number of long-running tasks, which most likely had a large overlap. I understand that it might be interesting to have tool-specific annotations, but I think this should then be approached like this:

1. Union of all circRNAs
2. Split into multiple files (by chromosome or line number)
3. Annotate
4. Union of annotations
5. For each sample-tool combination extract the fitting annotations
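The five steps above can be sketched in Python roughly as follows. This is an illustration under assumptions, not the pipeline code: circRNAs are represented as `(chrom, start, end, strand)` tuples, the input is a dictionary keyed by `(sample, tool)`, and `annotate_batch` is a hypothetical placeholder for the expensive annotation step.

```python
from collections import defaultdict


def union_circrnas(calls):
    """Step 1: merge the circRNAs of all sample-tool combinations."""
    merged = set()
    for circs in calls.values():
        merged |= circs
    return merged


def split_by_chromosome(circs):
    """Step 2: one batch per chromosome."""
    batches = defaultdict(list)
    for circ in sorted(circs):
        batches[circ[0]].append(circ)
    return dict(batches)


def annotate_batch(batch):
    """Step 3: hypothetical placeholder for the real annotation logic."""
    return {c: "annotation_for_%s:%d-%d" % (c[0], c[1], c[2]) for c in batch}


def annotate_all(calls):
    batches = split_by_chromosome(union_circrnas(calls))
    annotations = {}
    for batch in batches.values():
        # Step 4: pool the per-batch annotations into one lookup table.
        annotations.update(annotate_batch(batch))
    # Step 5: hand each sample-tool combination only its own circRNAs.
    return {
        key: {c: annotations[c] for c in circs}
        for key, circs in calls.items()
    }
```

With this layout each distinct circRNA is annotated once, no matter how many tools or samples reported it.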

The core problem here is that even with the parallelized annotation, the annotation can take hours per task.

Notes on a potentially more elegant approach:
I tried using bedtools intersect directly on all circRNAs and the GTF file, which was super fast in comparison. The only problem here is that we need to aggregate all the hits for each circRNA afterwards. I can imagine that this could work well using pandas groupby.
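A rough sketch of that aggregation step, written with the standard library so it runs without dependencies (in pandas the same grouping would be a `DataFrame.groupby` on the circRNA key columns). It assumes the intersect was run with `-wa -wb`, a 6-column BED file as A and a GTF as B, so fields 0-5 come from the BED record and fields 6-14 from the GTF record. The function names and the summary fields are illustrative only.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs such as: gene_id "G1";
ATTR = re.compile(r'(\S+) "([^"]*)"')


def parse_hit(line):
    """Split one intersect line into a circRNA key and a GTF hit."""
    f = line.rstrip("\n").split("\t")
    circ = (f[0], int(f[1]), int(f[2]), f[5])  # chrom, start, end, strand
    attrs = dict(ATTR.findall(f[14]))          # GTF attribute column
    hit = {"feature": f[8], "start": int(f[9]), "end": int(f[10])}
    hit.update(attrs)
    return circ, hit


def group_hits(lines):
    """Collect all GTF hits that belong to the same circRNA."""
    groups = defaultdict(list)
    for line in lines:
        circ, hit = parse_hit(line)
        groups[circ].append(hit)
    return dict(groups)


def summarise(groups):
    """Aggregate each group into one record per circRNA."""
    return {
        circ: {
            "genes": sorted({h["gene_id"] for h in hits if "gene_id" in h}),
            "transcripts": sorted(
                {h["transcript_id"] for h in hits if "transcript_id" in h}
            ),
            "n_hits": len(hits),
        }
        for circ, hits in groups.items()
    }
```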

rreggiar · 2023-11-29T18:27:36Z

Yeah the annotation bottleneck is a major problem right now. Even if we intersect everything together, we'll need to apply the same logic to each group, correct? There may be a more efficient implementation of this, I can try some things. Do we want to work off a consistent intersection output? Do you have one handy that is small enough to share?

nictru · 2023-12-03T18:59:21Z

I sent you a minimal example based on the test configuration via slack. Currently the same bash script is applied to each batch (step 3 of the process):

1. Union of all circRNAs
2. Split into multiple files (by chromosome or line number)
3. Annotate
4. Union of annotations
5. For each sample-tool combination extract the fitting annotations

I still believe this could be a working approach:

I tried using bedtools intersect directly on all circRNAs and the GTF file, which was super fast in comparison. The only problem here is that we need to aggregate all the hits for each circRNA afterwards. I can imagine that this could work well using pandas groupby.

Note that bedtools intersect gives one line for each hit of a given circRNA in the GTF file, so potentially several lines per circRNA. We could then use some grouping mechanism to perform the main logic of the annotation script (determination of circRNA type and extraction of additional properties).
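To make the "main logic per group" concrete, here is a deliberately simplified Python sketch of a type decision applied to the hits of one circRNA. The rules and type names below are assumptions for illustration and are not taken from the annotation script: a circRNA counts as exonic when both ends lie within `exon_boundary` bases of an exon start and an exon end, as exon-intron when only one end does, as intronic when it has hits but no matching exon boundary, and as intergenic when it has no hits at all. All coordinates are assumed to be in the same coordinate system.

```python
def classify(circ_start, circ_end, hits, exon_boundary=0):
    """Assign a type to one circRNA from its grouped GTF hits.

    hits is a list of (feature, start, end) tuples, one per intersect line.
    The decision rules are illustrative assumptions, not the pipeline's.
    """
    if not hits:
        return "intergenic"
    exons = [(s, e) for feature, s, e in hits if feature == "exon"]
    # Does the circRNA start (end) coincide with some exon start (end)?
    start_ok = any(abs(circ_start - s) <= exon_boundary for s, _ in exons)
    end_ok = any(abs(circ_end - e) <= exon_boundary for _, e in exons)
    if start_ok and end_ok:
        return "circRNA"
    if start_ok or end_ok:
        return "EI-circRNA"
    return "ciRNA"


def classify_all(groups, exon_boundary=0):
    """Apply classify to a {(start, end): hits} mapping."""
    return {
        circ: classify(circ[0], circ[1], hits, exon_boundary)
        for circ, hits in groups.items()
    }
```

Because the function only sees the hits of a single circRNA, it can be applied unchanged to every group produced by the grouping step.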

github-actions · 2024-01-13T15:28:42Z

`nf-core lint` overall result: Failed ❌

Posted for pipeline commit 8afed2c

+| ✅ 179 tests passed       |+
#| ❔   1 tests were ignored |#
!| ❗   2 tests had warnings |!
-| ❌  14 tests failed       |-

❌ Test failures:

files_exist - File must be removed: lib/nfcore_external_java_deps.jar
nextflow_config - Config default value incorrect: params.outdir is set as ./results in nextflow_schema.json but is null in nextflow.config.
nextflow_config - Config default value incorrect: params.segemehl is set as None in nextflow_schema.json but is null in nextflow.config.
nextflow_config - Config default value incorrect: params.max_cpus is set as 16 in nextflow_schema.json but is 50 in nextflow.config.
nextflow_config - Config default value incorrect: params.max_memory is set as 128.GB in nextflow_schema.json but is 300.GB in nextflow.config.
files_unchanged - .github/workflows/branch.yml does not match the template
files_unchanged - .github/workflows/linting_comment.yml does not match the template
files_unchanged - .github/workflows/linting.yml does not match the template
files_unchanged - assets/email_template.html does not match the template
files_unchanged - assets/email_template.txt does not match the template
files_unchanged - assets/nf-core-circrna_logo_light.png does not match the template
files_unchanged - docs/images/nf-core-circrna_logo_light.png does not match the template
files_unchanged - docs/images/nf-core-circrna_logo_dark.png does not match the template
files_unchanged - pyproject.toml does not match the template

❗ Test warnings:

readme - README contains the placeholder zenodo.XXXXXXX. This should be replaced with the zenodo doi (after the first release).
system_exit - System.exit in circrna.nf: System.exit(1) [line 43]

❔ Tests ignored:

files_unchanged - File does not exist: lib/nfcore_external_java_deps.jar

✅ Tests passed:

files_exist - File found: .gitattributes
files_exist - File found: .gitignore
files_exist - File found: .nf-core.yml
files_exist - File found: .editorconfig
files_exist - File found: .prettierignore
files_exist - File found: .prettierrc.yml
files_exist - File found: CHANGELOG.md
files_exist - File found: CITATIONS.md
files_exist - File found: CODE_OF_CONDUCT.md
files_exist - File found: LICENSE or LICENSE.md or LICENCE or LICENCE.md
files_exist - File found: nextflow_schema.json
files_exist - File found: nextflow.config
files_exist - File found: README.md
files_exist - File found: .github/.dockstore.yml
files_exist - File found: .github/CONTRIBUTING.md
files_exist - File found: .github/ISSUE_TEMPLATE/bug_report.yml
files_exist - File found: .github/ISSUE_TEMPLATE/config.yml
files_exist - File found: .github/ISSUE_TEMPLATE/feature_request.yml
files_exist - File found: .github/PULL_REQUEST_TEMPLATE.md
files_exist - File found: .github/workflows/branch.yml
files_exist - File found: .github/workflows/ci.yml
files_exist - File found: .github/workflows/linting_comment.yml
files_exist - File found: .github/workflows/linting.yml
files_exist - File found: assets/email_template.html
files_exist - File found: assets/email_template.txt
files_exist - File found: assets/sendmail_template.txt
files_exist - File found: assets/nf-core-circrna_logo_light.png
files_exist - File found: conf/modules.config
files_exist - File found: conf/test.config
files_exist - File found: conf/test_full.config
files_exist - File found: docs/images/nf-core-circrna_logo_light.png
files_exist - File found: docs/images/nf-core-circrna_logo_dark.png
files_exist - File found: docs/output.md
files_exist - File found: docs/README.md
files_exist - File found: docs/README.md
files_exist - File found: docs/usage.md
files_exist - File found: lib/NfcoreTemplate.groovy
files_exist - File found: lib/Utils.groovy
files_exist - File found: lib/WorkflowMain.groovy
files_exist - File found: main.nf
files_exist - File found: assets/multiqc_config.yml
files_exist - File found: conf/base.config
files_exist - File found: conf/igenomes.config
files_exist - File found: .github/workflows/awstest.yml
files_exist - File found: .github/workflows/awsfulltest.yml
files_exist - File found: lib/WorkflowCircrna.groovy
files_exist - File found: modules.json
files_exist - File found: pyproject.toml
files_exist - File not found check: Singularity
files_exist - File not found check: parameters.settings.json
files_exist - File not found check: pipeline_template.yml
files_exist - File not found check: .nf-core.yaml
files_exist - File not found check: bin/markdown_to_html.r
files_exist - File not found check: conf/aws.config
files_exist - File not found check: .github/workflows/push_dockerhub.yml
files_exist - File not found check: .github/ISSUE_TEMPLATE/bug_report.md
files_exist - File not found check: .github/ISSUE_TEMPLATE/feature_request.md
files_exist - File not found check: docs/images/nf-core-circrna_logo.png
files_exist - File not found check: .markdownlint.yml
files_exist - File not found check: .yamllint.yml
files_exist - File not found check: lib/Checks.groovy
files_exist - File not found check: lib/Completion.groovy
files_exist - File not found check: lib/Workflow.groovy
files_exist - File not found check: .travis.yml
nextflow_config - Config variable found: manifest.name
nextflow_config - Config variable found: manifest.nextflowVersion
nextflow_config - Config variable found: manifest.description
nextflow_config - Config variable found: manifest.version
nextflow_config - Config variable found: manifest.homePage
nextflow_config - Config variable found: timeline.enabled
nextflow_config - Config variable found: trace.enabled
nextflow_config - Config variable found: report.enabled
nextflow_config - Config variable found: dag.enabled
nextflow_config - Config variable found: process.cpus
nextflow_config - Config variable found: process.memory
nextflow_config - Config variable found: process.time
nextflow_config - Config variable found: params.outdir
nextflow_config - Config variable found: params.input
nextflow_config - Config variable found: params.validationShowHiddenParams
nextflow_config - Config variable found: params.validationSchemaIgnoreParams
nextflow_config - Config variable found: manifest.mainScript
nextflow_config - Config variable found: timeline.file
nextflow_config - Config variable found: trace.file
nextflow_config - Config variable found: report.file
nextflow_config - Config variable found: dag.file
nextflow_config - Config variable (correctly) not found: params.nf_required_version
nextflow_config - Config variable (correctly) not found: params.container
nextflow_config - Config variable (correctly) not found: params.singleEnd
nextflow_config - Config variable (correctly) not found: params.igenomesIgnore
nextflow_config - Config variable (correctly) not found: params.name
nextflow_config - Config variable (correctly) not found: params.enable_conda
nextflow_config - Config timeline.enabled had correct value: true
nextflow_config - Config report.enabled had correct value: true
nextflow_config - Config trace.enabled had correct value: true
nextflow_config - Config dag.enabled had correct value: true
nextflow_config - Config manifest.name began with nf-core/
nextflow_config - Config variable manifest.homePage began with https://github.com/nf-core/
nextflow_config - Config dag.file ended with .html
nextflow_config - Config variable manifest.nextflowVersion started with >= or !>=
nextflow_config - Config manifest.version ends in dev: dev
nextflow_config - Config params.custom_config_version is set to master
nextflow_config - Config params.custom_config_base is set to https://raw.githubusercontent.com/nf-core/configs/master
nextflow_config - Lines for loading custom profiles found
nextflow_config - nextflow.config contains configuration profile test
nextflow_config - Config default value correct: params.tool
nextflow_config - Config default value correct: params.module
nextflow_config - Config default value correct: params.bsj_reads
nextflow_config - Config default value correct: params.tool_filter
nextflow_config - Config default value correct: params.duplicates_fun
nextflow_config - Config default value correct: params.save_intermediates
nextflow_config - Config default value correct: params.exon_boundary
nextflow_config - Config default value correct: params.sjdboverhang
nextflow_config - Config default value correct: params.chimJunctionOverhangMin
nextflow_config - Config default value correct: params.alignSJDBoverhangMin
nextflow_config - Config default value correct: params.chimSegmentMin
nextflow_config - Config default value correct: params.seglen
nextflow_config - Config default value correct: params.min_intron
nextflow_config - Config default value correct: params.max_intron
nextflow_config - Config default value correct: params.min_map_len
nextflow_config - Config default value correct: params.min_fusion_distance
nextflow_config - Config default value correct: params.save_unaligned
nextflow_config - Config default value correct: params.save_reference
nextflow_config - Config default value correct: params.hisat2_build_memory
nextflow_config - Config default value correct: params.skip_trimming
nextflow_config - Config default value correct: params.save_trimmed
nextflow_config - Config default value correct: params.skip_fastqc
nextflow_config - Config default value correct: params.min_trimmed_reads
nextflow_config - Config default value correct: params.custom_config_version
nextflow_config - Config default value correct: params.custom_config_base
nextflow_config - Config default value correct: params.max_time
nextflow_config - Config default value correct: params.publish_dir_mode
nextflow_config - Config default value correct: params.max_multiqc_email_size
nextflow_config - Config default value correct: params.validate_params
files_unchanged - .gitattributes matches the template
files_unchanged - .prettierrc.yml matches the template
files_unchanged - CODE_OF_CONDUCT.md matches the template
files_unchanged - LICENSE matches the template
files_unchanged - .github/.dockstore.yml matches the template
files_unchanged - .github/CONTRIBUTING.md matches the template
files_unchanged - .github/ISSUE_TEMPLATE/bug_report.yml matches the template
files_unchanged - .github/ISSUE_TEMPLATE/config.yml matches the template
files_unchanged - .github/ISSUE_TEMPLATE/feature_request.yml matches the template
files_unchanged - .github/PULL_REQUEST_TEMPLATE.md matches the template
files_unchanged - assets/sendmail_template.txt matches the template
files_unchanged - docs/README.md matches the template
files_unchanged - lib/NfcoreTemplate.groovy matches the template
files_unchanged - .gitignore matches the template
files_unchanged - .prettierignore matches the template
actions_ci - '.github/workflows/ci.yml' is triggered on expected events
actions_ci - '.github/workflows/ci.yml' checks minimum NF version
actions_awstest - '.github/workflows/awstest.yml' is triggered correctly
actions_awsfulltest - .github/workflows/awsfulltest.yml is triggered correctly
actions_awsfulltest - .github/workflows/awsfulltest.yml does not use -profile test
readme - README Nextflow minimum version badge matched config. Badge: 23.04.0, Config: 23.04.0
pipeline_todos - No TODO strings found
pipeline_name_conventions - Name adheres to nf-core convention
template_strings - Did not find any Jinja template strings (296 files)
schema_lint - Schema lint passed
schema_lint - Schema title + description lint passed
schema_lint - Input mimetype lint passed: 'text/csv'
schema_params - Schema matched params returned from nextflow config
actions_schema_validation - Workflow validation passed: linting_comment.yml
actions_schema_validation - Workflow validation passed: fix-linting.yml
actions_schema_validation - Workflow validation passed: awsfulltest.yml
actions_schema_validation - Workflow validation passed: release-announcements.yml
actions_schema_validation - Workflow validation passed: awstest.yml
actions_schema_validation - Workflow validation passed: linting.yml
actions_schema_validation - Workflow validation passed: clean-up.yml
actions_schema_validation - Workflow validation passed: branch.yml
actions_schema_validation - Workflow validation passed: ci.yml
merge_markers - No merge markers found in pipeline files
modules_json - Only installed modules found in modules.json
multiqc_config - 'assets/multiqc_config.yml' contains report_section_order
multiqc_config - 'assets/multiqc_config.yml' contains export_plots
multiqc_config - 'assets/multiqc_config.yml' contains report_comment
multiqc_config - 'assets/multiqc_config.yml' follows the ordering scheme of the minimally required plugins.
multiqc_config - 'assets/multiqc_config.yml' contains a matching 'report_comment'.
multiqc_config - 'assets/multiqc_config.yml' contains 'export_plots: true'.
modules_structure - modules directory structure is correct 'modules/nf-core/TOOL/SUBTOOL'

Run details

nf-core/tools version 2.12
Run at 2024-01-31 09:21:26

nictru · 2024-02-02T12:03:58Z

#95 fixes this better

nictru added the enhancement label Nov 19, 2023

nictru self-assigned this Nov 19, 2023

nictru mentioned this pull request Nov 19, 2023

Single stranded #82

Merged

rreggiar self-assigned this Nov 27, 2023

rreggiar approved these changes Nov 27, 2023


nictru mentioned this pull request Nov 29, 2023

Implement unified quantification #85

Closed

nictru force-pushed the annotation branch from 41b8d96 to c15739f Compare November 29, 2023 09:22

nictru added 7 commits January 31, 2024 10:18

Implement multi-threading in annotation (29de6f0)
Switch to annotating only combined circRNAs (f4c3e0a)
Parallelize annotation chromosome-wise (5cd8472)
Fix annotation channel structure (9654d3f)
Prettier (9412724)
Implement bedtools-pandas annotation (191a5b8)
Add transcript ID to annotation (8afed2c)

nictru force-pushed the annotation branch from 59f77c7 to 8afed2c Compare January 31, 2024 09:20

nictru closed this Feb 2, 2024

nictru deleted the annotation branch February 28, 2024 19:22
