Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Annotation parallelization #84

Closed
wants to merge 7 commits into from
Closed

Annotation parallelization #84

wants to merge 7 commits into from

Conversation

nictru
Copy link
Collaborator

@nictru nictru commented Nov 19, 2023

The annotation currently works by going through a - potentially very large - file containing circRNA location and investigating each of them separately. This changes introduce parallelization to this process

@nictru nictru added the enhancement Improvement for existing functionality label Nov 19, 2023
@nictru nictru self-assigned this Nov 19, 2023
@nictru nictru mentioned this pull request Nov 19, 2023
@rreggiar rreggiar self-assigned this Nov 27, 2023
Copy link

@rreggiar rreggiar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this is critical for more complex data, glad we both arrived at the solution separately. Everything makes sense -- I left some notes on the parallel implementation but that is an enhancement we can think about later as long as this works.

annotate_outputs.sh $exon_boundary &> ${prefix}.log
mkdir -p bed12

parallel -j $task.cpus -a circs.bed annotate_outputs.sh $exon_boundary {}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since order doesn't matter, you could use parallel -u to speed up output. I went with a split > parallel > pool approach in my fork and feel like the implementation in this commit can be optimized, but as long as this works for now we can worry about that later.

@nictru
Copy link
Collaborator Author

nictru commented Nov 29, 2023

Answer to the comments from #85

I switched to only annotating the combined circRNAs because previously this was done for each detection tool for each sample. This led to a large number of long-running tasks, which most likely had a large overlap. I understand that it might be interesting to have tool-specific annotations, but I think this should then be approached like this:

  1. Union of all circRNAs
  2. Split into multiple files (by chromosome or line number)
  3. Annotate
  4. Union of annotations
  5. For each sample-tool combination extract the fitting annotations

The core problem here is, that even with the parallelized annotation, the annotation can take hours per task.

Notes on a potentially more elegant approach:
I tried using bedtools intersect directly on all circRNAs and the GTF file, which was super fast in comparison. The only problem here is, that we need to do some aggregation of all the hits for each circRNA afterwards. I can imagine that this could work well using pandas.groupBy.

@rreggiar
Copy link

Yeah the annotation bottleneck is a major problem right now. Even if we intersect everything together, we'll need to apply the same logic to each group, correct? There may be a more efficient implementation of this, I can try some things. Do we want to work off a consistent intersection output? Do you have one handy that is small enough to share?

@nictru
Copy link
Collaborator Author

nictru commented Dec 3, 2023

I sent you a minimal example based on the test configuration via slack. Currently the same bash script is applied to each batch (step 3 of the process):

  1. Union of all circRNAs
  2. Split into multiple files (by chromosome or line number)
  3. Annotate
  4. Union of annotations
  5. For each sample-tool combination extract the fitting annotations

I still believe this could be a working approach:

I tried using bedtools intersect directly on all circRNAs and the GTF file, which was super fast in comparison. The only problem here is, that we need to do some aggregation of all the hits for each circRNA afterwards. I can imagine that this could work well using pandas.groupBy.

Note, that the bedtools intersect gives one line for each hit of a given circRNA in the GTF file, so potentially multiple for each circRNA. We could then use some grouping mechanism to perform the main logic of the annotation script (determination of circRNA type and extraction of additional properties).

Copy link

github-actions bot commented Jan 13, 2024

nf-core lint overall result: Failed ❌

Posted for pipeline commit 8afed2c

+| ✅ 179 tests passed       |+
#| ❔   1 tests were ignored |#
!| ❗   2 tests had warnings |!
-| ❌  14 tests failed       |-

❌ Test failures:

  • files_exist - File must be removed: lib/nfcore_external_java_deps.jar
  • nextflow_config - Config default value incorrect: params.outdir is set as ./results in nextflow_schema.json but is null in nextflow.config.
  • nextflow_config - Config default value incorrect: params.segemehl is set as None in nextflow_schema.json but is null in nextflow.config.
  • nextflow_config - Config default value incorrect: params.max_cpus is set as 16 in nextflow_schema.json but is 50 in nextflow.config.
  • nextflow_config - Config default value incorrect: params.max_memory is set as 128.GB in nextflow_schema.json but is 300.GB in nextflow.config.
  • files_unchanged - .github/workflows/branch.yml does not match the template
  • files_unchanged - .github/workflows/linting_comment.yml does not match the template
  • files_unchanged - .github/workflows/linting.yml does not match the template
  • files_unchanged - assets/email_template.html does not match the template
  • files_unchanged - assets/email_template.txt does not match the template
  • files_unchanged - assets/nf-core-circrna_logo_light.png does not match the template
  • files_unchanged - docs/images/nf-core-circrna_logo_light.png does not match the template
  • files_unchanged - docs/images/nf-core-circrna_logo_dark.png does not match the template
  • files_unchanged - pyproject.toml does not match the template

❗ Test warnings:

  • readme - README contains the placeholder zenodo.XXXXXXX. This should be replaced with the zenodo doi (after the first release).
  • system_exit - System.exit in circrna.nf: System.exit(1) [line 43]

❔ Tests ignored:

✅ Tests passed:

Run details

  • nf-core/tools version 2.12
  • Run at 2024-01-31 09:21:26

@nictru
Copy link
Collaborator Author

nictru commented Feb 2, 2024

#95 fixes this better

@nictru nictru closed this Feb 2, 2024
@nictru nictru deleted the annotation branch February 28, 2024 19:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improvement for existing functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants