Annotation parallelization #84
IMO this is critical for more complex data, glad we both arrived at the solution separately. Everything makes sense -- I left some notes on the parallel implementation but that is an enhancement we can think about later as long as this works.
annotate_outputs.sh $exon_boundary &> ${prefix}.log
mkdir -p bed12

parallel -j $task.cpus -a circs.bed annotate_outputs.sh $exon_boundary {}
Since order doesn't matter, you could use `parallel -u` to speed up output. I went with a split > parallel > pool approach in my fork and feel like the implementation in this commit can be optimized, but as long as this works for now we can worry about that later.
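For reference, a minimal sketch of what a split > parallel > pool layout could look like. This is not the code from either fork: `annotate_chunk` is a hypothetical stand-in for the real annotation script, the input records are made up, and plain background jobs with `wait` are used in place of GNU parallel so the sketch runs with coreutils alone.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the real per-record annotation:
# it only appends a tag to every line of the chunk it is given.
annotate_chunk() {
    local chunk="$1"
    while IFS= read -r line; do
        printf '%s\tannotated\n' "$line"
    done < "$chunk" > "${chunk}.out"
}

workdir="$(mktemp -d)"

# Toy input: 10 BED-like records with made-up coordinates.
for i in $(seq 1 10); do
    printf 'chr1\t%d\t%d\tcirc%d\n' "$((i * 100))" "$((i * 100 + 50))" "$i"
done > "$workdir/circs.bed"

# 1. Split: cut the input into chunks of 3 lines each.
split -l 3 "$workdir/circs.bed" "$workdir/chunk_"

# 2. Parallel: annotate each chunk in its own background job, then wait.
#    The glob is expanded once, before any .out file exists.
for chunk in "$workdir"/chunk_*; do
    annotate_chunk "$chunk" &
done
wait

# 3. Pool: concatenate the per-chunk results into a single file.
cat "$workdir"/chunk_*.out > "$workdir/annotated.bed"
```

Note that this launches one job per chunk with no upper bound; in the pipeline, the number of concurrent jobs would need to be capped (for example via `parallel -j $task.cpus`, as in the diff above).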
In answer to the comments from #85: I switched to annotating only the combined circRNAs, because previously the annotation was run for each detection tool for each sample. This led to a large number of long-running tasks whose inputs most likely overlapped heavily. I understand that it might be interesting to have tool-specific annotations, but I think this should then be approached like this:
The core problem here is that, even with the parallelized annotation, the annotation can take hours per task. Notes on a potentially more elegant approach:
Yeah, the annotation bottleneck is a major problem right now. Even if we intersect everything together, we'll need to apply the same logic to each group, correct? There may be a more efficient implementation of this, I can try some things. Do we want to work off a consistent intersection output? Do you have one handy that is small enough to share?
I sent you a minimal example based on the test configuration via Slack. Currently the same bash script is applied to each batch (step 3 of the process):
I still believe this could be a working approach:
Note that the
#95 fixes this better.
The annotation currently works by going through a potentially very large file containing circRNA locations and investigating each of them separately. This change introduces parallelization to this process.