bwa-mem configuration for HiC #118
Comments
@tkchafin, talk to Yumi and Ksenia. The question has come up multiple times in the pipelines meeting between them and Priyanka. Every time, the conclusion was "it's actually all fine", but please let's confirm again and record the reason as a comment somewhere in the code :)
Ok, will put together some comparisons of read numbers and request a code review from treeval folks if it doesn't come out making sense
The Treeval bwa-mem2 configuration is specifically designed for the cram_filter module. We strictly keep only primary reads and pass them to bwa-mem2. The -5SPCp flags specify that the expected input is in interleaved paired-end format (e.g., read1, read2, read1, read2) and ensure that no reads are ignored. These settings are crucial for ensuring that each I/O operation functions correctly. We can discuss this further during the pipeline meeting if needed. @reichan1998 I don't understand why you set interleave=False for HiC mapping, as this will produce two FASTQ files, and when you use bwa-mem2 with -p, you will certainly miss half of the reads. Could you explain what you are trying to do and provide more information?
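For readers unfamiliar with the interleaved layout mentioned above, here is a minimal Python sketch of the ordering. The function name and the toy record names are hypothetical illustrations, not code from Treeval or readmapping.

```python
from itertools import chain

def interleave(read1, read2):
    """Return records ordered read1, read2, read1, read2, ... (illustrative helper)."""
    if len(read1) != len(read2):
        raise ValueError("read1 and read2 must contain the same number of records")
    # zip pairs the i-th record of each list; chain flattens the pairs in order
    return list(chain.from_iterable(zip(read1, read2)))

# Toy record names standing in for FASTQ entries (hypothetical)
r1 = ["frag1/1", "frag2/1"]
r2 = ["frag1/2", "frag2/2"]
interleaved = interleave(r1, r2)
print(interleaved)
```

In an interleaved file, each pair sits on consecutive records, which is the layout that -p tells the aligner to expect in a single input file.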
I have checked the pipelines: both ALIGN_ILLUMINA and ALIGN_HIC work if the input is a CRAM file. If you use FASTQ input, ALIGN_ILLUMINA expects two FASTQ files (read1 and read2), while ALIGN_HIC expects a single interleaved FASTQ file. @muffato Do you think we should make this clearer and more consistent across the two subworkflows?
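One way to check which of the two layouts a FASTQ input has is to compare consecutive read names. The sketch below is only a heuristic under stated assumptions: the helper names are hypothetical, and it assumes mates share a read name apart from an optional trailing /1 or /2 and any comment after whitespace.

```python
def base_name(header):
    """Strip a leading '@', any comment after whitespace, and a trailing /1 or /2."""
    text = header[1:] if header.startswith("@") else header
    parts = text.split()
    name = parts[0] if parts else ""
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name

def looks_interleaved(headers):
    """True if the headers come in consecutive pairs that share a base name."""
    if len(headers) == 0 or len(headers) % 2 != 0:
        return False
    return all(
        base_name(headers[i]) == base_name(headers[i + 1])
        for i in range(0, len(headers), 2)
    )

# Interleaved: mates are adjacent. Split read1-only file: they are not.
print(looks_interleaved(["@frag1/1", "@frag1/2", "@frag2/1", "@frag2/2"]))
print(looks_interleaved(["@frag1/1", "@frag2/1"]))
```

Checking only the first few header lines of each input would be enough for a quick sanity check before choosing the subworkflow.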
Yes Treeval does the IO with
which isn't creating interleaved output: https://github.com/sanger-tol/readmapping/blob/main/modules/nf-core/samtools/fastq/main.nf However, in that case we are still calling bwa-mem with
So this makes a call like this in the case of HiC CRAM input:
This was discovered by Chau. Note we are passing read1 and read2 but also using
Two options: making -p a flexible argument could also be good in this case.
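A minimal sketch of what a flexible -p could look like, written in Python for illustration only. The helper name is hypothetical and this is not the pipeline's actual configuration; the flags are the ones quoted in this thread (-5SPCp), and the reference index and other arguments are deliberately left out.

```python
def bwa_mem2_hic_args(fastqs, interleaved):
    """Build the flag and input list for a HiC mapping call (hypothetical helper).

    -5SP and -C come from the flags quoted in this thread; -p is added only
    when the input is a single interleaved FASTQ file.
    """
    if interleaved and len(fastqs) != 1:
        raise ValueError("interleaved input must be a single FASTQ file")
    if not interleaved and len(fastqs) != 2:
        raise ValueError("split input must be exactly two FASTQ files (read1, read2)")
    flags = ["-5SP", "-C"]
    if interleaved:
        flags.append("-p")
    return flags + list(fastqs)

# File names below are placeholders
print(bwa_mem2_hic_args(["reads.fq"], interleaved=True))
print(bwa_mem2_hic_args(["reads_1.fq", "reads_2.fq"], interleaved=False))
```

Rejecting the combination of two files plus -p up front would turn the silent halving of reads into an explicit error.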
Yes, I just wanted a second pair of eyes to confirm that this was wrong and that I wasn't overlooking something. It looks like it was introduced in 1.1.0 with interleave=False: https://github.com/sanger-tol/readmapping/blob/1.1.0/subworkflows/local/align_short.nf The issue would only apply to HiC CRAM inputs.
Did a quick verification. With 1.2.2, the mapped output for the HiC CRAM input is:
With the latest PR we get:
Both with bwa-mem2. So I will mark this as resolved. Thanks @reichan1998 for finding this issue!
Description of the bug
@reichan1998 and I came across this while porting bwa-mem for Illumina into the map-reduce setup from treeval. I don't understand why we have bwa-mem for HiC configured in this way, but we first convert CRAM to FASTQ, with interleave=False so we are splitting read 1 and read 2 sets. These are then both passed to bwa-mem, but in the case of HiC we are using the -p flag:
According to the docs, this means we ignore the _R2 reads and treat the _R1 file as interleaved:
Read 2 is non-empty for HiC runs after the CRAM->FASTQ conversion, so would be discarded when running with -p (?).
In Treeval, SAMTOOLS_FASTQ is run creating interleaved output, then passed explicitly as interleaved (-p) to BWA-MEM2:
This results in an output CRAM with half the number of reads. It has been there since at least v1.0.0, and I haven't worked much with HiC data, so I wanted to check if there was a reason why we do it this way, or if I am misinterpreting the flags?
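The read-count arithmetic behind "half the number of reads" can be sketched with a toy model in Python. It is not a reimplementation of bwa-mem; the function name is hypothetical, and it simply encodes the behavior described above, namely that with -p only the first file is read and a second file is ignored.

```python
def reads_consumed(n_read1, n_read2, two_files_given, p_flag):
    """Toy model: number of input reads the aligner would consume.

    Assumption (from the issue text): with -p the first file is treated as
    interleaved and any second file is ignored.
    """
    if p_flag or not two_files_given:
        return n_read1
    return n_read1 + n_read2

# Split FASTQ files with 1000 records each (made-up numbers)
with_p = reads_consumed(1000, 1000, two_files_given=True, p_flag=True)
without_p = reads_consumed(1000, 1000, two_files_given=True, p_flag=False)
print(with_p, without_p)
```

Under this model, passing split read1 and read2 files together with -p consumes only the read1 records, which is half of the total.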
@muffato
Command used and terminal output
No response
Relevant files
No response
System information
No response