Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bwa-mem configuration for HiC #118

Closed
tkchafin opened this issue Sep 12, 2024 · 8 comments · Fixed by #113
Closed

bwa-mem configuration for HiC #118

tkchafin opened this issue Sep 12, 2024 · 8 comments · Fixed by #113
Labels
bug Something isn't working

Comments

@tkchafin
Copy link
Contributor

tkchafin commented Sep 12, 2024

Description of the bug

@reichan1998 and I came across this while porting bwa-mem for Illumina into the map-reduce setup from treeval. I don't understand why we have bwa-mem for HiC configured in this way, but we first convert CRAM to FASTQ, with interleave=False so we are splitting read 1 and read 2 sets. These are then both passed to bwa-mem, but in the case of HiC we are using the -p flag:

    withName: '.*:.*:ALIGN_HIC:BWAMEM2_MEM' {
        ext.args = { "-5SPCp -R ${meta.read_group}" }
    }

    withName: '.*:.*:ALIGN_ILLUMINA:BWAMEM2_MEM' {
        ext.args = { "-R ${meta.read_group}" }
    }

According to the docs, this means we ignore the _R2 reads and treat the _R1 file as interleaved:

   -p            smart pairing (ignoring in2.fq)

Read 2 is non-empty for HiC runs after the CRAM->FASTQ conversion, so would be discarded when running with -p (?).

In Treeval, SAMTOOLS_FASTQ is run creating interleaved output, then passed explicitly as interleaved (-p) to BWA-MEM2:

    withName: ".*:.*:HIC_BWAMEM2:CRAM_FILTER_ALIGN_BWAMEM2_FIXMATE_SORT" {
        ext.args        = ""
        ext.args1       = "-F0xB00 -nt"      # samtools fastq args
        ext.args2       = { "-5SPCp -H'${rglines}'" }      # bwa-mem2 args
        ext.args3       = "-mpu"  
        ext.args4       = { "--write-index -l1" }
    }

This results in an output CRAM with half the number of reads. It has been there since at least v1.0.0, and I haven't worked much with HiC data, so I wanted to check if there was a reason why we do it this way, or if I am misinterpreting the flags?

@muffato

Command used and terminal output

No response

Relevant files

No response

System information

No response

@tkchafin tkchafin added the bug Something isn't working label Sep 12, 2024
@muffato
Copy link
Member

muffato commented Sep 12, 2024

@tkchafin . Talk to Yumi and Ksenia. The question has come up multiple times in the pipelines meeting between them and Priyanka. Every time, the conclusion was "it's actually all fine", but please let's confirm again and record the reason as a comment somewhere in the code :)

@tkchafin
Copy link
Contributor Author

tkchafin commented Sep 12, 2024 via email

@yumisims
Copy link

yumisims commented Sep 13, 2024

The Treeval bwa-mem2 configuration is specifically designed for the cram_filter module. We strictly keep only primary reads and pass them to bwa-mem2. The -5SPCp flags specify that the expected input reads are in interleaved paired end format and ensure that no reads are ignored (e.g., read1, read2, read1, read2). These settings are crucial for ensuring that each I/O operation functions correctly. We can discuss this further during the pipeline meeting if needed. @reichan1998 I don't understand why you set interleave=False for hic mapping as this will produce two fastq files, and when you using bwa-mem2 with -p, you will miss half of the read for sure, could you explain what you are trying to do please, and could you provide more information.

@yumisims
Copy link

I have checked the pipelines, both ALIGN_ILLUMINA and ALIGN_HIC would work if your input is cram file. And If you use fastq file as input then ALIGN_ILLUMINA expects two fastq files (read1 and read2) while ALIGN_HIC expects a single interleaved fastq input.

@muffato Do you think we should make this a bit more clear and more consistent for the two subworkflows.

@tkchafin
Copy link
Contributor Author

Yes Treeval does the IO with -p correctly. In readmapping, we convert the CRAM to FASTQ first using the separate module:

    // Convert from CRAM to FASTQ only if CRAM files were provided as input
    SAMTOOLS_FASTQ ( ch_reads.cram, false )
    ch_versions = ch_versions.mix ( SAMTOOLS_FASTQ.out.versions.first() )
    
    
    SAMTOOLS_FASTQ.out.fastq
    | mix ( ch_reads.fastq )
    | set { ch_reads_fastq }


    // Align Fastq to Genome and output sorted BAM
    BWAMEM2_MEM ( ch_reads_fastq, index, fasta, true )
    ch_versions = ch_versions.mix ( BWAMEM2_MEM.out.versions.first() )

which isn't creating interleaved output: https://github.com/sanger-tol/readmapping/blob/main/modules/nf-core/samtools/fastq/main.nf

However, in that case we are still calling bwa-mem with -p (only for HiC):

    withName: '.*:.*:ALIGN_HIC:BWAMEM2_MEM' {
        ext.args = { "-5SPCp -R ${meta.read_group}" }
    }

So this makes a call like this in the case of HiC CRAM input:

bwa-mem2 \
    mem \
    -5SPCp -R '@RG\tID:35528_2%231\tPL:ILLUMINA\tSM:mMelMel3' \
    -t 2 \
    $INDEX \
    mMelMel3_T1_1.fastq.gz mMelMel3_T1_2.fastq.gz\
    | samtools sort  -@ 2  -o mMelMel3_T1.bam -

This wa discovered by Chau. Note we are passing read1 and read2 but also using -p. For Illumina CRAM inputs we handle it correctly because we don't set -p, but with HiC CRAM inputs I think we must be dumping the read2 files (only in readmapping, not in treeval)

@yumisims
Copy link

two options:
(1) remove the -p from ALIGN_HIC.
(2) add -p to ALIGN_ILLUMINA and set interleave as true.

Making -p as flexible argument can be also good in this case.

@tkchafin
Copy link
Contributor Author

Yes I just wanted to have a second pair of eyes confirm that this was wrong and I wasn't overlooking something. It looks it would have been introduced in 1.1.0:

interleave=False: https://github.com/sanger-tol/readmapping/blob/1.1.0/subworkflows/local/align_short.nf
bwa-mem with -p: https://github.com/sanger-tol/readmapping/blob/1.1.0/conf/modules.config

The issue would only apply to HiC CRAM inputs

@tkchafin tkchafin linked a pull request Sep 16, 2024 that will close this issue
9 tasks
@tkchafin tkchafin moved this to Todo in Genome After Party Sep 17, 2024
@tkchafin tkchafin moved this to To do in readmapping Sep 17, 2024
@tkchafin
Copy link
Contributor Author

Did a quick verification, with 1.2.2 the CRAM hic input mapped is:

(nf-core) tc25@mib119777s hic % samtools view -c GCA_922984935.2.subset.unmasked.hic.mMelMel3.cram
10125

With the latest PR we get:

(nf-core) tc25@mib119777s hic % samtools view -c GCA_922984935.2.subset.unmasked.hic.mMelMel3.cram
20331

Both with bwa-mem2. So I will mark this as resolved. Thanks @reichan1998 for finding this issue!

@github-project-automation github-project-automation bot moved this from Todo to Done in Genome After Party Sep 17, 2024
@github-project-automation github-project-automation bot moved this from To do to Done in readmapping Sep 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Done
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants