Mouse PTA ReadMe

Paired Tumor Analysis (PTA) Documentation

Paired Tumor Analysis Pipeline (--workflow pta, --gen_org mouse)

For all input samples:

•   Fastp read quality and adapter trimming   
•   Get Read Group Information   
•   BWA-MEM Alignment   
•   Picard SortSam and Mark Duplicates    
•   Collect Alignment Summary Metrics

If paired sample:

•   Germline variant calling  
•   Germline variant filtering   
•   Germline variant annotation

For all samples (the C57L_J proxy normal is used if a sample is unpaired):

•	GATK Mutect2 variant calling  
•	Filter Mutect2 calls  
•	Lancet variant calling  
•	Svaba SV calling   
•	Manta SV calling   
•	Strelka2 SNV and indel calling
•	Lumpy SV calling   
•	Delly SV calling   
•	Delly CNV calling   
•	VCF Merge prep steps. A number of scripts are applied to ensure samples within VCF files are properly oriented prior to merging, and calls are properly formatted   
•	Intergenic variant rescue and confirmation via Lancet. A number of scripts are applied to determine if intergenic variants are supported by the caller Lancet   
•	SNV VCF files are merged across all SNV callers  
•	VEP annotation of merged SNVs  
•	Delly CNV calls are annotated
•	SV calls are merged across callers  
•	SV calls are annotated with known insertions, deletions, inversions, and exclusion regions. Annotation is done at 80% overlap between the called SV event and the known event size.
•	SV calls are annotated with CNV regions  
•	A final filter is applied to CNV annotated SV calls  
•	MultiQC report generation
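
The 80% overlap rule used when annotating SV calls against known events can be read in more than one way. The sketch below assumes a reciprocal test, in which the shared span must cover at least 80% of both the called event and the known event; the pipeline's annotation scripts may apply the threshold differently. The function names and the (chrom, start, end) tuple layout are illustrative and not part of the pipeline.

```python
def overlap_fraction(a_start, a_end, b_start, b_end):
    """Return the fraction of interval A that is covered by interval B.

    Coordinates are treated as half-open [start, end).
    """
    overlap = max(0, min(a_end, b_end) - max(a_start, b_start))
    length = a_end - a_start
    return overlap / length if length > 0 else 0.0


def matches_known_event(call, known, min_fraction=0.8):
    """Reciprocal-overlap test (an assumed reading of the 80% rule).

    `call` and `known` are (chrom, start, end) tuples.
    """
    if call[0] != known[0]:
        return False
    call_covered = overlap_fraction(call[1], call[2], known[1], known[2])
    known_covered = overlap_fraction(known[1], known[2], call[1], call[2])
    return call_covered >= min_fraction and known_covered >= min_fraction


if __name__ == "__main__":
    called_deletion = ("chr1", 1000, 2000)   # hypothetical called SV
    known_deletion = ("chr1", 1100, 2050)    # hypothetical known event
    print(matches_known_event(called_deletion, known_deletion))
```

A one-sided test (only the called event must be 80% covered) would drop the `known_covered` condition; which variant matches the pipeline should be confirmed against its annotation scripts.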

flowchart TD
    p00((CSV Sample Sheet))
    p01[PARSE_SAMPLE_SHEET:CONCATENATE_READS]
    p00 --> p01
    p01 --> |Tumor Sample| t02
    p01 --> |"`Normal Sample:
               **If Normal Sample 
               Provided**`"| n02


    subgraph tumor [  ]
        t02[FASTP]
        t03[BWA_MEM]
        t04[PICARD_SORTSAM]
        t05[PICARD_MARKDUPLICATES]
        %% NOTE: BaseRecalibrator and BQSR requires known sites to recal around.

        to1([Tumor Genomic Bam]):::output
        t02 --> t03
        t03 --> t04
        t04 --> t05
        t05 --> to1
        
    end

    subgraph normal [  ]
        n02[FASTP]
        n03[BWA_MEM]
        n04[PICARD_SORTSAM]
        n05[PICARD_MARKDUPLICATES]
        %% REMOVE DUPE
        
        %% Indel realigner from GATK

        %% NOTE: BaseRecalibrator and BQSR requires known sites to recal around.
        no1([Normal Genomic Bam]):::output

        n02 --> n03
        n03 --> n04
        n04 --> n05
        n05 --> no1
    end


    no1 -..-> |If Normal\nSample Provided| m1
    altBAM -..-> |If No Normal\nSample Provided| m1
    to1 --> m1
    altBAM[(ALT\nBAM)]
    m1 --> p35
    m1 --> p39
    m1 --> p41 
    m1 --> p42
    m1 --> p43 
    m1 --> p44
    m1 --> p54
    m1 --> p57.1
    m1((Join:\nTumor & Normal))


    subgraph germline [  ]
        no1 -..-> |If Normal\nSample Provided|p20
        p20[GATK_HAPLOTYPECALLER_SV_GERMLINE]
        p21[GATK_SORTVCF_GERMLINE]
        p27[BCFTOOLS_GERMLINE_FILTER]
        %% p28[BCFTOOLS_SPLITMULTIALLELIC_REGIONS]
        %% p29[VEP_GERMLINE]
        %% p30[BCFTOOLS_REMOVESPANNING]
        %% p33[SNPSIFT_ANNOTATE_DBSNP_GERMLINE]
        %% p34[GERMLINE_VCF_FINALIZATION]
        annot_summary1{{Germline Annotation via VEP\nSteps truncated for figure clarity}}

        o3([Germline Variants]):::output
        o4([Annotated Germline Variants]):::output

        p20 --> p21
        p21 --> p27
        p27 --> o3
        o3 --> annot_summary1
        subgraph germline_annot [  ]
            %% p27 --> p28
            %% p28 --> p29
            %% p29 --> p30
            %% p30 --> p31
            %% p31 --> p32 
            %% p32 --> p33 
            %% p33 --> p34
            %% p34 --> o4
            annot_summary1 --> o4
        end
    end



    subgraph somatic_variant [  ]

        p35[GATK_MUTECT2]
        p36[GATK_SORTVCF_MUTECT]
        p37[GATK_MERGEMUTECTSTATS]
        p38[GATK_FILTERMUECTCALLS]
        o5([Mutect2 SNV Calls]):::output

        p44[LANCET]
        p45[GATK_SORTVCF_LANCET]
        oo6([Lancet SNV Calls]):::output


        p39[DELLY_SOMATIC]
        p40[DELLY_FILTER_SOMATIC]

        o6([DELLY SV Calls]):::output

        p41[MANTA]
        o7([Manta SV and SNV Calls]):::output

        p42[STRELKA2]
        o8([Strelka2 SNV Calls]):::output

        p43[SMOOVE]
        o9([SMOOVE_Lumpy SV Calls]):::output

        p54[DELLY_CNV_SOMATIC]

        p55[BCFTOOLS_MERGE_DELLY_CNV]
        p56[DELLY_CLASSIFY]
        p57[BCFTOOLS_QUERY_DELLY_CNV]
        o10([DELLY CNV Calls]):::output

        p57.1[SVABA]
        o10.1([SVABA SV and SNV Calls]):::output




        p35 --> p36
        p36 --> p37 
        p37 --> p38
        p38 --> o5
        
        p39 --> p40
        p40 --> o6 

        p41 --> o7

        o7 --> p42
        p42 --> o8

        p43 --> o9

        p44 --> p45
        p45 --> oo6

        p54 --> p55
        p55 --> p56
        p56 --> p57
        p57 --> o10

        p57.1 --> o10.1


        note1{{Merge prep.\nSteps truncated}}
        note2{{Merge prep.\nSteps truncated}}
        note3{{Merge prep.\nSteps truncated}}
        note3.5{{Merge prep.\nSteps truncated}}
        note3.6{{Merge prep.\nSteps truncated}}
        %% p57[RENAME_METADATA]
        %% p58[MERGE_PREP]
        %% p59[RENAME_VCF]
        %% p60[COMPRESS_INDEX_VCF]
        %% p61[BCFTOOLS_SPLITMULTIALLELIC]
        %% p62[SPLIT_MNV]
        %% p63[GATK_SORTVCF_TOOLS]

        o5 --> note1
        oo6 --> note2
        o8 --> note3
        o7 --> note3.5
        o10.1 --> note3.6

    end

    note1 --> p64
    note2 --> p64
    note3 --> p64
    note3.5 --> p64
    note3.6 --> p64
    p64[BCFTOOLS_MERGECALLERS]
    %% p65[COMPRESS_INDEX_VCF_ALL_CALLERS]
    p64 --> note4
    subgraph lancet_confirm [  ]

        %% NOTE: There are many lancet confirm steps. First prep, then confirm, then re-merge.  
        %% p66[BEDTOOLS_STARTCANDIDATES]
        %% p67[GET_CANDIDATES]
        %% p68[COMPRESS_INDEX_VCF_REGION]
        %% p69[VCF_TO_BED]
        %% p70[LANCET_CONFIRM]
        %% p71[COMPRESS_INDEX_VCF_REGION_LANCET]
        %% p72[BCFTOOLS_INTERSECTVCFS]
        %% p73[RENAME_METADATA_LANCET]
        %% p74[MERGE_PREP_LANCET]
        %% p75[RENAME_VCF_LANCET]
        %% p76[COMPRESS_INDEX_VCF_LANCET]
        %% p77[BCFTOOLS_SPLITMULTIALLELIC_LANCET]
        %% p78[SPLIT_MNV_LANCET]
        %% p79[REMOVE_CONTIG]
        %% p80[GATK_SORTVCF_TOOLS_LANCET]
        %% p81[BCFTOOLS_MERGECALLERS_FINAL]
        %% p82[COMPRESS_INDEX_VCF_MERGED]
        %% p83[MERGE_COLUMNS]
        %% p84[ADD_NYGC_ALLELE_COUNTS]
        %% p85[ADD_FINAL_ALLELE_COUNTS]
        %% p86[FILTER_PON]
        %% p87[FILTER_VCF]
        %% p88[SNV_TO_MNV_FINAL_FILTER]
        %% p89[GATK_SORTVCF_SOMATIC]
        %% p90[REORDER_VCF_COLUMNS]
        %% p91[COMPRESS_INDEX_MERGED_VCF]
        note0[[The steps in this subgraph are truncated for clarity\nManta used as 'support' calls]]
        note4{{Extract non exonic variant calls}}
        note5{{Confirm non exonic variant calls with Lancet}}
        note6{{Merge Lancet confirmed to exonic calls}}
       
        note4 --> note5
        note5 --> note6
        %% note6 --> note7
    end

    subgraph snv_annotate [  ]
        p92[VEP_SOMATIC]
        p95[SNPSIFT_ANNOTATE_DBSNP_SOMATIC]
        p96[SOMATIC_VCF_FINALIZATION]
        o11([Annotated filtered\nSomatic SNV and InDELs Calls]):::output

        %% p105[:FILTER_BEDPE]
        %% p106[:FILTER_BEDPE_SUPPLEMENTAL]
        note6 --> p92
        p92 --> p95
        p95 --> p96
        p96 --> o11
        %% note2{{}}
        
    end

    subgraph cnv_sv_annotate [  ]
        o10 --> p97
        p97[ANNOTATE_DELLY_CNV]
        o12([Annotated CNV Regions]):::output
        o7 --> p98 
        o9 --> p98 
        o6 --> p98
        o10.1 --> p98
        p98[MERGE_SV]
        p99[ANNOTATE_SV]
        p100[ANNOTATE_SV_SUPPLEMENTAL]
        p101[ANNOTATE_GENES_SV]
        p102[ANNOTATE_GENES_SV_SUPPLEMENTAL]

        p103[ANNOTATE_SV_WITH_CNV]
        p104[ANNOTATE_SV_WITH_CNV_SUPPLEMENTAL]
        o13([Annotated SV Calls]):::output
        o14([Annotated SV]):::output

        p97 --> o12

        p98 --> p99
        p99 --> p100
        p100 --> p101
        p101 --> p102
        p102 --> o13

        p102 --> p103
        p97 --> p103
        p103 --> p104
        p104 --> o14
    end

    o11 ~~~ note42

    subgraph qc [  ]
        temp0((Fastq Files\nFrom Above))
        temp1((Genomic BAMs\nFrom Above))
        temp2((Logs from:\nBWA\nTrimming\nMark Duplicates))
        p03[FASTQC]
        p13[PICARD_COLLECTALIGNMENTSUMMARYMETRICS]
        p14[PICARD_COLLECTWGSMETRICS]
        p142[MULTIQC]
        o15([MultiQC Report]):::output
        note42[[For clarity\nQC steps not connected to main graph]]
        temp0 --> p03
        temp1 --> p13
        temp1 --> p14
        temp2 --> p142
        p03 --> p142
        p13 --> p142
        p14 --> p142
        p142 --> o15
        %% n02 --> p03
        %% t02 --> p03
        %% to1 --> p13
        %% to1 --> p14
        %% no1 --> p13
        %% no1 --> p14
    end


classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000

style tumor stroke:#333,stroke-width:2px
style normal stroke:#333,stroke-width:2px
style germline stroke:#333,stroke-width:2px
style somatic_variant stroke:#333,stroke-width:2px
style lancet_confirm stroke:#333,stroke-width:2px
style snv_annotate stroke:#333,stroke-width:2px
style cnv_sv_annotate stroke:#333,stroke-width:2px

style qc stroke:#333,stroke-width:2px

Parameters for PTA Pipeline

--pubdir
- Default: /<PATH>
- Comment: The directory where saved outputs will be stored.
--organize_by
- Default: sample
- Comment: How to organize the output folder structure. Options: sample or analysis.
--cacheDir
- Default: /projects/omics_share/meta/containers
- Comment: The directory that contains cached Singularity containers. JAX users should not change this parameter.
-w
- Default: /<PATH>
- Comment: The directory that all intermediary files and nextflow processes utilize. This directory can become quite large. This should be a location on /fastscratch or other directory with ample storage.
--csv_input
- Default: /<FILE_PATH>
- Comment: Comma-delimited (CSV) sample sheet that controls how samples are processed. The required input header is: patient,sex,status,sampleID,lane,fastq_1,fastq_2. See the CSV Input Sample Sheet section below for additional information on the file format.
--deduplicate_reads
- Default: false
- Comment: Options: false, true. If specified, run bbmap clumpify on input reads. Clumpify will deduplicate reads prior to trimming. This can help with mapping and downstream steps when analyzing high coverage WGS data.
--coverage_cap
- Default: null
- Comment: If an integer value is specified, jvarkit 'Biostar154220' is used to cap coverage at that value. See: http://lindenb.github.io/jvarkit/Biostar154220.html
--primary_chrom_bed
- Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/Mus_musculus.GRCm39.dna.primary_assembly.bed'
- Comment: A BED file containing the primary chromosomes with positions. Used to limit jvarkit 'Biostar154220' to regions with expected coverage.
--split_fastq
- Default: false
- Comment: If specified, FASTQ files will be split into chunks sized based on split_fastq_bin_size prior to mapping. This option is recommended for high coverage data.
--split_fastq_bin_size
- Default: 10000000
- Comment: If split_fastq is specified, FASTQ files will be split into chunks of this size prior to mapping.
--quality_phred
- Default: 15
- Comment: The quality value that is required for a base to pass. Default: 15 which is a phred quality score of >=Q15.
--unqualified_perc
- Default: 40
- Comment: Percent of bases that are allowed to be unqualified (0~100). Default: 40 which is 40%.
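
How --quality_phred and --unqualified_perc interact can be sketched as follows. This approximates the filtering rule described in the two entries above and is not fastp's own code; it assumes Phred+33 encoded quality strings, and the function name is hypothetical.

```python
def passes_quality_filter(quality_string, quality_phred=15,
                          unqualified_perc=40, phred_offset=33):
    """Return True if a read survives the base-quality filter.

    A base is 'unqualified' when its Phred score is below quality_phred.
    The read is kept when the percentage of unqualified bases does not
    exceed unqualified_perc. Phred+33 encoding is an assumption here.
    """
    if not quality_string:
        return False
    scores = [ord(ch) - phred_offset for ch in quality_string]
    unqualified = sum(1 for score in scores if score < quality_phred)
    return 100.0 * unqualified / len(scores) <= unqualified_perc


if __name__ == "__main__":
    # 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
    for quals in ("IIIII", "III##", "II###"):
        print(quals, passes_quality_filter(quals))
```

With the defaults, a read in which exactly 40% of bases fall below Q15 is still kept; raising --quality_phred or lowering --unqualified_perc makes the filter stricter.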
--detect_adapter_for_pe
- Default: false
- Comment: If true, adapter auto-detection is used for paired-end data. By default, adapter sequence auto-detection is disabled for paired-end data because the adapters can be trimmed by overlap analysis; --detect_adapter_for_pe enables it. Fastp runs a little slower when adapter sequences are specified or auto-detection is enabled, but the output is usually slightly cleaner, since overlap analysis may fail due to sequencing errors or adapter dimers.
--ref_fa
- Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/ensembl/GRCm39.p0/Mus_musculus.GRCm39.dna.primary_assembly.fa'
- Comment: The reference FASTA used throughout the process for alignment and for any downstream analysis. Points to the human reference when --gen_org human. JAX users should not change this parameter.
--ref_fa_indices
- Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/indices/ensembl/v105/bwa/Mus_musculus.GRCm39.dna.primary_assembly.fa'
- Comment: Pre-compiled BWA index files. JAX users should not change this parameter.
--ref_fa_dict
- Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/ensembl/GRCm39.p0/Mus_musculus.GRCm39.dna.primary_assembly.dict'
- Comment: FASTA dictionary file. JAX users should not change this parameter.
--combined_reference_set
- Default: '/projects/compsci/omics_share/mouse/GRCm39/supporting_files/PTA_inputs/combined_ref_set/Mus_musculus.GRCm39.dna.primary_assembly.fa'
- Comment: Several tools (GRIDSS, SVABA) require the reference and BWA index files to be in the same directory. Links are used within this directory to avoid duplicating the FASTA and BWA indices. See the note in the directory.
--mismatch_penalty
- Default: -B 8
- Comment: The BWA penalty for a mismatch.
--dbSNP
- Default: '/projects/omics_share/mouse/GRCm39/genome/annotation/snps_indels/GCA_000001635.9_current_ids.vcf.g'
- Comment: Used in variant annotation and by SVABA. JAX users should not change this parameter.
--dbSNP_index
- Default: '/projects/omics_share/mouse/GRCm39/genome/annotation/snps_indels/GCA_000001635.9_current_ids.vcf.gz.tbi'
- Comment: Index associated with the dbsnp file.
--chrom_contigs
- Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/ensembl/GRCm39.p0/Mus_musculus.GRCm39.dna.primary_assembly.primaryChr.contig_list'
- Comment: Contig list used for scatter / gather in calling and annotation.
--chrom_intervals
- Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/GRCm39_calling_intervals/'
- Comment: Chromosome intervals used for scatter gather in calling.
--call_val
- Default: 50
- Comment: The minimum phred-scaled confidence threshold at which variants should be called.
--ploidy_val
- Default: '-ploidy 2'
- Comment: Sample ploidy used by HaplotypeCaller in germline small variant calling.
--excludeIntervalList
- Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/mm39.excluderanges.interval_list'
- Comment: Germline caller exclusion list.
--intervalListBed
- Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/filtering/SureSelect_V6plusCOSMIC.target.GRCh38_full_analysis_set_plus_decoy_hla.interval_list.bed'
- Comment: This file is used to extract small variants in non-exonic regions. The pipeline then attempts to recover such calls via Lancet.
--lancet_beds_directory
- Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/lancet_chr_beds/'
- Comment: Lancet interval bed files used in calling by that tool.
--delly_exclusion
- Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/GRCm39_gap_delly_exclusion.txt'
- Comment: Delly CNV calling exclusion list.
--delly_mappability
- Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/mappability/GRCm39.p0.map.gz'
- Comment: Delly CNV calling mappability file.
--cnv_window
- Default: 10000
- Comment: Delly CNV calling read depth window size. Default value is tool default. This parameter is included for testing purposes only.
--cnv_min_size
- Default: 10000
- Comment: Delly CNV classification minimum CNV size. Default value is tool default. This parameter is included for testing purposes only.
--cnv_germline_prob
- Default: 0.00100000005
- Comment: Delly CNV classification germline probability. Default value is tool default. This parameter is included for testing purposes only.
--callRegions
- Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/GRCm39.callregions.bed.gz'
- Comment: Manta calling regions. Provided by the tool developer resource pack.
--strelka_config
- Default: '/projects/compsci/omics_share/mouse/GRCm39/supporting_files/PTA_inputs/configs/configureStrelkaSomaticWorkflow.py.ini'
- Comment: Strelka input configuration. Provided by the tool developer resource pack.
--vep_cache_directory
- Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/vep_data'
- Comment: VEP annotation cache. Cache provided is for Ensembl v109.
--vep_fasta
- Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/ensembl/GRCm39.p0/Mus_musculus.GRCm39.dna.primary_assembly.fa'
- Comment: VEP requires an ensembl based fasta. GRCh38.p13 is used for v97-v109.
--cytoband
- Default: '/projects/compsci/omics_share/mouse/GRCm39/supporting_files/PTA_inputs/annotations/GRCm38.liftedTo.GRCm39.cytoBand.UCSC.chr.sorted.bed'
- Comment: Cytoband file used in CNV annotations. Derived from UCSC table, lifted from GRCm38 to GRCm39.
--known_del
- Default: '/projects/omics_share/mouse/GRCm39/genome/annotation/struct_vars/ferraj_2023_inv_ins_del/variants_freeze5_sv_sym_DEL_mm39_sorted.bed'
- Comment: Used in SV annotation and filtering. Deletion calls from: https://pubmed.ncbi.nlm.nih.gov/37228752/
--known_ins
- Default: '/projects/omics_share/mouse/GRCm39/genome/annotation/struct_vars/ferraj_2023_inv_ins_del/variants_freeze5_sv_sym_INS_mm39_sorted.bed'
- Comment: Used in SV annotation and filtering. Insertion calls from: https://pubmed.ncbi.nlm.nih.gov/37228752/
--known_inv
- Default: '/projects/omics_share/mouse/GRCm39/genome/annotation/struct_vars/ferraj_2023_inv_ins_del/variants_freeze5_sv_sym_INV_mm39_sorted.bed'
- Comment: Used in SV annotation and filtering. Inversion calls from: https://pubmed.ncbi.nlm.nih.gov/37228752/
--ensemblUniqueBed
- Default: '/projects/compsci/omics_share/mouse/GRCm39/supporting_files/PTA_inputs/annotations/ensembl_genes_unique_sorted.final.v110.chr.sorted.bed'
- Comment: File used in CNV and SV annotation.
--gap
- Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/GRCm39_gap.bed'
- Comment: File used in SV annotation. From UCSC table browser.
--exclude_list
- Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/mm39.excluderanges_cleaned.bed'
- Comment: File used in SV annotation. From: https://dozmorovlab.github.io/excluderanges/.
--proxy_normal_bam
- Default: '/projects/omics_share/mouse/GRCm39/supporting_files/PTA_inputs/C57L_J/C57L_J_dedup.bam'
- Comment: Proxy BAM file. Used in un-paired sample analysis. C57L_J at 30x is used by default.
--proxy_normal_bai
- Default: '/projects/omics_share/mouse/GRCm39/supporting_files/PTA_inputs/C57L_J/C57L_J_dedup.bam.bai'
- Comment: Proxy BAM index file. Used in un-paired sample analysis. C57L_J at 30x is used by default.
--proxy_normal_sampleName
- Default: 'C57L_J'
- Comment: Proxy sample name within the proxy BAM file. C57L_J used by default.
--read_type
- Default: PE
- Comment: Only 'PE' is accepted for this workflow.
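
A minimal sketch of assembling a run command from the parameters above. Only flags documented on this page are used; the pipeline entry point (main.nf) and all paths are placeholders, and any profile or scheduler options a site requires are omitted.

```python
import shlex


def format_value(value):
    """Render a parameter value for the command line (booleans as true/false)."""
    if isinstance(value, bool):
        return str(value).lower()
    return str(value)


def build_pta_command(pipeline, csv_input, pubdir, workdir, extra_params=None):
    """Build an argument list for a mouse PTA run.

    `pipeline` is the path to the workflow entry point (a placeholder here).
    `extra_params` maps parameter names from this page (without the leading
    dashes) to values, for example {"split_fastq": True}.
    """
    args = [
        "nextflow", "run", pipeline,
        "--workflow", "pta",
        "--gen_org", "mouse",
        "--csv_input", csv_input,
        "--pubdir", pubdir,
        "-w", workdir,
    ]
    for name, value in (extra_params or {}).items():
        args += [f"--{name}", format_value(value)]
    return args


if __name__ == "__main__":
    command = build_pta_command(
        pipeline="main.nf",                      # placeholder entry point
        csv_input="/path/to/sample_sheet.csv",   # placeholder path
        pubdir="/path/to/results",               # placeholder path
        workdir="/path/to/work",                 # placeholder path
        extra_params={"organize_by": "analysis", "split_fastq": True},
    )
    print(shlex.join(command))
```

Building the command as a list and quoting it with `shlex.join` avoids problems with spaces in paths when the command is logged or passed to a shell.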

CSV Input Sample Sheet

The required input header is: patient,sex,status,sampleID,lane,fastq_1,fastq_2. Samples can be provided either paired or un-paired.

The patient column defines how samples are paired. All combinations of normal and tumor samples that share the same patient ID will be paired.
The sex column is unused in the workflow at this time.
The status column defines whether each sample is normal (0) or tumor (1).
The sampleID column is a unique identifier for each individual sample, which is associated with other samples based on status and patient ID.
The lane column contains lane information for individual samples. If a single sample ID is provided with multiple lanes, the sequences from each lane will be concatenated prior to analysis.
The fastq_1 and fastq_2 columns must contain absolute paths to read 1 and read 2 from an Illumina paired-end sequencing run.
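
The column rules above can be checked before launching a run. The following sketch is not part of the pipeline: it verifies that the required columns are present, that status is 0 or 1, and then lists the tumor-normal pairs implied by the patient column. Function names, sample names, and file paths are all hypothetical.

```python
import csv
import io
from itertools import product

REQUIRED_COLUMNS = ["patient", "sex", "status", "sampleID", "lane", "fastq_1", "fastq_2"]


def read_sample_sheet(handle):
    """Parse a sample sheet and validate its header and status values."""
    reader = csv.DictReader(handle)
    missing = [col for col in REQUIRED_COLUMNS if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    rows = list(reader)
    for row in rows:
        if row["status"] not in {"0", "1"}:
            raise ValueError(f"status must be 0 or 1 for sample {row['sampleID']}")
    return rows


def pair_samples(rows):
    """Return (pairs, unpaired_tumors) grouped by patient ID.

    Every tumor/normal combination within a patient becomes a pair.
    Tumors whose patient has no normal are reported as unpaired.
    Multiple lanes of one sampleID collapse to a single sample.
    """
    normals, tumors = {}, {}
    for row in rows:
        group = tumors if row["status"] == "1" else normals
        group.setdefault(row["patient"], set()).add(row["sampleID"])
    pairs, unpaired = [], []
    for patient in sorted(tumors):
        patient_tumors = sorted(tumors[patient])
        patient_normals = sorted(normals.get(patient, set()))
        if patient_normals:
            pairs += [(patient, t, n) for t, n in product(patient_tumors, patient_normals)]
        else:
            unpaired += [(patient, t) for t in patient_tumors]
    return pairs, unpaired


if __name__ == "__main__":
    # Hypothetical sheet: one paired patient (tumor on two lanes), one tumor-only patient.
    sheet = io.StringIO(
        "patient,sex,status,sampleID,lane,fastq_1,fastq_2\n"
        "P1,XX,0,N1,L1,/data/N1_L1_R1.fastq.gz,/data/N1_L1_R2.fastq.gz\n"
        "P1,XX,1,T1,L1,/data/T1_L1_R1.fastq.gz,/data/T1_L1_R2.fastq.gz\n"
        "P1,XX,1,T1,L2,/data/T1_L2_R1.fastq.gz,/data/T1_L2_R2.fastq.gz\n"
        "P2,XY,1,T2,L1,/data/T2_L1_R1.fastq.gz,/data/T2_L1_R2.fastq.gz\n"
    )
    pairs, unpaired = pair_samples(read_sample_sheet(sheet))
    print("pairs:", pairs)
    print("unpaired tumors:", unpaired)
```

In this sketch, tumors without a matching normal are simply listed; in the workflow itself, unpaired samples are analyzed against the proxy normal BAM (C57L_J by default).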

Basic examples: