Mouse PTA ReadMe

MikeWLloyd edited this page Apr 23, 2024 · 14 revisions

Paired Tumor Analysis (PTA) Documentation

Paired Tumor Analysis Pipeline (--workflow pta, --gen_org mouse)

For all input samples:

•   Fastp read quality and adapter trimming   
•   Get Read Group Information   
•   BWA-MEM Alignment   
•   Picard SortSam and Mark Duplicates    
•   Collect Alignment Summary Metrics   

If paired sample:

•   Germline variant calling  
•   Germline variant filtering   
•   Germline variant annotation   

For all samples (C57L_J used if unpaired):

•	GATK Mutect2 variant calling  
•	Filter Mutect2 calls  
•	Lancet variant calling  
•	Svaba SV calling   
•	Manta SV calling   
•	Strelka2 SNV and InDEL calling
•	Lumpy SV calling   
•	Delly SV calling   
•	Delly CNV calling   
•	VCF Merge prep steps. A number of scripts are applied to ensure samples within VCF files are properly oriented prior to merging, and calls are properly formatted   
•	Intergenic variant rescue and confirmation via Lancet. A number of scripts are applied to determine if intergenic variants are supported by the caller Lancet   
•	SNV VCF files are merged across all SNV callers  
•	VEP annotation of merged SNVs  
•	Delly CNV calls are annotated
•	SV calls are merged across callers  
•	SV calls are annotated with known insertions, deletions, inversions, and exclusion regions. Annotation is done at 80% overlap between the called SV event and the known event size.
•	SV calls are annotated with CNV regions  
•	A final filter is applied to CNV annotated SV calls  
•	MultiQC report generation 
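
The 80% overlap criterion used for SV annotation in the list above can be illustrated with a minimal Python sketch. This is one reading of the rule, not the pipeline's own script: it assumes overlap is measured as the fraction of the known event covered by the called event, and that coordinates are 0-based and half-open. The function names `overlap_fraction` and `matches_known_event` are hypothetical.

```python
def overlap_fraction(call, known):
    """Fraction of the known event that is covered by the called event.

    Both arguments are (chrom, start, end) tuples. Coordinates are assumed
    to be 0-based and half-open for this sketch.
    """
    if call[0] != known[0]:
        return 0.0
    shared = min(call[2], known[2]) - max(call[1], known[1])
    known_size = known[2] - known[1]
    if shared <= 0 or known_size <= 0:
        return 0.0
    return shared / known_size


def matches_known_event(call, known, min_overlap=0.8):
    """True when the call covers at least `min_overlap` of the known event."""
    return overlap_fraction(call, known) >= min_overlap


known_del = ("1", 1000, 2000)
print(matches_known_event(("1", 1100, 2500), known_del))  # 900 of 1000 bases shared: True
print(matches_known_event(("1", 1500, 2500), known_del))  # 500 of 1000 bases shared: False
```

If the pipeline instead applies a reciprocal overlap, the same check would also need to be run with the two arguments swapped.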
flowchart TD
    p00((CSV Sample Sheet))
    p01[PARSE_SAMPLE_SHEET:CONCATENATE_READS]
    p00 --> p01
    p01 --> |Tumor Sample| t02
    p01 --> |"`Normal Sample:
               **If Normal Sample 
               Provided**`"| n02


    subgraph tumor [  ]
        t02[FASTP]
        t03[BWA_MEM]
        t04[PICARD_SORTSAM]
        t05[PICARD_MARKDUPLICATES]
        %% NOTE: BaseRecalibrator and BQSR requires known sites to recal around.

        to1([Tumor Genomic Bam]):::output
        t02 --> t03
        t03 --> t04
        t04 --> t05
        t05 --> to1
        
    end

    subgraph normal [  ]
        n02[FASTP]
        n03[BWA_MEM]
        n04[PICARD_SORTSAM]
        n05[PICARD_MARKDUPLICATES]
        %% REMOVE DUPE
        
        %% Indel realigner from GATK

        %% NOTE: BaseRecalibrator and BQSR requires known sites to recal around.
        no1([Normal Genomic Bam]):::output

        n02 --> n03
        n03 --> n04
        n04 --> n05
        n05 --> no1
    end


    no1 -..-> |If Normal\nSample Provided| m1
    altBAM -..-> |If No Normal\nSample Provided| m1
    to1 --> m1
    altBAM[(ALT\nBAM)]
    m1 --> p35
    m1 --> p39
    m1 --> p41 
    m1 --> p42
    m1 --> p43 
    m1 --> p44
    m1 --> p54
    m1 --> p57.1
    m1((Join:\nTumor & Normal))


    subgraph germline [  ]
        no1 -..-> |If Normal\nSample Provided|p20
        p20[GATK_HAPLOTYPECALLER_SV_GERMLINE]
        p21[GATK_SORTVCF_GERMLINE]
        p27[BCFTOOLS_GERMLINE_FILTER]
        %% p28[BCFTOOLS_SPLITMULTIALLELIC_REGIONS]
        %% p29[VEP_GERMLINE]
        %% p30[BCFTOOLS_REMOVESPANNING]
        %% p33[SNPSIFT_ANNOTATE_DBSNP_GERMLINE]
        %% p34[GERMLINE_VCF_FINALIZATION]
        annot_summary1{{Germline Annotation via VEP\nSteps truncated for figure clarity}}

        o3([Germline Variants]):::output
        o4([Annotated Germline Variants]):::output

        p20 --> p21
        p21 --> p27
        p27 --> o3
        o3 --> annot_summary1
        subgraph germline_annot [  ]
            %% p27 --> p28
            %% p28 --> p29
            %% p29 --> p30
            %% p30 --> p31
            %% p31 --> p32 
            %% p32 --> p33 
            %% p33 --> p34
            %% p34 --> o4
            annot_summary1 --> o4
        end
    end



    subgraph somatic_variant [  ]

        p35[GATK_MUTECT2]
        p36[GATK_SORTVCF_MUTECT]
        p37[GATK_MERGEMUTECTSTATS]
        p38[GATK_FILTERMUECTCALLS]
        o5([Mutect2 SNV Calls]):::output

        p44[LANCET]
        p45[GATK_SORTVCF_LANCET]
        oo6([Lancet SNV Calls]):::output


        p39[DELLY_SOMATIC]
        p40[DELLY_FILTER_SOMATIC]

        o6([DELLY SV Calls]):::output

        p41[MANTA]
        o7([Manta SV and SNV Calls]):::output

        p42[STRELKA2]
        o8([Strelka2 SNV Calls]):::output

        p43[SMOOVE]
        o9([SMOOVE_Lumpy SV Calls]):::output

        p54[DELLY_CNV_SOMATIC]

        p55[BCFTOOLS_MERGE_DELLY_CNV]
        p56[DELLY_CLASSIFY]
        p57[BCFTOOLS_QUERY_DELLY_CNV]
        o10([DELLY CNV Calls]):::output

        p57.1[SVABA]
        o10.1([SVABA SV and SNV Calls]):::output




        p35 --> p36
        p36 --> p37 
        p37 --> p38
        p38 --> o5
        
        p39 --> p40
        p40 --> o6 

        p41 --> o7

        o7 --> p42
        p42 --> o8

        p43 --> o9

        p44 --> p45
        p45 --> oo6

        p54 --> p55
        p55 --> p56
        p56 --> p57
        p57 --> o10

        p57.1 --> o10.1


        note1{{Merge prep.\nSteps truncated}}
        note2{{Merge prep.\nSteps truncated}}
        note3{{Merge prep.\nSteps truncated}}
        note3.5{{Merge prep.\nSteps truncated}}
        note3.6{{Merge prep.\nSteps truncated}}
        %% p57[RENAME_METADATA]
        %% p58[MERGE_PREP]
        %% p59[RENAME_VCF]
        %% p60[COMPRESS_INDEX_VCF]
        %% p61[BCFTOOLS_SPLITMULTIALLELIC]
        %% p62[SPLIT_MNV]
        %% p63[GATK_SORTVCF_TOOLS]

        o5 --> note1
        oo6 --> note2
        o8 --> note3
        o7 --> note3.5
        o10.1 --> note3.6

    end

    note1 --> p64
    note2 --> p64
    note3 --> p64
    note3.5 --> p64
    note3.6 --> p64
    p64[BCFTOOLS_MERGECALLERS]
    %% p65[COMPRESS_INDEX_VCF_ALL_CALLERS]
    p64 --> note4
    subgraph lancet_confirm [  ]

        %% NOTE: There are many lancet confirm steps. First prep, then confirm, then re-merge.  
        %% p66[BEDTOOLS_STARTCANDIDATES]
        %% p67[GET_CANDIDATES]
        %% p68[COMPRESS_INDEX_VCF_REGION]
        %% p69[VCF_TO_BED]
        %% p70[LANCET_CONFIRM]
        %% p71[COMPRESS_INDEX_VCF_REGION_LANCET]
        %% p72[BCFTOOLS_INTERSECTVCFS]
        %% p73[RENAME_METADATA_LANCET]
        %% p74[MERGE_PREP_LANCET]
        %% p75[RENAME_VCF_LANCET]
        %% p76[COMPRESS_INDEX_VCF_LANCET]
        %% p77[BCFTOOLS_SPLITMULTIALLELIC_LANCET]
        %% p78[SPLIT_MNV_LANCET]
        %% p79[REMOVE_CONTIG]
        %% p80[GATK_SORTVCF_TOOLS_LANCET]
        %% p81[BCFTOOLS_MERGECALLERS_FINAL]
        %% p82[COMPRESS_INDEX_VCF_MERGED]
        %% p83[MERGE_COLUMNS]
        %% p84[ADD_NYGC_ALLELE_COUNTS]
        %% p85[ADD_FINAL_ALLELE_COUNTS]
        %% p86[FILTER_PON]
        %% p87[FILTER_VCF]
        %% p88[SNV_TO_MNV_FINAL_FILTER]
        %% p89[GATK_SORTVCF_SOMATIC]
        %% p90[REORDER_VCF_COLUMNS]
        %% p91[COMPRESS_INDEX_MERGED_VCF]
        note0[[The steps in this subgraph are truncated for clarity\nManta used as 'support' calls]]
        note4{{Extract non exonic variant calls}}
        note5{{Confirm non exonic variant calls with Lancet}}
        note6{{Merge Lancet confirmed to exonic calls}}
       
        note4 --> note5
        note5 --> note6
        %% note6 --> note7
    end

    subgraph snv_annotate [  ]
        p92[VEP_SOMATIC]
        p95[SNPSIFT_ANNOTATE_DBSNP_SOMATIC]
        p96[SOMATIC_VCF_FINALIZATION]
        o11([Annotated filtered\nSomatic SNV and InDELs Calls]):::output

        %% p105[:FILTER_BEDPE]
        %% p106[:FILTER_BEDPE_SUPPLEMENTAL]
        note6 --> p92
        p92 --> p95
        p95 --> p96
        p96 --> o11
        %% note2{{}}
        
    end

    subgraph cnv_sv_annotate [  ]
        o10 --> p97
        p97[ANNOTATE_DELLY_CNV]
        o12([Annotated CNV Regions]):::output
        o7 --> p98 
        o9 --> p98 
        o6 --> p98
        o10.1 --> p98
        p98[MERGE_SV]
        p99[ANNOTATE_SV]
        p100[ANNOTATE_SV_SUPPLEMENTAL]
        p101[ANNOTATE_GENES_SV]
        p102[ANNOTATE_GENES_SV_SUPPLEMENTAL]

        p103[ANNOTATE_SV_WITH_CNV]
        p104[ANNOTATE_SV_WITH_CNV_SUPPLEMENTAL]
        o13([Annotated SV Calls]):::output
        o14([Annotated SV]):::output

        p97 --> o12

        p98 --> p99
        p99 --> p100
        p100 --> p101
        p101 --> p102
        p102 --> o13

        p102 --> p103
        p97 --> p103
        p103 --> p104
        p104 --> o14
    end

    o11 ~~~ note42

    subgraph qc [  ]
        temp0((Fastq Files\nFrom Above))
        temp1((Genomic BAMs\nFrom Above))
        temp2((Logs from:\nBWA\nTrimming\nMark Duplicates))
        p03[FASTQC]
        p13[PICARD_COLLECTALIGNMENTSUMMARYMETRICS]
        p14[PICARD_COLLECTWGSMETRICS]
        p142[MULTIQC]
        o15([MultiQC Report]):::output
        note42[[For clarity\nQC steps not connected to main graph]]
        temp0 --> p03
        temp1 --> p13
        temp1 --> p14
        temp2 --> p142
        p03 --> p142
        p13 --> p142
        p14 --> p142
        p142 --> o15
        %% n02 --> p03
        %% t02 --> p03
        %% to1 --> p13
        %% to1 --> p14
        %% no1 --> p13
        %% no1 --> p14
    end


classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000

style tumor stroke:#333,stroke-width:2px
style normal stroke:#333,stroke-width:2px
style germline stroke:#333,stroke-width:2px
style somatic_variant stroke:#333,stroke-width:2px
style lancet_confirm stroke:#333,stroke-width:2px
style snv_annotate stroke:#333,stroke-width:2px
style cnv_sv_annotate stroke:#333,stroke-width:2px

style qc stroke:#333,stroke-width:2px

Parameters for PTA Pipeline

  • --pubdir

    • Default: /<PATH>
    • Comment: The directory in which the saved outputs will be stored.
  • --organize_by

    • Default: sample
    • Comment: How to organize the output folder structure. Options: sample or analysis.
  • --cacheDir

    • Default: /projects/omics_share/meta/containers
    • Comment: This is the directory that contains cached Singularity containers. JAX users should not change this parameter.
  • -w

    • Default: /<PATH>
    • Comment: The working directory used for all intermediate files and Nextflow processes. This directory can become quite large and should be located on /fastscratch or another directory with ample storage.
  • --csv_input

    • Default: /<FILE_PATH>
    • Comment: Comma-delimited (CSV) sample sheet that controls how samples are processed. The required input header is: patient,sex,status,sampleID,lane,fastq_1,fastq_2. See the CSV Input Sample Sheet section below for additional information on the file format.
  • --deduplicate_reads

    • Default: false
    • Comment: Options: false, true. If specified, run bbmap clumpify on input reads. Clumpify will deduplicate reads prior to trimming. This can help with mapping and downstream steps when analyzing high coverage WGS data.
  • --coverage_cap

  • --primary_chrom_bed

    • Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/Mus_musculus.GRCm39.dna.primary_assembly.bed'
    • Comment: A bed file containing the primary chromosomes with positions. Used to limit jvarkit 'Biostar154220' to those regions with expected coverage.
  • --split_fastq

    • Default: false
    • Comment: If specified, FASTQ files will be split into chunks sized based on split_fastq_bin_size prior to mapping. This option is recommended for high coverage data.
  • --split_fastq_bin_size

    • Default: 10000000
    • Comment: If split_fastq is specified, FASTQ files will be split into chunks of this size prior to mapping.
  • --quality_phred

    • Default: 15
    • Comment: The quality value that is required for a base to pass. Default: 15 which is a phred quality score of >=Q15.
  • --unqualified_perc

    • Default: 40
    • Comment: Percent of bases that are allowed to be unqualified (0~100). Default: 40 which is 40%.
  • --detect_adapter_for_pe

    • Default: false
    • Comment: If true, adapter auto-detection is used for paired-end data. By default, adapter sequence auto-detection is disabled for paired-end data because the adapters can be trimmed by overlap analysis; --detect_adapter_for_pe enables it. Fastp runs a little slower if you specify the adapter sequences or enable adapter auto-detection, but this usually results in slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
  • --ref_fa

    • Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/ensembl/GRCm39.p0/Mus_musculus.GRCm39.dna.primary_assembly.fa'
    • Comment: The reference fasta to be used throughout the process for alignment as well as any downstream analysis, points to human reference when --gen_org human. JAX users should not change this parameter.
  • --ref_fa_indices

    • Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/indices/ensembl/v105/bwa/Mus_musculus.GRCm39.dna.primary_assembly.fa'
    • Comment: Pre-compiled BWA index files. JAX users should not change this parameter.
  • --ref_fa_dict

    • Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/ensembl/GRCm39.p0/Mus_musculus.GRCm39.dna.primary_assembly.dict'
    • Comment: FASTA dictionary file. JAX users should not change this parameter.
  • --combined_reference_set

    • Default: '/projects/compsci/omics_share/mouse/GRCm39/supporting_files/PTA_inputs/combined_ref_set/Mus_musculus.GRCm39.dna.primary_assembly.fa'
    • Comment: Several tools (GRIDSS, SVABA) require the reference and BWA index files to be in the same directory. Links are used within this directory to avoid duplication of the FASTA and BWA indices. See note in directory.
  • --mismatch_penalty

    • Default: -B 8
    • Comment: The BWA penalty for a mismatch.
  • --dbSNP

    • Default: '/projects/omics_share/mouse/GRCm39/genome/annotation/snps_indels/GCA_000001635.9_current_ids.vcf.g'
    • Comment: Used in variant annotation and by SVABA. JAX users should not change this parameter.
  • --dbSNP_index

    • Default: '/projects/omics_share/mouse/GRCm39/genome/annotation/snps_indels/GCA_000001635.9_current_ids.vcf.gz.tbi'
    • Comment: Index associated with the dbsnp file.
  • --chrom_contigs

    • Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/ensembl/GRCm39.p0/Mus_musculus.GRCm39.dna.primary_assembly.primaryChr.contig_list'
    • Comment: Contig list used for scatter / gather in calling and annotation.
  • --chrom_intervals

    • Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/GRCm39_calling_intervals/'
    • Comment: Chromosome intervals used for scatter gather in calling.
  • --call_val

    • Default: 50
    • Comment: The minimum phred-scaled confidence threshold at which variants should be called.
  • --ploidy_val

    • Default: '-ploidy 2'
    • Comment: Sample ploidy used by Haplotypecaller in germline small variant calling.
  • --excludeIntervalList

    • Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/mm39.excluderanges.interval_list'
    • Comment: Germline caller exclusion list.
  • --intervalListBed

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/filtering/SureSelect_V6plusCOSMIC.target.GRCh38_full_analysis_set_plus_decoy_hla.interval_list.bed'
    • Comment: This file is used to extract small variants in non-exonic regions. Such calls are then attempted to be recovered via Lancet calls.
  • --lancet_beds_directory

    • Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/lancet_chr_beds/'
    • Comment: Lancet interval bed files used in calling by that tool.
  • --delly_exclusion

    • Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/GRCm39_gap_delly_exclusion.txt'
    • Comment: Delly CNV calling exclusion list.
  • --delly_mappability

    • Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/mappability/GRCm39.p0.map.gz'
    • Comment: Delly CNV calling mappability file.
  • --cnv_window

    • Default: 10000
    • Comment: Delly CNV calling read depth window size. Default value is tool default. This parameter is included for testing purposes only.
  • --cnv_min_size

    • Default: 10000
    • Comment: Delly CNV classification minimum CNV size. Default value is tool default. This parameter is included for testing purposes only.
  • --cnv_germline_prob

    • Default: 0.00100000005
    • Comment: Delly CNV classification germline probability. Default value is tool default. This parameter is included for testing purposes only.
  • --callRegions

    • Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/GRCm39.callregions.bed.gz'
    • Comment: Manta calling regions. Provided by the tool developer resource pack.
  • --strelka_config

    • Default: '/projects/compsci/omics_share/mouse/GRCm39/supporting_files/PTA_inputs/configs/configureStrelkaSomaticWorkflow.py.ini'
    • Comment: Strelka input configuration. Provided by the tool developer resource pack.
  • --vep_cache_directory

    • Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/vep_data'
    • Comment: VEP annotation cache. Cache provided is for Ensembl v109.
  • --vep_fasta

    • Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/ensembl/GRCm39.p0/Mus_musculus.GRCm39.dna.primary_assembly.fa'
    • Comment: VEP requires an ensembl based fasta. GRCh38.p13 is used for v97-v109.
  • --cytoband

    • Default: '/projects/compsci/omics_share/mouse/GRCm39/supporting_files/PTA_inputs/annotations/GRCm38.liftedTo.GRCm39.cytoBand.UCSC.chr.sorted.bed'
    • Comment: Cytoband file used in CNV annotations. Derived from UCSC table, lifted from GRCm38 to GRCm39.
  • --known_del

    • Default: '/projects/omics_share/mouse/GRCm39/genome/annotation/struct_vars/ferraj_2023_inv_ins_del/variants_freeze5_sv_sym_DEL_mm39_sorted.bed'
    • Comment: Used in SV annotation, and filtering. Deletion calls from: https://pubmed.ncbi.nlm.nih.gov/37228752/
  • --known_ins

    • Default: '/projects/omics_share/mouse/GRCm39/genome/annotation/struct_vars/ferraj_2023_inv_ins_del/variants_freeze5_sv_sym_INS_mm39_sorted.bed'
    • Comment: Used in SV annotation, and filtering. Insertion calls from: https://pubmed.ncbi.nlm.nih.gov/37228752/
  • --known_inv

    • Default: '/projects/omics_share/mouse/GRCm39/genome/annotation/struct_vars/ferraj_2023_inv_ins_del/variants_freeze5_sv_sym_INV_mm39_sorted.bed'
    • Comment: Used in SV annotation, and filtering. Inversion calls from: https://pubmed.ncbi.nlm.nih.gov/37228752/
  • --ensemblUniqueBed

    • Default: '/projects/compsci/omics_share/mouse/GRCm39/supporting_files/PTA_inputs/annotations/ensembl_genes_unique_sorted.final.v110.chr.sorted.bed'
    • Comment: File used in CNV and SV annotation.
  • --gap

    • Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/GRCm39_gap.bed'
    • Comment: File used in SV annotation. From UCSC table browser.
  • --exclude_list

  • --proxy_normal_bam

    • Default: '/projects/omics_share/mouse/GRCm39/supporting_files/PTA_inputs/C57L_J/C57L_J_dedup.bam'
    • Comment: Proxy BAM file. Used in un-paired sample analysis. C57L_J at 30x is used by default.
  • --proxy_normal_bai

    • Default: '/projects/omics_share/mouse/GRCm39/supporting_files/PTA_inputs/C57L_J/C57L_J_dedup.bam.bai'
    • Comment: Proxy BAM index file. Used in un-paired sample analysis. C57L_J at 30x is used by default.
  • --proxy_normal_sampleName

    • Default: 'C57L_J'
    • Comment: Proxy sample name within the proxy BAM file. C57L_J used by default.
  • --read_type

    • Default: PE
    • Comment: Only 'PE' is accepted for this workflow.
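
The interaction of --quality_phred and --unqualified_perc described in the list above can be sketched in Python. This is an illustration of the rule as stated in the parameter comments, not fastp's implementation: a base is unqualified when its quality is below the threshold, and a read is kept when the percentage of unqualified bases does not exceed the limit. The Phred+33 encoding, the inclusive boundary, and the names `phred_scores` and `read_passes` are assumptions made for this sketch.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def read_passes(quality_string, quality_phred=15, unqualified_perc=40):
    """Apply the quality_phred / unqualified_perc rule to a single read."""
    scores = phred_scores(quality_string)
    if not scores:
        return False  # rejecting empty reads is a choice made for this sketch only
    unqualified = sum(score < quality_phred for score in scores)
    return 100 * unqualified / len(scores) <= unqualified_perc


# 'I' decodes to Q40 and '#' decodes to Q2 under Phred+33.
print(read_passes("IIIIII####"))  # 40% of bases below Q15: True
print(read_passes("IIIII#####"))  # 50% of bases below Q15: False
```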

CSV Input Sample Sheet

The required input header is: patient,sex,status,sampleID,lane,fastq_1,fastq_2. Samples can be provided either paired or un-paired.

  • The patient column defines how samples are paired. All combinations of normal and tumor samples that share the same patient ID will be paired.
  • The sex column is unused in the workflow at this time.
  • The status column defines whether each sample is normal (0) or tumor (1).
  • The sampleID column is a unique identifier for each individual sample, which is associated with other samples based on status and patient ID.
  • The lane column contains lane information for individual samples. If a single sample ID is provided with multiple lanes, the sequences from each lane will be concatenated prior to analysis.
  • The fastq_1 and fastq_2 columns must contain absolute paths to read 1 and read 2 from an Illumina paired-end sequencing run.
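
The column rules above can be checked before launching a run. The following is a minimal Python sketch of such a check, not part of the pipeline; the function name `validate_sample_sheet` and the wording of its messages are hypothetical, and it only tests the header, the status values, and that the FASTQ paths are absolute.

```python
import csv
import io
from pathlib import PurePosixPath

REQUIRED_HEADER = ["patient", "sex", "status", "sampleID", "lane", "fastq_1", "fastq_2"]


def validate_sample_sheet(sheet_text):
    """Return a list of problems found in the sheet; an empty list means none."""
    reader = csv.DictReader(io.StringIO(sheet_text))
    if reader.fieldnames != REQUIRED_HEADER:
        return ["header must be: " + ",".join(REQUIRED_HEADER)]
    problems = []
    for line_no, row in enumerate(reader, start=2):
        if row["status"] not in {"0", "1"}:
            problems.append(f"line {line_no}: status must be 0 (normal) or 1 (tumor)")
        for column in ("fastq_1", "fastq_2"):
            # Missing fields are read as None, so fall back to an empty string.
            if not PurePosixPath(row[column] or "").is_absolute():
                problems.append(f"line {line_no}: {column} must be an absolute path")
    return problems


GOOD_SHEET = """patient,sex,status,sampleID,lane,fastq_1,fastq_2
SAMPLE_42,XX,0,NORMAL_1,L1,/path/to/normal1_R1.fastq.gz,/path/to/normal1_R2.fastq.gz
SAMPLE_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz
"""

BAD_SHEET = """patient,sex,status,sampleID,lane,fastq_1,fastq_2
SAMPLE_42,XX,2,TUMOR_1,L1,reads/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz
"""

print(validate_sample_sheet(GOOD_SHEET))
for problem in validate_sample_sheet(BAD_SHEET):
    print(problem)
```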

Basic examples:

An example paired analysis:

patient,sex,status,sampleID,lane,fastq_1,fastq_2
SAMPLE_42,XX,0,NORMAL_1,L1,/path/to/normal1_R1.fastq.gz,/path/to/normal1_R2.fastq.gz
SAMPLE_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz

In the example case above the following output directories will be generated:

SAMPLE_42--NORMAL_1: Contains all NORMAL_1 sample specific files
SAMPLE_42--TUMOR_1: Contains all TUMOR_1 specific files
SAMPLE_42--TUMOR_1--NORMAL_1: Contains all TUMOR_1 by NORMAL_1 specific files

Additional information on outputs is provided below.

An example paired analysis with multiple lanes:

patient,sex,status,sampleID,lane,fastq_1,fastq_2
SAMPLE_42,XX,0,NORMAL_1,L1,/path/to/normal1_LANE1_R1.fastq.gz,/path/to/normal1_LANE1_R2.fastq.gz
SAMPLE_42,XX,0,NORMAL_1,L2,/path/to/normal1_LANE2_R1.fastq.gz,/path/to/normal1_LANE2_R2.fastq.gz
SAMPLE_42,XX,0,NORMAL_1,L3,/path/to/normal1_LANE3_R1.fastq.gz,/path/to/normal1_LANE3_R2.fastq.gz
SAMPLE_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz

In the example case above, the three lanes provided for the normal sample will be concatenated, and the concatenated reads will be passed forward for analysis. Samples with a single lane are passed forward for analysis as provided. A mix of multi-lane and single-lane samples can be provided.

The following output directories will be generated:

SAMPLE_42--NORMAL_1: Contains all NORMAL_1 sample specific files
SAMPLE_42--TUMOR_1: Contains all TUMOR_1 specific files
SAMPLE_42--TUMOR_1--NORMAL_1: Contains all TUMOR_1 by NORMAL_1 specific files

Additional information on outputs is provided below.
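
The lane handling described for the multi-lane example can be previewed with a short Python sketch that groups sample sheet rows by patient and sampleID and reports which samples would need concatenation. This mirrors the documented behavior only; it is not the pipeline's CONCATENATE_READS step, and the function name `group_lanes` is hypothetical.

```python
import csv
import io


def group_lanes(sheet_text):
    """Map (patient, sampleID) to the read 1 and read 2 files listed for it, in sheet order."""
    groups = {}
    for row in csv.DictReader(io.StringIO(sheet_text)):
        key = (row["patient"], row["sampleID"])
        entry = groups.setdefault(key, {"fastq_1": [], "fastq_2": []})
        entry["fastq_1"].append(row["fastq_1"])
        entry["fastq_2"].append(row["fastq_2"])
    return groups


SHEET = """patient,sex,status,sampleID,lane,fastq_1,fastq_2
SAMPLE_42,XX,0,NORMAL_1,L1,/path/to/normal1_LANE1_R1.fastq.gz,/path/to/normal1_LANE1_R2.fastq.gz
SAMPLE_42,XX,0,NORMAL_1,L2,/path/to/normal1_LANE2_R1.fastq.gz,/path/to/normal1_LANE2_R2.fastq.gz
SAMPLE_42,XX,0,NORMAL_1,L3,/path/to/normal1_LANE3_R1.fastq.gz,/path/to/normal1_LANE3_R2.fastq.gz
SAMPLE_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz
"""

for (patient, sample_id), files in group_lanes(SHEET).items():
    action = "concatenate" if len(files["fastq_1"]) > 1 else "use as provided"
    print(patient, sample_id, len(files["fastq_1"]), "lane(s):", action)
```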

An example unpaired analysis:

patient,sex,status,sampleID,lane,fastq_1,fastq_2
SAMPLE_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz

Note: When a tumor sample is provided without a matched normal, a proxy normal sample is used in somatic small variant calling, somatic structural variant calling, and CNV calling. Germline calling is not done on the proxy normal sample. A mix of samples with and without pairs can also be provided. By default, a C57L_J sample at 30x coverage is used as the proxy. The remainder of this document assumes that C57L_J was used.

The output directory structure of tumor only samples will be as follows:

SAMPLE_42--TUMOR_1: Contains all TUMOR_1 specific files
SAMPLE_42--TUMOR_1--C57L_J: Contains all TUMOR_1 by proxy C57L_J specific files

Additional information on outputs is provided below.

An example of mixed paired and unpaired analysis:

patient,sex,status,sampleID,lane,fastq_1,fastq_2
SAMPLE_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz
PATIENT_101,XX,0,NORMAL_1,L1,/path/to/normal1_R1.fastq.gz,/path/to/normal1_R2.fastq.gz
PATIENT_101,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz

The output directory structure of samples will be as follows:

SAMPLE_42--TUMOR_1: Contains all TUMOR_1 specific files
SAMPLE_42--TUMOR_1--C57L_J: Contains all TUMOR_1 by C57L_J specific files
PATIENT_101--NORMAL_1: Contains all NORMAL_1 sample specific files
PATIENT_101--TUMOR_1: Contains all TUMOR_1 specific files
PATIENT_101--TUMOR_1--NORMAL_1: Contains all TUMOR_1 by NORMAL_1 specific files

Additional information on outputs is provided below.

Many samples for one patient:

The workflow supports one-to-many, many-to-one, and many-to-many pairings of normal and tumor samples.

An example one to many analysis:

patient,sex,status,sampleID,lane,fastq_1,fastq_2
SAMPLE_42,XX,0,NORMAL_1,L1,/path/to/normal1_R1.fastq.gz,/path/to/normal1_R2.fastq.gz
SAMPLE_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz
SAMPLE_42,XX,1,TUMOR_2,L1,/path/to/tumor2_R1.fastq.gz,/path/to/tumor2_R2.fastq.gz
SAMPLE_42,XX,1,TUMOR_3,L1,/path/to/tumor3_R1.fastq.gz,/path/to/tumor3_R2.fastq.gz

In one-to-many, many-to-one, and many-to-many cases, all combinations of tumor and normal samples will be processed against one another.

In the example case above the following output directories will be generated:

SAMPLE_42--NORMAL_1: Contains all NORMAL_1 sample specific files
SAMPLE_42--TUMOR_1: Contains all TUMOR_1 specific files
SAMPLE_42--TUMOR_2: Contains all TUMOR_2 specific files
SAMPLE_42--TUMOR_3: Contains all TUMOR_3 specific files
SAMPLE_42--TUMOR_1--NORMAL_1: Contains all TUMOR_1 by NORMAL_1 specific files
SAMPLE_42--TUMOR_2--NORMAL_1: Contains all TUMOR_2 by NORMAL_1 specific files
SAMPLE_42--TUMOR_3--NORMAL_1: Contains all TUMOR_3 by NORMAL_1 specific files
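
The directory naming shown in the examples above can be reproduced with a small Python sketch, which may help when predicting the layout of a larger sample sheet. It is an illustration inferred from the listed examples rather than the pipeline's own logic: every tumor is combined with every normal that shares its patient ID, and the proxy name (C57L_J by default) is substituted when a patient has no normal. The function name `expected_output_dirs` is hypothetical, and behavior for a patient with only normal samples is not described in this document.

```python
import csv
import io


def expected_output_dirs(sheet_text, proxy_normal="C57L_J"):
    """List the per-sample and per-pair output directory names for a sample sheet."""
    normals, tumors, patients = {}, {}, []
    for row in csv.DictReader(io.StringIO(sheet_text)):
        patient = row["patient"]
        if patient not in patients:
            patients.append(patient)
        by_status = normals if row["status"] == "0" else tumors
        samples = by_status.setdefault(patient, [])
        if row["sampleID"] not in samples:  # multiple lanes share one sampleID
            samples.append(row["sampleID"])

    dirs = []
    for patient in patients:
        patient_normals = normals.get(patient, [])
        patient_tumors = tumors.get(patient, [])
        dirs += [f"{patient}--{sample}" for sample in patient_normals + patient_tumors]
        for tumor in patient_tumors:
            # The proxy name stands in when no normal shares the patient ID.
            for normal in patient_normals or [proxy_normal]:
                dirs.append(f"{patient}--{tumor}--{normal}")
    return dirs


ONE_TO_MANY = """patient,sex,status,sampleID,lane,fastq_1,fastq_2
SAMPLE_42,XX,0,NORMAL_1,L1,/path/to/normal1_R1.fastq.gz,/path/to/normal1_R2.fastq.gz
SAMPLE_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz
SAMPLE_42,XX,1,TUMOR_2,L1,/path/to/tumor2_R1.fastq.gz,/path/to/tumor2_R2.fastq.gz
SAMPLE_42,XX,1,TUMOR_3,L1,/path/to/tumor3_R1.fastq.gz,/path/to/tumor3_R2.fastq.gz
"""

UNPAIRED = """patient,sex,status,sampleID,lane,fastq_1,fastq_2
SAMPLE_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz
"""

for name in expected_output_dirs(ONE_TO_MANY):
    print(name)
```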

Pipeline Default Outputs

NOTE: * Represents a wild card that is a placeholder for values that will be filled by input file names and/or parameters when the pipeline is run.

NOTE: All files contained in 'stats' directories are captured by MultiQC reports.

The pipeline outputs several directories, each containing the files that apply to an individual sample or to a combination of samples.

Following the example naming in the csv section above for "an example paired analysis":

Normal specific results:

SAMPLE_42--NORMAL_1: Contains all NORMAL_1 sample specific files

| Naming Convention | Description |
| --- | --- |
| `*_haplotypecaller.gatk.filtered.vcf.gz` | Final filtered SNP and InDEL calls from HaplotypeCaller. |
| `*_germline_snv_indel_annotated_filtered_final.vcf` | Final filtered SNP and InDEL calls from HaplotypeCaller with VEP annotations. |
| `bam/*_dedup.bam` | Final duplicate marked bam file used in calling. |
| `bam/*_dedup.bai` | Bam index file. |
| `stats/*_stat` | BWA alignment metrics. |
| `stats/*_AlignmentMetrics.txt` | GATK Alignment metrics. |
| `stats/*_CollectWgsMetrics.txt` | GATK collect WGS metrics output. |
| `stats/*_dup_metrics.txt` | Picard mark duplicates output. |
| `stats/*_R1.fastq.gz_filtered_trimmed_fastqc.html` | FastQC report. |
| `stats/*_R2.fastq.gz_filtered_trimmed_fastqc.html` | FastQC report. |
| `stats/*R1.fastq.gz_filtered_trimmed_fastqc.zip` | FastQC report. |
| `stats/*R2.fastq.gz_filtered_trimmed_fastqc.zip` | FastQC report. |

NOTE: When tumor-only samples are run, no <PATIENT>--C57L_J directory is output, because files associated specifically with C57L_J are not relevant.

Tumor specific results:

SAMPLE_42--TUMOR_1: Contains all TUMOR_1 specific files

| Naming Convention | Description |
| --- | --- |
| `bam/*_dedup.bam` | Final duplicate marked bam file used in calling. |
| `bam/*_dedup.bai` | Bam index file. |
| `stats/*_stat` | BWA alignment metrics. |
| `stats/*_AlignmentMetrics.txt` | GATK Alignment metrics. |
| `stats/*_CollectWgsMetrics.txt` | GATK collect WGS metrics output. |
| `stats/*_dup_metrics.txt` | Picard mark duplicates output. |
| `stats/*_R1.fastq.gz_filtered_trimmed_fastqc.html` | FastQC report. |
| `stats/*_R2.fastq.gz_filtered_trimmed_fastqc.html` | FastQC report. |
| `stats/*R1.fastq.gz_filtered_trimmed_fastqc.zip` | FastQC report. |
| `stats/*R2.fastq.gz_filtered_trimmed_fastqc.zip` | FastQC report. |

Paired sample results:

SAMPLE_42--TUMOR_1--NORMAL_1: Contains all TUMOR_1 by NORMAL_1 specific files

| Naming Convention | Description |
| --- | --- |
| `*_cnv_annotated_final.bed` | Final CNV calls, restricted to high confidence calls and provided with annotations. |
| `*_cnv_annotated_supplemental.bed` | All CNV calls with annotations. |
| `*_somatic_snv_indel_annotated_filtered_final.vcf` | Final filtered somatic SNVs and InDELs based on Mutect2, Strelka, Svaba, and supported by Lancet. |
| `*_somatic_snv_indel_annotated_filtered_supplemental.vcf` | Supplementary information from filtered somatic SNVs and InDELs based on Mutect2, Strelka, Svaba, and supported by Lancet. |
| `*_somatic_snv_indel_annotated_filtered_final.txt` | Text extraction from the VCF of filtered somatic SNVs and InDELs based on Mutect2, Strelka, Svaba, and supported by Lancet. |
| `*_manta_lumpy_delly_svaba_sv_annotated_genes_cnv.bedpe` | Somatic structural variant calls pre-filtering. |
| `*_manta_lumpy_delly_svaba_sv_annotated_genes_cnv_supplemental.bedpe` | Supplementary somatic structural variant information for pre-filtered calls. |
| `*_sv_annotated_somatic_final.bedpe` | Somatic structural variant calls with polished annotations. |
| `*_sv_annotated_somatic_high_confidence_final.bedpe` | Somatic structural variant calls restricted to high confidence calls with polished annotations. |
| `*_sv_annotated_somatic_supplemental.bedpe` | Somatic structural variant calls with all annotations. |
| `*_sv_annotated_somatic_high_confidence_supplemental.bedpe` | Somatic structural variant calls restricted to high confidence calls with all annotations. |
| `callers/*_delly_somatic_cnv_classified.bcf` | Raw Delly CNV classification in BCF format. |
| `callers/*_delly_somatic_cnv_segmentation.bed` | Delly CNV segmentation regions in BED format, converted from BCF. |
| `cnv_plots/*.png` | Delly CNV plots by chromosome and genome wide. |
| `callers/*_delly_filtered_somaticSV.vcf.gz` | Delly filtered somatic SV calls. |
| `callers/*_lancet_merged.vcf.gz` | Lancet raw SNP/InDEL calls. |
| `callers/*_manta_candidateSmallIndels.vcf.gz` | Manta raw small indel calls. |
| `callers/*_manta_candidateSV.vcf.gz` | Manta raw candidate SV calls. |
| `callers/*_manta_diploidSV.vcf.gz` | Manta raw diploid SV calls. |
| `callers/*_manta_somaticSV.vcf.gz` | Manta raw somatic SV calls. These are the calls that are merged with other SV callers. |
| `callers/*_mutect2_somatic.filtered.vcf.gz` | Mutect2 calls filtered by GATK 'filtermutectcalls'. |
| `callers/*-smoove.genotyped.vcf.gz` | Smoove (Lumpy) raw SV calls. |
| `callers/*_strelka_somatic.indels.vcf.gz` | Strelka raw InDEL calls. |
| `callers/*_strelka_somatic.snvs.vcf.gz` | Strelka raw SNV calls. |
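
Because the naming conventions above use `*` as a wildcard, the final paired outputs can be located with ordinary glob patterns. The Python sketch below assumes the final files sit directly inside the paired sample directory, as the relative names in the table suggest; the function name `find_final_outputs`, the labels in `FINAL_PATTERNS`, and the demo file prefixes are hypothetical.

```python
import tempfile
from pathlib import Path

# Patterns copied from the naming conventions for paired sample results.
FINAL_PATTERNS = {
    "cnv": "*_cnv_annotated_final.bed",
    "snv_indel": "*_somatic_snv_indel_annotated_filtered_final.vcf",
    "sv": "*_sv_annotated_somatic_final.bedpe",
    "sv_high_confidence": "*_sv_annotated_somatic_high_confidence_final.bedpe",
}


def find_final_outputs(pair_dir):
    """Return the file names in `pair_dir` that match each final-output pattern."""
    pair_dir = Path(pair_dir)
    return {
        label: sorted(path.name for path in pair_dir.glob(pattern))
        for label, pattern in FINAL_PATTERNS.items()
    }


# Demo on a throwaway directory populated with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    pair_dir = Path(tmp) / "SAMPLE_42--TUMOR_1--NORMAL_1"
    pair_dir.mkdir()
    for name in (
        "demo_cnv_annotated_final.bed",
        "demo_sv_annotated_somatic_final.bedpe",
        "demo_sv_annotated_somatic_high_confidence_final.bedpe",
    ):
        (pair_dir / name).touch()
    found = find_final_outputs(pair_dir)

for label, names in found.items():
    print(label, names)
```

An empty list for a label means no file matching that pattern was present in the directory.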

Additional result output:

| Naming Convention | Description |
| --- | --- |
| `pta_report.html` | Nextflow autogenerated report. |
| `trace` | Nextflow autogenerated trace report for resource usage in tabular text format. |
| `multiqc` | MultiQC report summarizing quality metrics across samples in the analysis run. |

Pipeline Options Outputs

If the workflow is run with --keep_intermediate true, additional intermediate outputs will be saved. This option is recommended only for debugging purposes.
