Human PTA ReadMe

MikeWLloyd edited this page May 7, 2024 · 13 revisions

Paired Tumor Analysis (PTA) Documentation

Paired Tumor Analysis Pipeline (--workflow pta, --gen_org human)

For all input samples:

•   Fastp read quality and adapter trimming   
•   Get Read Group Information   
•   BWA-MEM Alignment   
•   Picard SortSam and Mark Duplicates   
•   Base Recalibrator and Apply BQSR   
•   Collect Alignment Summary Metrics   

If paired sample:

•   Conpair for sample contamination analysis   
•   Germline variant calling  
•   Germline variant filtering   
•   Germline variant annotation   

For all samples (NA12878 used if unpaired):

•	GATK Mutect2 variant caller  
•	Filter Mutect2 calls  
•	Lancet variant caller  
•	Manta SV caller   
•	Strelka2 SNV and InDEL caller   
•	Gridss SV caller   
•	Gripss SV filter of Gridss calls  
•	Bicseq2 sample normalization for normal and tumor  
•	Bicseq2 segmentation CNV calling (paired for paired, or on single sample for unpaired)  
•	MSIsensor2 MSI calling  
•	VCF Merge prep steps. A number of scripts are applied to ensure samples within VCF files are properly oriented prior to merging, and calls are properly formatted   
•	Intergenic variant rescue and confirmation via Lancet. A number of scripts are applied to determine if intergenic variants are supported by the caller Lancet   
•	SNV VCF files are merged across all SNV callers  
•	Panel of normal filtering is applied  
•	VEP annotation of merged SNVs  
•	COSMIC cancer resistance mutations, and COSMIC cancer gene annotations are added to merged SNV calls  
•	Bicseq2 CNV calls are annotated   
•	SV calls are merged across callers  
•	SV calls are annotated   
•	SV calls are annotated with CNV regions  
•	A final filter is applied to CNV annotated SV calls  
•	MultiQC report generation 
flowchart TD
    p00((CSV Sample Sheet))
    p01[PARSE_SAMPLE_SHEET:CONCATENATE_READS]
    p00 --> p01
    p01 --> |Tumor Sample| t02
    p01 --> |"`Normal Sample:
               **If Normal Sample 
               Provided**`"| n02


    subgraph tumor [  ]

        t02[FASTP]
        opt1[XENGSORT_CLASSIFY]
        t05[BWA_MEM]
        t06[PICARD_SORTSAM]
        t07[SHORT_ALIGNMENT_MARKING]
        t08[PICARD_CLEANSAM]
        t09[PICARD_FIX_MATE_INFORMATION]
        t10[PICARD_MARKDUPLICATES]
        t11[GATK_BASERECALIBRATOR]
        t12[GATK_APPLYBQSR]
        %% t16[GATK_GETSAMPLENAME_TUMOR]
        t01([Tumor Genomic Bam]):::output

        t02 -..-> |PDX Sample| opt1

        t02 --> |Human Sample| t05
        opt1 -..-> |Human Reads| t05
        t05 --> t06
        t06 --> t07
        t07 --> t08
        t08 --> t09
        t09 --> t10
        t10 --> t11
        t11 --> t12
        t12 --> t01
        
    end

    subgraph normal [  ]
        n02[FASTP]
        n05[BWA_MEM]
        n06[PICARD_SORTSAM]
        n07[SHORT_ALIGNMENT_MARKING]
        n08[PICARD_CLEANSAM]
        n09[PICARD_FIX_MATE_INFORMATION]
        n10[PICARD_MARKDUPLICATES]
        n11[GATK_BASERECALIBRATOR]
        n12[GATK_APPLYBQSR]
        %% n15[GATK_GETSAMPLENAME_NORMAL]
        n01([Normal Genomic Bam]):::output

        n02 --> n05
        n05 --> n06
        n06 --> n07
        n07 --> n08
        n08 --> n09
        n09 --> n10
        n10 --> n11
        n11 --> n12
        n12 --> n01
    end

    subgraph germline [  ]
        n01 -..-> |If Normal\nSample Provided|p20
        p20[GATK_HAPLOTYPECALLER_SV_GERMLINE]
        p21[GATK_SORTVCF_GERMLINE]
        p22[GATK_GENOTYPE_GVCF]
        p23[GATK_CNNSCORE_VARIANTS]
        p24[GATK_SORTVCF_GENOTYPE]
        p25[GATK_FILTER_VARIANT_TRANCHES]

        p26[GATK_VARIANTFILTRATION_AF]
        p27[BCFTOOLS_GERMLINE_FILTER]
        %% p28[BCFTOOLS_SPLITMULTIALLELIC_REGIONS]
        %% p29[VEP_GERMLINE]
        %% p30[BCFTOOLS_REMOVESPANNING]
        %% p31[COSMIC_ANNOTATION]
        %% p32[COSMIC_CANCER_RESISTANCE_MUTATION_GERMLINE]
        %% p33[SNPSIFT_ANNOTATE_DBSNP_GERMLINE]
        %% p34[GERMLINE_VCF_FINALIZATION]
        annot_summary1{{Germline Annotation via VEP and COSMIC\nSteps truncated for figure clarity}}

        o3([Germline Variants]):::output
        o4([Annotated Germline Variants]):::output

        p20 --> p21
        p21 --> p22
        p22 --> p23
        p23 --> p24
        p24 --> p25
        p25 --> p26
        p26 --> p27
        p27 --> o3
        o3 --> annot_summary1
        subgraph germline_annot [  ]
            %% p27 --> p28
            %% p28 --> p29
            %% p29 --> p30
            %% p30 --> p31
            %% p31 --> p32 
            %% p32 --> p33 
            %% p33 --> p34
            %% p34 --> o4
            annot_summary1 --> o4
        end
    end



    n01 -..-> |If Normal\nSample Provided| m1
    altBAM -..-> |If No Normal\nSample Provided| m1
    t01 --> m1
    altBAM[(NA12878\nBAM)]
    m1 --> p35
    m1 --> p39
    m1 --> p41 
    m1 --> p42
    m1 --> p43 
    %% m1 --> p52 
    %% m1 --> p53 
    m1 --> summary1
    m1((Join:\nTumor & Normal))
    %% m1 -..-> |If Normal\nSample Provided| p17
    %% m1 -..-> |If Normal\nSample Provided| p18
    m1 -..-> |If Normal\nSample Provided| summary2

    subgraph conpair [  ]

        %% p17[CONPAIR_NORMAL_PILEUP]
        %% p18[CONPAIR_TUMOR_PILEUP]
        summary2{{CONPAIR: Tumor and Normal Pileups}}
        p19[CONPAIR]
        o2([Conpair sample contam. results]):::output

        %% p17 --> p19
        %% p18 --> p19
        summary2 --> p19
        p19 --> o2
    end


    subgraph somatic_variant [  ]

        p35[GATK_MUTECT2]
        p36[GATK_SORTVCF_MUTECT]
        p37[GATK_MERGEMUTECTSTATS]
        p38[GATK_FILTERMUECTCALLS]
        o5([Mutect2 SNV Calls]):::output

        p39[LANCET]
        p40[GATK_SORTVCF_LANCET]
        o6([Lancet SNV Calls]):::output

        p41[MANTA]
        o7([Manta SV and SNV Calls]):::output

        p42[STRELKA2]
        o8([Strelka2 SNV Calls]):::output

        p43[GRIDSS_PREPROCESS]
        p44[GRIDSS_ASSEMBLE]
        p45[GRIDSS_CALLING]
        p46[GRIDSS_CHROM_FILTER]
        p47[GRIPSS_SOMATIC_FILTER]
        o9([Gridss SV Calls]):::output

        %% p48[SAMTOOLS_STATS_INSERTSIZE_NORMAL]
        %% p49[SAMTOOLS_STATS_INSERTSIZE_TUMOR]
        %% p50[SAMTOOLS_FILTER_UNIQUE_NORMAL]
        %% p51[SAMTOOLS_FILTER_UNIQUE_TUMOR]
        %% p52[BICSEQ2_NORMALIZE_NORMAL]
        %% p53[BICSEQ2_NORMALIZE_TUMOR]
        summary1{{BICSEQ2\nPreprocessing Steps Truncated for Clarity}}
        p54[BICSEQ2_SEG]
        p55[BICSEQ2_SEG_UNPAIRED]
        o10([BICSEQ2 CNV Calls]):::output

        p35 --> p36
        p36 --> p37 
        p37 --> p38
        p38 --> o5
        
        p39 --> p40 
        p40 --> o6

        p41 --> o7

        o7 --> p42
        p42 --> o8

        p43 --> p44
        p44 --> p45 
        p45 --> p46 
        p46 --> p47
        p47 --> o9

        summary1 --> p54 
        summary1 -..-> |If No Normal\nSample Provided| p55

        p54 --> o10
        p55 --> o10

        note1{{Merge prep.\nSteps truncated}}
        note2{{Merge prep.\nSteps truncated}}
        note3{{Merge prep.\nSteps truncated}}
        notea4{{Merge prep.\nSteps truncated}}
        %% p57[RENAME_METADATA]
        %% p58[MERGE_PREP]
        %% p59[RENAME_VCF]
        %% p60[COMPRESS_INDEX_VCF]
        %% p61[BCFTOOLS_SPLITMULTIALLELIC]
        %% p62[SPLIT_MNV]
        %% p63[GATK_SORTVCF_TOOLS]

        o5 --> note1
        o6 --> note2
        o7 --> notea4
        o8 --> note3

    end
    

    note1 --> p64
    note2 --> p64
    note3 --> p64
    notea4 --> p64
    p64[BCFTOOLS_MERGECALLERS]
    %% p65[COMPRESS_INDEX_VCF_ALL_CALLERS]
    p64 --> note4
    subgraph lancet_confirm [  ]

        %% NOTE: There are many lancet confirm steps. First prep, then confirm, then re-merge.  
        %% p66[BEDTOOLS_STARTCANDIDATES]
        %% p67[GET_CANDIDATES]
        %% p68[COMPRESS_INDEX_VCF_REGION]
        %% p69[VCF_TO_BED]
        %% p70[LANCET_CONFIRM]
        %% p71[COMPRESS_INDEX_VCF_REGION_LANCET]
        %% p72[BCFTOOLS_INTERSECTVCFS]
        %% p73[RENAME_METADATA_LANCET]
        %% p74[MERGE_PREP_LANCET]
        %% p75[RENAME_VCF_LANCET]
        %% p76[COMPRESS_INDEX_VCF_LANCET]
        %% p77[BCFTOOLS_SPLITMULTIALLELIC_LANCET]
        %% p78[SPLIT_MNV_LANCET]
        %% p79[REMOVE_CONTIG]
        %% p80[GATK_SORTVCF_TOOLS_LANCET]
        %% p81[BCFTOOLS_MERGECALLERS_FINAL]
        %% p82[COMPRESS_INDEX_VCF_MERGED]
        %% p83[MERGE_COLUMNS]
        %% p84[ADD_NYGC_ALLELE_COUNTS]
        %% p85[ADD_FINAL_ALLELE_COUNTS]
        %% p86[FILTER_PON]
        %% p87[FILTER_VCF]
        %% p88[SNV_TO_MNV_FINAL_FILTER]
        %% p89[GATK_SORTVCF_SOMATIC]
        %% p90[REORDER_VCF_COLUMNS]
        %% p91[COMPRESS_INDEX_MERGED_VCF]
        note0[[The steps in this subgraph are truncated for clarity]]
        note4{{Extract non exonic variant calls}}
        note5{{Confirm non exonic variant calls with Lancet}}
        note6{{Merge Lancet confirmed to exonic calls}}
       
        note4 --> note5
        note5 --> note6
        %% note6 --> note7
    end

    subgraph snv_annotate [  ]
        p92[VEP_SOMATIC]
        p93[COSMIC_ANNOTATION_SOMATIC]
        p94[COSMIC_CANCER_RESISTANCE_MUTATION_SOMATIC]
        p95[SNPSIFT_ANNOTATE_DBSNP_SOMATIC]
        p96[SOMATIC_VCF_FINALIZATION]
        o11([Annotated filtered\nSomatic SNV and InDELs Calls]):::output

        %% p105[:FILTER_BEDPE]
        %% p106[:FILTER_BEDPE_SUPPLEMENTAL]
        note6 --> p92
        p92 --> p93 
        p93 --> p94 
        p94 --> p95 
        p95 --> p96
        p96 --> o11
        %% note2{{}}
        
    end

    subgraph cnv_sv_annotate [  ]
        o10 --> p97
        p97[ANNOTATE_BICSEQ2_CNV]
        o12([Annotated CNV Regions])
        o7 --> p98 
        o9 --> p98 
        p98[MERGE_SV]
        p99[ANNOTATE_SV]
        p100[ANNOTATE_SV_SUPPLEMENTAL]
        p101[ANNOTATE_GENES_SV]
        p102[ANNOTATE_GENES_SV_SUPPLEMENTAL]

        p103[ANNOTATE_SV_WITH_CNV]
        p104[ANNOTATE_SV_WITH_CNV_SUPPLEMENTAL]
        o13([Annotated SV Calls])
        o14([Annotated SV])

        p97 --> o12

        p98 --> p99
        p99 --> p100
        p100 --> p101
        p101 --> p102
        p102 --> o13

        p102 --> p103
        p97 --> p103
        p103 --> p104
        p104 --> o14
    end

    o11 ~~~ note42

    subgraph qc [  ]
        temp0((Fastq Files\nFrom Above))
        temp1((Genomic BAMs\nFrom Above))
        temp2((Logs from:\nBWA\nTrimming\nMark Duplicates))
        p03[FASTQC]
        p13[PICARD_COLLECTALIGNMENTSUMMARYMETRICS]
        p14[PICARD_COLLECTWGSMETRICS]
        p142[MULTIQC]
        o15([MultiQC Report]):::output
        note42[[For clarity\nQC steps not connected to main graph]]
        temp0 --> p03
        temp1 --> p13
        temp1 --> p14
        temp2 --> p142
        p03 --> p142
        p13 --> p142
        p14 --> p142
        p142 --> o15
        %% n02 --> p03
        %% t02 --> p03
        %% to1 --> p13
        %% to1 --> p14
        %% no1 --> p13
        %% no1 --> p14
    end

subgraph msi [  ]
    t01 -->  p56[MSISENSOR2_MSI] -->  msi_output([MSI Status])
end


classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000

style tumor stroke:#333,stroke-width:2px
style normal stroke:#333,stroke-width:2px
style germline stroke:#333,stroke-width:2px
style conpair stroke:#333,stroke-width:2px
style somatic_variant stroke:#333,stroke-width:2px
style lancet_confirm stroke:#333,stroke-width:2px
style snv_annotate stroke:#333,stroke-width:2px
style cnv_sv_annotate stroke:#333,stroke-width:2px
style msi stroke:#333,stroke-width:2px
style qc stroke:#333,stroke-width:2px
  • --pubdir

    • Default: /<PATH>
    • Comment: The directory where saved outputs will be stored.
  • --organize_by

    • Default: sample
    • Comment: How to organize the output folder structure. Options: sample or analysis.
  • --cacheDir

    • Default: /projects/omics_share/meta/containers
    • Comment: This is the directory that contains cached Singularity containers. JAX users should not change this parameter.
  • -w

    • Default: /<PATH>
    • Comment: The directory used for all intermediary files and Nextflow processes. This directory can become quite large, so it should be a location on /fastscratch or another directory with ample storage.
  • --csv_input

    • Default: /<FILE_PATH>
    • Comment: Comma-delimited (CSV) sample sheet that controls how samples are processed. The required input header is: patient,sex,status,sampleID,lane,fastq_1,fastq_2. See the note below on this page for additional information on file format.
  • --pdx

    • Default: false
    • Comment: Options: false, true. If specified, 'Xengsort' is run on reads to deconvolute human and mouse reads. Human only reads are used in analysis.
  • --xengsort_host_fasta

    • Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/imputed/rel_2112_v8/NOD_ShiLtJ.39.fa'
    • Comment: Xengsort host fasta file. Used by Xengsort Index when --pdx is run, and xengsort_idx_path is null or false.
  • --xengsort_idx_path

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/xengsort'
    • Comment: Xengsort index for deconvolution of human and mouse reads. Used when --pdx is run. If null, Xengsort Index is run using ref_fa and host_fa.
  • --xengsort_idx_name

    • Default: 'hg38_GRCm39-NOD_ShiLtJ'
    • Comment: Xengsort index name associated with files located in xengsort_idx_path or name given to outputs produced by Xengsort Index.
  • --deduplicate_reads

    • Default: false
    • Comment: Options: false, true. If specified, run bbmap clumpify on input reads. Clumpify will deduplicate reads prior to trimming. This can help with mapping and downstream steps when analyzing high coverage WGS data.
  • --coverage_cap

  • --primary_chrom_bed

    • Default: '/projects/compsci/omics_share/human/GRCh38/genome/annotation/intervals/Homo_sapiens_assembly38.primary_chrom.bed'
    • Comment: A bed file containing the primary chromosomes with positions. Used in limiting jvarkit 'Biostar154220' to those regions with expected coverage.
  • --split_fastq

    • Default: false
    • Comment: If specified, FASTQ files will be split into chunks sized based on split_fastq_bin_size prior to mapping. This option is recommended for high coverage data.
  • --split_fastq_bin_size

    • Default: 10000000
    • Comment: If split_fastq is specified, FASTQ files will be split into chunks of this size prior to mapping.
  • --quality_phred

    • Default: 15
    • Comment: The quality value that is required for a base to pass. Default: 15 which is a phred quality score of >=Q15.
  • --unqualified_perc

    • Default: 40
    • Comment: Percent of bases that are allowed to be unqualified (0~100). Default: 40 which is 40%.
  • --detect_adapter_for_pe

    • Default: false
    • Comment: If true, adapter auto-detection is used for paired-end data. By default, adapter sequence auto-detection is disabled for paired-end data because the adapters can be trimmed by overlap analysis; --detect_adapter_for_pe enables it. Fastp runs a little slower if you specify the adapter sequences or enable adapter auto-detection, but this usually results in slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
  • --ref_fa

    • Default: '/projects/omics_share/human/GRCh38/genome/sequence/gatk/Homo_sapiens_assembly38.fasta'
    • Comment: The reference fasta to be used throughout the process for alignment as well as any downstream analysis, points to human reference when --gen_org human. JAX users should not change this parameter.
  • --ref_fa_indices

    • Default: '/projects/omics_share/human/GRCh38/genome/indices/gatk/bwa/Homo_sapiens_assembly38.fasta'
    • Comment: Pre-compiled BWA index files. JAX users should not change this parameter.
  • --ref_fa_dict

    • Default: '/projects/omics_share/human/GRCh38/genome/sequence/gatk/Homo_sapiens_assembly38.dict'
    • Comment: FASTA dictionary file. JAX users should not change this parameter.
  • --combined_reference_set

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/combined_ref_set/Homo_sapiens_assembly38.fasta'
    • Comment: Several tools (GRIDSS, SVABA) require the reference and BWA index files to be in the same directory. Links are used within this directory to avoid duplication of the fasta and BWA indices. See note in directory.
  • --mismatch_penalty

    • Default: -B 8
    • Comment: The BWA penalty for a mismatch.
  • --gold_std_indels

    • Default: '/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz'
    • Comment: Used in GATK BaseRecalibrator and variant tranche recalibration derived from the GATK resource bundle. JAX users should not change this parameter.
  • --phase1_1000G

    • Default: '/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/1000G_phase1.snps.high_confidence.hg38.vcf.gz'
    • Comment: Used in GATK BaseRecalibrator derived from the GATK resource bundle. JAX users should not change this parameter.
  • --dbSNP

    • Default: '/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/dbsnp_151.vcf.gz'
    • Comment: Used in variant annotation, GATK BaseRecalibrator, variant tranche recalibration, and by SVABA. JAX users should not change this parameter.
  • --dbSNP_index

    • Default: '/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/dbsnp_151.vcf.gz.tbi'
    • Comment: Index associated with the dbsnp file.
  • --chrom_contigs

    • Default: '/projects/omics_share/human/GRCh38/genome/sequence/gatk/Homo_sapiens_assembly38.primaryChr.contig_list'
    • Comment: Contig list used for scatter / gather in calling and annotation.
  • --chrom_intervals

    • Default: '/projects/omics_share/human/GRCh38/genome/annotation/intervals/hg38_calling_intervals/'
    • Comment: Chromosome intervals used for scatter gather in calling.
  • --call_val

    • Default: 50
    • Comment: The minimum phred-scaled confidence threshold at which variants should be called.
  • --ploidy_val

    • Default: '-ploidy 2'
    • Comment: Sample ploidy used by Haplotypecaller in germline small variant calling.
  • --excludeIntervalList

    • Default: '/projects/compsci/omics_share/human/GRCh38/genome/annotation/intervals/hg38_haplotypeCaller_skip.interval_list'
    • Comment: Germline caller exclusion list.
  • --hapmap

    • Default: '/projects/compsci/omics_share/human/GRCh38/genome/annotation/snps_indels/hapmap_3.3.hg38.vcf.gz'
    • Comment: Variant tranche recalibration requirement derived from the GATK resource bundle.
  • --omni

    • Default: '/projects/compsci/omics_share/human/GRCh38/genome/annotation/snps_indels/1000G_omni2.5.hg38.vcf.gz'
    • Comment: Variant tranche recalibration requirement derived from the GATK resource bundle.
  • --pon_bed

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/filtering/WGS_1000g_GRCh38.pon.bed'
    • Comment: Panel of normal samples used in SNP and InDEL filtering.
  • --intervalListBed

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/filtering/SureSelect_V6plusCOSMIC.target.GRCh38_full_analysis_set_plus_decoy_hla.interval_list.bed'
    • Comment: This file is used to extract small variants in non-exonic regions. The pipeline then attempts to recover such calls via Lancet.
  • --lancet_beds_directory

    • Default: '/projects/omics_share/human/GRCh38/genome/annotation/intervals/lancet_chr_beds/'
    • Comment: Lancet interval bed files used in calling by that tool.
  • --mappability_directory

    • Default: '/projects/compsci/omics_share/human/GRCh38/genome/annotation/intervals/mappability'
    • Comment: Bicseq2 input requirement. Derived from the tool developer resource pack.
  • --bicseq2_chromList

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/configs/sampleId.bicseq2.config'
    • Comment: Bicseq2 config requirement. Derived from the tool developer resource pack.
  • --bicseq2_no_scaling

    • Default: false
    • Comment: false: estimate the 'lambda' smoothing factor from the data for CNV profile calling. true: use the standard 'lambda 4' smoothing for CNV profile calling. If BicSeq2 fails with an error, set this parameter to 'true'.
  • --germline_filtering_vcf

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/filtering/gnomad-and-ALL_GRCh38_sites.20170504.normalized.modified.PASS.vcf.gz'
    • Comment: Germline reference file used in Gridss SV call filtering. Provided by the tool developer resource pack.
  • --gripss_pon

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/gripss_pon'
    • Comment: Panel of normal files for Gripss SV call filtering. Provided by the tool developer resource pack.
  • --callRegions

    • Default: '/projects/compsci/omics_share/human/GRCh38/genome/annotation/intervals/GRCh38.callregions.bed.gz'
    • Comment: Manta calling regions. Provided by the tool developer resource pack.
  • --strelka_config

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/configs/configureStrelkaSomaticWorkflow.py.ini'
    • Comment: Strelka input configuration. Provided by the tool developer resource pack.
  • --msisensor_model

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/msisensor2/models_hg38'
    • Comment: Model files for MSI calling via MSIsensor2. Provided by the tool developer resource pack.
  • --vep_cache_directory

    • Default: '/projects/compsci/omics_share/human/GRCh38/genome/annotation/vep_data'
    • Comment: VEP annotation cache. Cache provided is for Ensembl v109.
  • --vep_fasta

    • Default: '/projects/compsci/omics_share/human/GRCh38/genome/sequence/ensembl/GRCh38.p13/Homo_sapiens.GRCh38.dna.primary_assembly.fa'
    • Comment: VEP requires an ensembl based fasta. GRCh38.p13 is used for v97-v109.
  • --cosmic_cgc

    • Default: '/projects/compsci/omics_share/human/GRCh38/genome/annotation/function/cancer_gene_census_v97.csv'
    • Comment: COSMIC Cancer Gene Census annotation file. Index for file required within same location.
  • --cosmic_cancer_resistance_muts

    • Default: '/projects/compsci/omics_share/human/GRCh38/genome/annotation/function/CosmicResistanceMutations.tsv.gz'
    • Comment: COSMIC Resistance Mutations file. Index for file required within same location.
  • --ensembl_entrez

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/GRCh39.p13_ensemblv109_entrez_id_map.csv'
    • Comment: Ensembl to Entrez gene ID to HGNC symbol mapping file. Used in somatic VCF finalization.
  • --cytoband

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/GRCh38.cytoBand.UCSC.chr.sorted.txt'
    • Comment: File used in bicseq2 annotations
  • --dgv

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/DGV.GRCh38_hg38_variants_2020-02-25.bed'
    • Comment: File used in bicseq2 annotations
  • --thousandG

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/1KGP.CNV.GRCh38.canvas.merged.bed'
    • Comment: File used in bicseq2 annotations
  • --cosmicUniqueBed

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/CosmicCompleteCNA_uniqIntervals.bed'
    • Comment: File used in bicseq2 annotations
  • --cancerCensusBed

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/cancer_gene_census.GRCh38-v92.bed'
    • Comment: File used in bicseq2 annotations and SV annotation.
  • --ensemblUniqueBed

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/ensembl_genes_unique_sorted.final.v93.chr.sorted.bed'
    • Comment: File used in bicseq2 annotations and SV annotation.
  • --gap

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/GRCh38.gap.UCSC.annotated.chr.sorted.bed'
    • Comment: File used in SV annotation.
  • --dgvBedpe

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/DGV.GRCh38_hg38_variants_2020-02-25.bedpe'
    • Comment: File used in SV annotation.
  • --thousandGVcf

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/1KGP.pruned_wAFs.PASS_and_MULTIALLELIC_Mosaic.GRCh38.vcf'
    • Comment: File used in SV annotation.
  • --svPon

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/1000G-SV-PON.survivor-merged.GRCh38.filtered.bedpe'
    • Comment: File used in SV annotation.
  • --cosmicBedPe

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/cosmic-sv-GRCh38-v92.bedpe'
    • Comment: File used in SV annotation.
  • --na12878_bam

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/NA12878/NA12878_realigned_BQSR.bam'
    • Comment: NA12878 BAM file. Used in un-paired sample analysis.
  • --na12878_bai

    • Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/NA12878/NA12878_realigned_BQSR.bai'
    • Comment: NA12878 BAM index file. Used in un-paired sample analysis.
  • --na12878_sampleName

    • Default: 'ERR194147_1.fastq.gz_filtered_trimmed'
    • Comment: NA12878 sample name within the NA12878 BAM file.
  • --read_type

    • Default: PE
    • Comment: Only 'PE' is accepted for this workflow.
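The parameters above are passed to Nextflow on the command line. As a rough illustration only, the following Python sketch assembles such a command from the flags documented on this page. The entry script name `main.nf`, the helper `build_pta_command`, and all example paths are assumptions made for this sketch; cluster profile options and any site-specific wrapper scripts are omitted.

```python
import shlex


def build_pta_command(pubdir, workdir, csv_input, pdx=False, main_script="main.nf"):
    """Assemble a Nextflow argument list for the human PTA workflow.

    main_script is a hypothetical path to the pipeline entry script;
    adjust it to wherever the pipeline is checked out.
    """
    args = [
        "nextflow", "run", main_script,
        "--workflow", "pta",        # workflow selector documented at the top of this page
        "--gen_org", "human",
        "--pubdir", pubdir,         # where saved outputs are stored
        "-w", workdir,              # Nextflow work directory (can become large)
        "--csv_input", csv_input,   # sample sheet described in the next section
    ]
    if pdx:
        # Enables Xengsort deconvolution of human and mouse reads.
        args += ["--pdx", "true"]
    return args


cmd = build_pta_command(
    pubdir="/path/to/results",
    workdir="/fastscratch/username/work",
    csv_input="/path/to/samples.csv",
    pdx=True,
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each value separate, so paths containing spaces are quoted correctly when printed or passed to `subprocess.run`.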

CSV Input Sample Sheet

The required input header is: patient,sex,status,sampleID,lane,fastq_1,fastq_2. Samples can be provided either paired or un-paired.

  • The patient column defines how samples are paired. All combinations of normal and tumor samples that share the same patient ID will be paired.
  • The sex column is unused in the workflow at this time.
  • The status column defines whether each sample is normal (0) or tumor (1).
  • The sampleID column is a unique identifier for each individual sample, which is associated with other samples based on status and patient ID.
  • The lane column contains lane information for individual samples. If a single sample ID is provided with multiple lanes, the sequences from each lane will be concatenated prior to analysis.
  • The fastq_1 and fastq_2 columns must contain absolute paths to read 1 and read 2 from an Illumina paired-end sequencing run.
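The column rules above can be checked before a run is launched. The sketch below is not the pipeline's own parser; it is a minimal, hypothetical pre-flight check (`validate_sample_sheet` is a name invented here) that tests only the rules stated in this section: the exact header, a status of 0 or 1, and absolute FASTQ paths.

```python
import csv
import io

REQUIRED_HEADER = ["patient", "sex", "status", "sampleID", "lane", "fastq_1", "fastq_2"]


def validate_sample_sheet(handle):
    """Return a list of problems found in a sample sheet (empty if none)."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != REQUIRED_HEADER:
        return [f"header must be {','.join(REQUIRED_HEADER)}, got {reader.fieldnames}"]
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if row["status"] not in ("0", "1"):
            problems.append(f"line {line_no}: status must be 0 (normal) or 1 (tumor)")
        for col in ("fastq_1", "fastq_2"):
            # Missing trailing fields are read as None, so fall back to "".
            if not (row[col] or "").startswith("/"):
                problems.append(f"line {line_no}: {col} must be an absolute path")
    return problems


# Hypothetical sheet with two deliberate mistakes on the second data row.
sheet = (
    "patient,sex,status,sampleID,lane,fastq_1,fastq_2\n"
    "PATIENT_42,XX,0,NORMAL_1,L1,/path/to/normal1_R1.fastq.gz,/path/to/normal1_R2.fastq.gz\n"
    "PATIENT_42,XX,2,TUMOR_1,L1,tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz\n"
)
problems = validate_sample_sheet(io.StringIO(sheet))
for problem in problems:
    print(problem)
```

The check does not test whether the FASTQ files exist, since that depends on the file system where the pipeline runs.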

Basic examples:

An example paired analysis:

patient,sex,status,sampleID,lane,fastq_1,fastq_2
PATIENT_42,XX,0,NORMAL_1,L1,/path/to/normal1_R1.fastq.gz,/path/to/normal1_R2.fastq.gz
PATIENT_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz

In the example case above the following output directories will be generated:

PATIENT_42--NORMAL_1: Contains all NORMAL_1 sample specific files
PATIENT_42--TUMOR_1: Contains all TUMOR_1 specific files
PATIENT_42--TUMOR_1--NORMAL_1: Contains all TUMOR_1 by NORMAL_1 specific files

Additional information on outputs is provided below.

An example paired analysis with multiple lanes:

patient,sex,status,sampleID,lane,fastq_1,fastq_2
PATIENT_42,XX,0,NORMAL_1,L1,/path/to/normal1_LANE1_R1.fastq.gz,/path/to/normal1_LANE1_R2.fastq.gz
PATIENT_42,XX,0,NORMAL_1,L2,/path/to/normal1_LANE2_R1.fastq.gz,/path/to/normal1_LANE2_R2.fastq.gz
PATIENT_42,XX,0,NORMAL_1,L3,/path/to/normal1_LANE3_R1.fastq.gz,/path/to/normal1_LANE3_R2.fastq.gz
PATIENT_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz

In the example case above the three lanes provided for the normal sample will be concatenated and the concatenated reads will be passed forward for analysis. Samples with a single lane will be passed forward for analysis. A mix of samples with multiple lanes, and single lanes can be provided.
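To make the lane handling concrete, the sketch below groups sample sheet rows by patient and sample ID and collects the per-lane FASTQ files that would be concatenated. It only illustrates the grouping described above and is not the pipeline's implementation; `group_lanes` and the file names are hypothetical, and the sketch assumes lanes are concatenated in sample sheet order.

```python
from collections import defaultdict


def group_lanes(rows):
    """Map (patient, sampleID) to the per-lane FASTQ files to concatenate.

    Read 1 and read 2 files are appended together for each row, so the two
    lists stay in matching lane order.
    """
    grouped = defaultdict(lambda: {"fastq_1": [], "fastq_2": []})
    for row in rows:
        key = (row["patient"], row["sampleID"])
        grouped[key]["fastq_1"].append(row["fastq_1"])
        grouped[key]["fastq_2"].append(row["fastq_2"])
    return dict(grouped)


# Hypothetical rows: a normal sample sequenced on two lanes, a tumor on one.
rows = [
    {"patient": "PATIENT_42", "sampleID": "NORMAL_1", "lane": "L1",
     "fastq_1": "/path/to/normal1_LANE1_R1.fastq.gz",
     "fastq_2": "/path/to/normal1_LANE1_R2.fastq.gz"},
    {"patient": "PATIENT_42", "sampleID": "NORMAL_1", "lane": "L2",
     "fastq_1": "/path/to/normal1_LANE2_R1.fastq.gz",
     "fastq_2": "/path/to/normal1_LANE2_R2.fastq.gz"},
    {"patient": "PATIENT_42", "sampleID": "TUMOR_1", "lane": "L1",
     "fastq_1": "/path/to/tumor1_R1.fastq.gz",
     "fastq_2": "/path/to/tumor1_R2.fastq.gz"},
]
lanes = group_lanes(rows)
for (patient, sample_id), files in lanes.items():
    print(patient, sample_id, len(files["fastq_1"]), "lane(s)")
```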

The following output directories will be generated:

PATIENT_42--NORMAL_1: Contains all NORMAL_1 sample specific files
PATIENT_42--TUMOR_1: Contains all TUMOR_1 specific files
PATIENT_42--TUMOR_1--NORMAL_1: Contains all TUMOR_1 by NORMAL_1 specific files

Additional information on outputs is provided below.

An example unpaired analysis:

patient,sex,status,sampleID,lane,fastq_1,fastq_2
PATIENT_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz

Note: In cases when tumor is provided without matched normal, NA12878 is used as a proxy normal sample in somatic small variant, and somatic structural variant calling. CNV calling is done with BicSeq2 on the tumor sample alone. Germline calling on NA12878 is not done. Sample contamination and concordance via Conpair is also not done against NA12878. A mix of samples with and without pairs can also be provided.

The output directory structure of tumor only samples will be as follows:

PATIENT_42--TUMOR_1: Contains all TUMOR_1 specific files
PATIENT_42--TUMOR_1--NA12878: Contains all TUMOR_1 by NA12878 specific files

Additional information on outputs is provided below.

An example of mixed paired and unpaired analysis:

patient,sex,status,sampleID,lane,fastq_1,fastq_2
PATIENT_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz
PATIENT_101,XX,0,NORMAL_1,L1,/path/to/normal1_R1.fastq.gz,/path/to/normal1_R2.fastq.gz
PATIENT_101,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz

The output directory structure of samples will be as follows:

PATIENT_42--TUMOR_1: Contains all TUMOR_1 specific files
PATIENT_42--TUMOR_1--NA12878: Contains all TUMOR_1 by NA12878 specific files
PATIENT_101--NORMAL_1: Contains all NORMAL_1 sample specific files
PATIENT_101--TUMOR_1: Contains all TUMOR_1 specific files
PATIENT_101--TUMOR_1--NORMAL_1: Contains all TUMOR_1 by NORMAL_1 specific files

Additional information on outputs is provided below.

Many samples for one patient:

The workflow supports one-to-many, many-to-one, and many-to-many pairings of normal and tumor samples.

An example one to many analysis:

patient,sex,status,sampleID,lane,fastq_1,fastq_2
PATIENT_42,XX,0,NORMAL_1,L1,/path/to/normal1_R1.fastq.gz,/path/to/normal1_R2.fastq.gz
PATIENT_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz
PATIENT_42,XX,1,TUMOR_2,L1,/path/to/tumor2_R1.fastq.gz,/path/to/tumor2_R2.fastq.gz
PATIENT_42,XX,1,TUMOR_3,L1,/path/to/tumor3_R1.fastq.gz,/path/to/tumor3_R2.fastq.gz

In one-to-many, many-to-one, and many-to-many cases, all combinations of samples will be processed against one another.

In the example case above the following output directories will be generated:

PATIENT_42--NORMAL_1: Contains all NORMAL_1 sample specific files
PATIENT_42--TUMOR_1: Contains all TUMOR_1 specific files
PATIENT_42--TUMOR_2: Contains all TUMOR_2 specific files
PATIENT_42--TUMOR_3: Contains all TUMOR_3 specific files
PATIENT_42--TUMOR_1--NORMAL_1: Contains all TUMOR_1 by NORMAL_1 specific files
PATIENT_42--TUMOR_2--NORMAL_1: Contains all TUMOR_2 by NORMAL_1 specific files
PATIENT_42--TUMOR_3--NORMAL_1: Contains all TUMOR_3 by NORMAL_1 specific files
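The pairing and directory naming rules described in this section can be summarized in a short sketch. This is an illustration of the documented behavior rather than the pipeline's code: `expected_output_dirs` is a hypothetical helper, and it assumes the default per-sample output organization and the NA12878 proxy normal for tumors without a matched normal.

```python
from collections import defaultdict


def expected_output_dirs(samples, proxy_normal="NA12878"):
    """List the output directory names expected for a sample sheet.

    samples: iterable of (patient, status, sampleID), with status 0 for
    normal and 1 for tumor.
    """
    normals, tumors = defaultdict(list), defaultdict(list)
    patients = []
    for patient, status, sample_id in samples:
        if patient not in patients:
            patients.append(patient)
        group = tumors if status == 1 else normals
        if sample_id not in group[patient]:  # several lanes share one sampleID
            group[patient].append(sample_id)

    dirs = []
    for patient in patients:
        dirs += [f"{patient}--{n}" for n in normals[patient]]
        dirs += [f"{patient}--{t}" for t in tumors[patient]]
        # Tumors with no matched normal are paired with the proxy normal,
        # which does not get a sample directory of its own.
        partners = normals[patient] or [proxy_normal]
        dirs += [f"{patient}--{t}--{n}" for t in tumors[patient] for n in partners]
    return dirs


one_to_many = expected_output_dirs([
    ("PATIENT_42", 0, "NORMAL_1"),
    ("PATIENT_42", 1, "TUMOR_1"),
    ("PATIENT_42", 1, "TUMOR_2"),
    ("PATIENT_42", 1, "TUMOR_3"),
])
unpaired = expected_output_dirs([("PATIENT_42", 1, "TUMOR_1")])
print("\n".join(one_to_many))
print("\n".join(unpaired))
```

Because every tumor is combined with every normal of the same patient, a patient with two normals and three tumors would yield six tumor-by-normal directories under these rules.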

Pipeline Default Outputs

NOTE: * Represents a wild card that is a placeholder for values that will be filled by input file names and/or parameters when the pipeline is run.

NOTE: All files contained in 'stats' directories are captured by MultiQC reports.

The pipeline outputs several directories, each containing files that apply to an individual sample or to a combination of samples.

Following the example naming in the csv section above for "an example paired analysis":

Normal specific results:

PATIENT_42--NORMAL_1: Contains all NORMAL_1 sample specific files

| Naming Convention | Description |
| --- | --- |
| `*_haplotypecaller.gatk.filtered.vcf.gz` | Final filtered SNP and InDEL calls from HaplotypeCaller. |
| `*_germline_snv_indel_annotated_filtered_final.vcf` | Final filtered SNP and InDEL calls from HaplotypeCaller with VEP annotations. |
| `bam/*_realigned_BQSR.bam` | Final duplicate-marked, BQSR realigned BAM file used in calling. |
| `bam/*_realigned_BQSR.bai` | BAM index file. |
| `stats/*_stat` | BWA alignment metrics. |
| `stats/*_AlignmentMetrics.txt` | GATK alignment metrics. |
| `stats/*_CollectWgsMetrics.txt` | GATK CollectWgsMetrics output. |
| `stats/*_recal_data.table` | GATK base recalibration table. |
| `stats/*_dup_metrics.txt` | Picard MarkDuplicates output. |
| `stats/*_insert_size.txt` | Estimated library insert size. |
| `stats/*_R1.fastq.gz_filtered_trimmed_fastqc.html` | FastQC report. |
| `stats/*_R2.fastq.gz_filtered_trimmed_fastqc.html` | FastQC report. |
| `stats/*R1.fastq.gz_filtered_trimmed_fastqc.zip` | FastQC report. |
| `stats/*R2.fastq.gz_filtered_trimmed_fastqc.zip` | FastQC report. |

NOTE: When tumor-only samples are run, there will be no <PATIENT>--NA12878 directory output, because files specific to NA12878 are not relevant.

Tumor specific results:

PATIENT_42--TUMOR_1: Contains all TUMOR_1 specific files

| Naming Convention | Description |
| --- | --- |
| `bam/*_realigned_BQSR.bam` | Final duplicate-marked, BQSR realigned BAM file used in calling. |
| `bam/*_realigned_BQSR.bai` | BAM index file. |
| `stats/*_stat` | BWA alignment metrics. |
| `stats/*_AlignmentMetrics.txt` | GATK alignment metrics. |
| `stats/*_CollectWgsMetrics.txt` | GATK CollectWgsMetrics output. |
| `stats/*_recal_data.table` | GATK base recalibration table. |
| `stats/*_dup_metrics.txt` | Picard MarkDuplicates output. |
| `stats/*_insert_size.txt` | Estimated library insert size. |
| `stats/*_R1.fastq.gz_filtered_trimmed_fastqc.html` | FastQC report. |
| `stats/*_R2.fastq.gz_filtered_trimmed_fastqc.html` | FastQC report. |
| `stats/*R1.fastq.gz_filtered_trimmed_fastqc.zip` | FastQC report. |
| `stats/*R2.fastq.gz_filtered_trimmed_fastqc.zip` | FastQC report. |
| `msi/*msisensor` | MSI status. "The recommended msi score cutoff value is 20% (msi high: msi score >= 20%)" |
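
The quoted MSI cutoff can be applied with a small helper like the one below. This is a minimal sketch, not pipeline code: the function names are hypothetical, and `read_msi_score` assumes the `msi/*msisensor` file is a tab-separated table with a header row containing a `%` column followed by one data row. Check that layout against an actual output file before relying on it.

```python
import csv

MSI_HIGH_CUTOFF = 20.0  # percent, from the recommendation quoted above


def classify_msi(score_percent, cutoff=MSI_HIGH_CUTOFF):
    """Label a score as MSI-high when it is at or above the cutoff."""
    return "MSI-high" if score_percent >= cutoff else "not MSI-high"


def read_msi_score(path):
    """Read the MSI score from a file with the ASSUMED layout:
    a tab-separated header row with a '%' column, then one data row."""
    with open(path, newline="") as handle:
        first_row = next(csv.DictReader(handle, delimiter="\t"))
    return float(first_row["%"])
```

For example, `classify_msi(read_msi_score(path))` labels one sample, where `path` points to that sample's `msi/*msisensor` file.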

Paired sample results:

PATIENT_42--TUMOR_1--NORMAL_1: Contains all TUMOR_1 by NORMAL_1 specific files

| Naming Convention | Description |
| --- | --- |
| `*_concordance.txt` | Sample concordance from Conpair. |
| `*_contamination.txt` | Sample contamination from Conpair. |
| `*_cnv_annotated_final.bed` | Final CNV calls restricted to high-confidence calls and provided with annotations. |
| `*_cnv_annotated_supplemental.bed` | All CNV calls with annotations. |
| `*_somatic_snv_indel_annotated_filtered_final.vcf` | Final filtered somatic SNVs and InDELs based on Mutect2, Strelka, and supported by Lancet. |
| `*_somatic_snv_indel_annotated_filtered_supplemental.vcf` | Supplementary information from filtered somatic SNVs and InDELs based on Mutect2, Strelka, and supported by Lancet. |
| `*_somatic_snv_indel_annotated_filtered_final.txt` | Text extraction from the VCF of filtered somatic SNVs and InDELs based on Mutect2, Strelka, and supported by Lancet. |
| `*_somatic_snv_indel_annotated_filtered_final.maf` | MAF format converted from the somatic SNV and InDEL VCF. |
| `*_sv_annotated_somatic_final.bedpe` | Somatic structural variant calls with polished annotations. |
| `*_sv_annotated_somatic_high_confidence_final.bedpe` | Somatic structural variant calls restricted to high-confidence calls with polished annotations. |
| `*_sv_annotated_somatic_supplemental.bedpe` | Somatic structural variant calls with all annotations. |
| `*_sv_annotated_somatic_high_confidence_supplemental.bedpe` | Somatic structural variant calls restricted to high-confidence calls with all annotations. |
| `callers/*.bicseq2.png` | Bicseq2 output. |
| `callers/*.bicseq2.txt` | Bicseq2 raw CNV calls. |
| `callers/*_lancet_merged.vcf.gz` | Lancet raw SNP/InDEL calls. |
| `callers/*_manta_candidateSmallIndels.vcf.gz` | Manta raw small InDEL calls. |
| `callers/*_manta_candidateSV.vcf.gz` | Manta raw candidate SV calls. |
| `callers/*_manta_diploidSV.vcf.gz` | Manta raw diploid SV calls. |
| `callers/*_manta_somaticSV.vcf.gz` | Manta raw somatic SV calls. These are the calls that are merged with other SV callers. |
| `callers/*_mutect2_somatic.filtered.vcf.gz` | Mutect2 calls filtered by GATK FilterMutectCalls. |
| `callers/*_strelka_somatic.indels.vcf.gz` | Strelka raw InDEL calls. |
| `callers/*_strelka_somatic.snvs.vcf.gz` | Strelka raw SNV calls. |
| `callers/*_.gripss.filtered.vcf.gz` | Gridss SV calls as filtered by Gripss. |
| `stats/*_mutect2_somatic.filtered.vcf.gz.filteringStats.tsv` | FilterMutectCalls QC stats output. |
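
Because the `*` in each naming convention behaves like a shell wildcard, the final outputs of a paired run can be located with standard glob matching. The sketch below is illustrative only: the function names and the choice of which patterns count as "final" are assumptions made for this example, not something the pipeline defines.

```python
from pathlib import Path

# Patterns taken from the naming conventions listed above; the selection
# of "final" files is a choice made for this example.
FINAL_PATTERNS = [
    "*_cnv_annotated_final.bed",
    "*_somatic_snv_indel_annotated_filtered_final.vcf",
    "*_somatic_snv_indel_annotated_filtered_final.maf",
    "*_sv_annotated_somatic_final.bedpe",
    "*_sv_annotated_somatic_high_confidence_final.bedpe",
]


def find_final_outputs(pair_dir):
    """Map each pattern to the sorted file names it matches in pair_dir."""
    pair_dir = Path(pair_dir)
    return {
        pattern: sorted(path.name for path in pair_dir.glob(pattern))
        for pattern in FINAL_PATTERNS
    }


def missing_final_outputs(pair_dir):
    """Return the patterns that match no file in pair_dir."""
    return [
        pattern
        for pattern, matches in find_final_outputs(pair_dir).items()
        if not matches
    ]
```

Calling `missing_final_outputs("PATIENT_42--TUMOR_1--NORMAL_1")` after a run returns an empty list when every listed final file is present in that directory.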

Additional result output:

| Naming Convention | Description |
| --- | --- |
| `pta_report.html` | Nextflow autogenerated report. |
| `trace` | Nextflow autogenerated trace report for resource usage in tabular text format. |
| `multiqc` | MultiQC report summarizing quality metrics across samples in the analysis run. |

Pipeline Options Outputs

If the workflow is run with --keep_intermediate true, additional intermediate outputs will be saved. This option is recommended only for debugging purposes.
