Somatic WES PTA Pipeline ReadMe

Somatic Whole Exome Sequencing (WES) Paired Tumor Analysis (PTA) Documentation

Somatic WES PTA Pipeline (--workflow somatic_wes_pta)

Note: Steps 1-6 are run on the tumor and normal samples individually.

•	Step 1: FastP read and adapter trimming    
•	Step 2: Get Read Group Information   
•	Step 3 (optional, run for PDX): Xengsort human / mouse read disambiguation    
•	Step 4: BWA-MEM Alignment   
•	Step 5: Variant Preprocessing - Part 1 (Picard sortsam/mark duplicates)  
•	Step 6: Variant Pre-Processing – Part 2 (GATK Base Recalibrator Apply BQSR)  
•	Step 7: Sample contamination analysis & if FFPE: read orientation modeling
•	Step 8: Variant Pre-Processing – Part 2 (GATK Base Recalibrator Apply BQSR)  	
•	Step 9: Microsatellite Instability analysis with MSIsensor2   	
•	Step 10: Copy number variation calling with Sequenza  	
•	Step 11: Homologous recombination deficiency (HRD) with scarHRD  	
•	Step 12: Variant Calling (GATK Mutect2)  
•	Step 13: Variant Filtration (GATK FilterMutectCalls)   
•	Step 14: Post Variant Calling Annotation - Part 1 (Cosmic, SnpEff, SnpSift)    
•	Step 15: Tumor mutation burden calling   	
•	Step 16: Picard Collect HS Metrics  
•	Step 17: MultiQC report generation

flowchart TD
    p00((CSV Sample Sheet))
    p01[PARSE_SAMPLE_SHEET:CONCATENATE_READS]
    p00 --> p01
    p01 --> |Tumor Sample| t02
    p01 --> |"Normal Sample"| n02


    subgraph tumor [  ]

        t02[FASTP]
        opt1[XENGSORT_CLASSIFY]
        t05[BWA_MEM]
        t06[PICARD_SORTSAM]
        t10[PICARD_MARKDUPLICATES]
        t11[GATK_BASERECALIBRATOR]
        t12[GATK_APPLYBQSR]
        %% t16[GATK_GETSAMPLENAME_TUMOR]
        t01([Tumor Genomic Bam]):::output

        t02 -..-> |PDX Sample| opt1

        t02 --> |Human Sample| t05
        opt1 -..-> |Human Reads| t05
        t05 --> t06
        t06 --> t10
        t10 --> t11
        t11 --> t12
        t12 --> t01
        
    end

    subgraph normal [  ]
        n02[FASTP]
        n05[BWA_MEM]
        n06[PICARD_SORTSAM]
        n10[PICARD_MARKDUPLICATES]
        n11[GATK_BASERECALIBRATOR]
        n12[GATK_APPLYBQSR]
        %% n15[GATK_GETSAMPLENAME_NORMAL]
        n01([Normal Genomic Bam]):::output

        n02 --> n05
        n05 --> n06
        n06 --> n10
        n10 --> n11
        n11 --> n12
        n12 --> n01
    end

    n01 --> m1
    t01 --> m1


    m1((Join:\nTumor & Normal))

    m1 --> p14
    m1 --> pn1
    m1 -.FFPE.-> pn3

    subgraph somatic_variant [  ]
		pn1[GATK_GETPILEUPSUMMARIES]
		pn2[GATK_CALCULATECONTAMINATION]
		pn3[GATK_LEARNREADORIENTATIONMODEL]
		p14[GATK_MUTECT2]
		p15[GATK_FILTERMUTECTCALLS]
		o3([Raw Variant Calls]):::output
		p16[GATK_SELECTVARIANTS_SNP]
		p17[GATK_VARIANTFILTRATION_SNP]
		p18[GATK_SELECTVARIANTS_INDEL]
		p19[GATK_VARIANTFILTRATION_INDEL]
		p20[SNPSIFT_ANNOTATE_SNP_DBSNP]
		p21[SNPSIFT_ANNOTATE_SNP_COSMIC]
		p22[SNPEFF_SNP]
		p23[SNPSIFT_DBNSFP_SNP]
		p24[SNPEFF_ONEPERLINE_SNP]
		p25[SNPSIFT_ANNOTATE_INDEL_DBSNP]
		p26[SNPSIFT_ANNOTATE_INDEL_COSMIC]
		p27[SNPEFF_INDEL]
		p28[SNPSIFT_DBNSFP_INDEL]
		p29[SNPEFF_ONEPERLINE_INDEL]
		p30[GATK_MERGEVCF_UNANNOTATED]
		o4([Filtered Unannotated VCF]):::output
		p31[GATK_MERGEVCF_ANNOTATED]
		o5([Filtered Annotated VCF]):::output
		p32[SNPSIFT_EXTRACTFIELDS]
		o6([Variant Table]):::output
		tmb1[TMB_SCORE]
		o7([Tumor Mutation\nBurden Score]):::output
		note233[[Somatic Variant Calling]]

		pn1 --> pn2

		p14 --> p15
		pn2 --> p15
		pn3 -.FFPE.-> p15
		p15 --> o3
		o3 --> p16
		o3 --> p18
		p16 --> p17
		p18 --> p19
		p17 --> p20
		p20 --> p21
		p21 --> p22
		p22 --> p23
		p23 --> p24
		p19 --> p25
		p25 --> p26
		p26 --> p27 
		p27 --> p28
		p28 --> p29
		p17 --> p30
		p19 --> p30
		p30 --> o4
		p24 --> p31
		p29 --> p31
		p31 --> o5
		o4 --> p32
		p32 --> o6

		o4 --> tmb1
    	tmb1 --> o7

    end
    
    n01 --> an3

    subgraph ancestry [ ]
		an3[BCFTOOLS_MPILEUP]
		an4[BCFTOOLS_CALL]
		an5[BCFTOOLS_FILTER]
		an6[BCFTOOLS_ANNOTATE]
		an7[VCF2EIGENSTRAT]
		an8[SNPWEIGHTS_INFERANC]
		
		po1([Genetic Ancestry Estimation]):::output
		note1233[[Genetic Ancestry]]

		an3 --> an4
		an4 --> an5
		an5 --> an6
		an6 --> an7
		an7 --> an8
		an8 --> po1

		po1 ~~~ note1233
    end

    n01 --> cnv1
    t01 --> cnv2

    subgraph cnv [ ]
		cnv2[GATK_PRINTREADS_TUMOR]
		cnv1[GATK_PRINTREADS_NORMAL]
		cnv3[SAMTOOLS_MPILEUP_NORMAL]
		cnv4[SAMTOOLS_MPILEUP_TUMOR]
		cnv5[SEQUENZA_PILEUP2SEQZ]
		cnv6[SCARHRD]
		cnv7[SEQUENZA_RUN]
		cnv8[SEQUENZA_FILTER_AND_ANNOTATE]

		cnvo1([HRD Status]):::output
		cnvo2([CNV Calls with Gene Annotations]):::output
		note1212[[Copy Number Variation\nCalling]]
		cnv1 --> cnv3
		cnv2 --> cnv4
		cnv3 --> cnv5
		cnv4 --> cnv5
		cnv5 --> cnv6
		cnv6 --> cnvo1
		cnv5 --> cnv7
		cnv7 --> cnv8
		cnv8 --> cnvo2
    end

    o6 ~~~ note42

    subgraph qc [  ]
        temp0((Fastq Files\nFrom Above))
        temp1((BAMs\nFrom Above))
        temp2((Logs from:\nFASTP\nBWA\nMarkDuplicates))
        qc03[FASTQC]
        qc13[PICARD_COLLECTHSMETRICS]
        qc142[MULTIQC]
        o15([MultiQC Report]):::output
        note42[[For clarity\nQC steps not connected to main graph]]
        temp0 --> qc03
        temp1 --> qc13
        temp2 --> qc142
        qc03 --> qc142
        qc13 --> qc142
        qc142 --> o15
        %% n02 --> p03
        %% t02 --> p03
        %% to1 --> p13
        %% to1 --> p14
        %% no1 --> p13
        %% no1 --> p14
    end

	subgraph msi [  ]
		t01 -->  p56[MSISENSOR2_MSI] -->  msi_output([MSI Status]):::output ~~~ note122
		note122[[Microsatellite Instability \nCalling]]
	end


classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000
style tumor stroke:#333,stroke-width:2px
style normal stroke:#333,stroke-width:2px
style somatic_variant stroke:#333,stroke-width:2px
style msi stroke:#333,stroke-width:2px
style qc stroke:#333,stroke-width:2px

Parameters for Somatic WES Pipeline

--pubdir
- Default: /<PATH>
- Comment: The directory where saved outputs will be stored.
--organize_by
- Default: sample
- Comment: How to organize the output folder structure. Options: sample or analysis.
--cacheDir
- Default: /projects/omics_share/meta/containers
- Comment: This is the directory that contains cached Singularity containers. JAX users should not change this parameter.
-w
- Default: /<PATH>
- Comment: The directory that all intermediary files and nextflow processes utilize. This directory can become quite large. This should be a location on /fastscratch or other directory with ample storage.
--sample_folder
- Default: /<PATH>
- Comment: The path to the folder that contains all the samples to be run by the pipeline. The files in this path can also be symbolic links.
--extension
- Default: .fastq.gz
- Comment: The expected extension for the input read files.
--pattern
- Default: "*_R{1,2}*"
- Comment: The expected R1 / R2 matching pattern. The default value will match reads with names like this READ_NAME_R1_MoreText.fastq.gz or READ_NAME_R1.fastq.gz
--read_type
- Default: PE
- Comment: Options: PE and SE. Default: PE. Type of reads: paired end (PE) or single end (SE).
--concat_lanes
- Default: false
- Comment: Options: false and true. Default: false. If this boolean is specified, FASTQ files will be concatenated by sample. Used in cases where samples are divided across individual sequencing lanes.
--csv_input
- Default: null
- Comment: Provide a CSV manifest file with the header: "patient,sex,status,sampleID,lane,fastq_1,fastq_2". See below for an example file. The fastq_2 column is optional and used only with PE data. FASTQ entries can be either absolute paths to local files or URLs to remote files. If remote URLs are provided, --download_data can be specified.
--download_data
- Default: null
- Comment: Requires --csv_input. When specified, read data in the CSV manifest will be downloaded from the provided URLs with Aria2.
--gen_org
- Default: human
- Comment: Options: human.
--ref_fa
- Default: '/projects/omics_share/human/GRCh38/genome/sequence/gatk/Homo_sapiens_assembly38.fasta'
- Comment: The reference fasta to be used throughout the process for alignment as well as any downstream analysis. JAX users should not change this parameter.
--ref_fa_indices
- Default: '/projects/omics_share/human/GRCh38/genome/indices/gatk/bwa/Homo_sapiens_assembly38.fasta'
- Comment: Pre-compiled BWA index files. JAX users should not change this parameter.
--quality_phred
- Default: 15
- Comment: The quality value that is required for a base to pass. Default: 15 which is a phred quality score of >=Q15.
--unqualified_perc
- Default: 40
- Comment: Percent of bases that are allowed to be unqualified (0~100). Default: 40 which is 40%.
--detect_adapter_for_pe
- Default: false
- Comment: If true, adapter auto-detection is used for paired-end data. By default, adapter sequence auto-detection is disabled for paired-end data because the adapters can be trimmed by overlap analysis; --detect_adapter_for_pe enables it. Fastp runs a little slower when adapter sequences are specified or auto-detection is enabled, but the output is usually slightly cleaner, since overlap analysis may fail due to sequencing errors or adapter dimers.
--pdx
- Default: false
- Comment: Options: false, true. If specified, 'Xengsort' is run on reads to deconvolute human and mouse reads. Human only reads are used in analysis.
--xengsort_host_fasta
- Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/imputed/rel_2112_v8/NOD_ShiLtJ.39.fa'
- Comment: Xengsort host fasta file. Used by Xengsort Index when --pdx is run, and xengsort_idx_path is null or false.
--xengsort_idx_path
- Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/xengsort'
- Comment: Xengsort index for deconvolution of human and mouse reads. Used when --pdx is run. If null, Xengsort Index is run using ref_fa and host_fa.
--xengsort_idx_name
- Default: 'hg38_GRCm39-NOD_ShiLtJ'
- Comment: Xengsort index name associated with files located in xengsort_idx_path or name given to outputs produced by Xengsort Index.
--ffpe
- Default: false
- Comment: Options: false, true. If specified for FFPE derived samples, GATK LearnReadOrientationModel is run (per GATK best practices) and used as an additional filter of somatic calls.
--hg38_windows
- Default: /projects/compsci/omics_share//human/GRCh38/genome/annotation/intervals/hg38_chrom_sizes.window.1000000.bed
- Comment: GRCh38 broken into 1000000bp windows. This file is used in tumor mutation burden calculation.
--genotype_targets
- Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/ancestry_panel/snp_panel_v2_targets_annotations.snpwt.bed.gz'
- Comment: Target SNP bed file for the ancestry panel. Can contain annotation information.
--snpID_list
- Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/ancestry_panel/snp_panel_v2.list'
- Comment: Target SNPs in list used in BCFtools filtering step.
--snp_annotations
- Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/ancestry_panel/snp_panel_v2_targets_annotations.snpwt.bed.gz'
- Comment: Target SNP bed file with annotations for the ancestry panel.
--snpweights_panel
- Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/ancestry_panel/ancestry_panel_v2.snpwt'
- Comment: SNP weights panel in the appropriate format.
--target_gatk
- Default: '/projects/omics_share/human/GRCh38/supporting_files/capture_kit_files/agilent/v7/S31285117_MergedProbes_no_gene_names.bed'
- Comment: A bed file with WES target intervals as defined in the capture array used in the data. NOTE: This file MUST reflect the capture array used to generate your data.
--target_picard
- Default: '/projects/omics_share/human/GRCh38/supporting_files/capture_kit_files/agilent/v7/S31285117_MergedProbes_no_gene_names.picard.interval_list'
- Comment: A GATK interval file covering WES target intervals. Used in calculating coverage metrics. NOTE: This file MUST reflect the capture array used to generate your data.
--bait_picard
- Default: '/projects/omics_share/human/GRCh38/supporting_files/capture_kit_files/agilent/v7/S31285117_MergedProbes_no_gene_names.picard.interval_list'
- Comment: A GATK interval file covering WES target intervals. Used in calculating coverage metrics. This file can be the same as the target interval file. NOTE: This file MUST reflect the capture array used to generate your data.
--mismatch_penalty
- Default: -B 8
- Comment: The BWA penalty for a mismatch.
--gnomad_ref
- Default: '/projects/compsci/omics_share/human/GRCh38/genome/annotation/snps_indels/af-only-gnomad.hg38.vcf.gz'
- Comment: GnomAD germline reference from GATK resource pack.
--pon_ref
- Default: '/projects/compsci/omics_share/human/GRCh38/genome/annotation/snps_indels/1000g_pon.hg38.vcf.gz'
- Comment: 1000 genome germline panel of normals from GATK resource pack.
--genotype_pon
- Default: true
- Comment: Call sites in the PoN even though they will ultimately be filtered.
--genotype_germline
- Default: true
- Comment: Call all apparent germline sites even though they will ultimately be filtered.
--contam_ref
- Default: '/projects/compsci/omics_share/human/GRCh38/genome/annotation/snps_indels/small_exac_common_3.hg38.vcf.gz'
- Comment: File used in GetPileupSummaries and CalculateContamination. A common germline variant sites VCF (e.g., derived from the gnomAD resource) with population allele frequencies (AF) in the INFO field, taken from the GATK resource bundle.
--sequenza_gc
- Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/sequenza/Homo_sapiens_assembly38_gc50.wig.gz'
- Comment: GC content windows used by Sequenza in CNV analysis. Generated from sequenza-utils gc_wiggle
--ensembl_database
- Default: '/projects/compsci/omics_share/human/GRCh38/transcriptome/annotation/ensembl/v110/GRCm39_ensemblv110_annotDB.txt'
- Comment: Ensembl gene annotation file, used to annotate CNV results from Sequenza.
--dbSNP
- Default: '/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/dbsnp_151.vcf.gz'
- Comment: The dbSNP database contains known single nucleotide polymorphisms, and is used in the annotation of known variants. JAX users should not change this parameter.
--dbSNP_index
- Default: '/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/dbsnp_151.vcf.gz.tbi'
- Comment: dbSNP index file.
--gen_ver
- Default: "hg38"
- Comment: snpEff genome version.
--snpEff_config
- Default: '/projects/omics_share/human/GRCh38/genome/indices/snpEff_5_1/snpEff.config'
- Comment: The configuration file used while running snpEff. JAX users should not change this parameter.
--gold_std_indels
- Default: '/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz'
- Comment: Used in GATK BaseRecalibrator. JAX users should not change this parameter.
--phase1_1000G
- Default: '/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/1000G_phase1.snps.high_confidence.hg38.vcf.gz'
- Comment: Used in GATK BaseRecalibrator. JAX users should not change this parameter.
--dbNSFP
- Default: '/projects/omics_share/human/GRCh38/genome/annotation/function/dbNSFP4.2a.gatk_formatted.txt.gz'
- Comment: Used in variant annotation.
--cosmic
- Default: '/projects/omics_share/human/GRCh38/genome/annotation/function/COSMICv95_Coding_Noncoding.gatk_formatted.vcf'
- Comment: COSMIC annotations.
--cosmic_index
- Default: '/projects/omics_share/human/GRCh38/genome/annotation/function/COSMICv95_Coding_Noncoding.gatk_formatted.vcf.gz.tbi'
- Comment: COSMIC annotation index file.
--msisensor_model
- Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/msisensor2/models_hg38'
- Comment: MSIsensor2 model files
--multiqc_config
- Default: ${projectDir}/bin/shared/multiqc/somatic_wes_multiqc.yaml
- Comment: The path to the configuration file used by MultiQC
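
To make the interaction of --quality_phred and --unqualified_perc concrete, here is a minimal Python sketch of the per-read rule those two parameters describe. It illustrates the rule as stated above and is not fastp's implementation; the function name `read_passes` and the Phred+33 quality encoding are assumptions made for the example.

```python
def read_passes(quality_string, quality_phred=15, unqualified_perc=40):
    """Return True if a read survives the quality filter (illustrative only)."""
    if not quality_string:
        # Assumption for this sketch: an empty read never passes.
        return False
    # Decode ASCII qualities, assuming Phred+33 encoding.
    scores = [ord(ch) - 33 for ch in quality_string]
    # A base is "unqualified" when its quality is below the threshold.
    unqualified = sum(1 for q in scores if q < quality_phred)
    percent_unqualified = 100.0 * unqualified / len(scores)
    # The read passes while the unqualified share stays within the limit.
    return percent_unqualified <= unqualified_perc


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
print(read_passes("IIIIIIIIII"))   # no low-quality bases
print(read_passes("#####IIIII"))   # 50% of bases below Q15
```

With the defaults, the first read is kept and the second is discarded because half of its bases fall below Q15.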

CSV Input Sample Sheet

The required input header is: patient,sex,status,sampleID,lane,fastq_1,fastq_2. Samples can be provided either paired or un-paired.

The patient column defines how samples are paired. All combinations of normal and tumor samples that share the same patient ID will be paired.
The sex column is used in CNV analysis. Options are XX, XY, or NA. If the sex is NA, the sample is run as 'XY' in the CNV analysis.
The status column defines whether each sample is normal (0) or tumor (1).
The sampleID column is a unique identifier for each individual sample, which is associated with other samples based on status and patient ID.
The lane column contains lane information for individual samples. If a single sample ID is provided with multiple lanes, the sequences from each lane will be concatenated prior to analysis.
The fastq_1 and fastq_2 columns must contain absolute paths to read 1 and read 2 from an Illumina paired-end sequencing run.
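
The column rules above can be checked before launching a run. The following is a small sketch of such a check, assuming local absolute paths as described above; the function `check_sample_sheet` is hypothetical and is not part of the pipeline.

```python
import csv
import io

REQUIRED_HEADER = ["patient", "sex", "status", "sampleID", "lane", "fastq_1", "fastq_2"]
VALID_SEX = {"XX", "XY", "NA"}
VALID_STATUS = {"0", "1"}  # 0 = normal, 1 = tumor


def check_sample_sheet(handle):
    """Return a list of human-readable problems found in a sample sheet."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != REQUIRED_HEADER:
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if row["sex"] not in VALID_SEX:
            problems.append(f"line {line_no}: sex must be XX, XY, or NA")
        if row["status"] not in VALID_STATUS:
            problems.append(f"line {line_no}: status must be 0 (normal) or 1 (tumor)")
        if not (row["fastq_1"] or "").startswith("/"):
            problems.append(f"line {line_no}: fastq_1 is not an absolute path")
    return problems


# Hypothetical sheet with two deliberate mistakes on the tumor row.
example = io.StringIO(
    "patient,sex,status,sampleID,lane,fastq_1,fastq_2\n"
    "PATIENT_42,XX,0,NORMAL_1,L1,/path/to/normal1_R1.fastq.gz,/path/to/normal1_R2.fastq.gz\n"
    "PATIENT_42,XX,tumor,TUMOR_1,L1,tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz\n"
)
for problem in check_sample_sheet(example):
    print(problem)
```

The check only covers the columns whose allowed values are stated above; it does not verify that the files exist.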

Basic examples:

An example paired analysis:

patient,sex,status,sampleID,lane,fastq_1,fastq_2
PATIENT_42,XX,0,NORMAL_1,L1,/path/to/normal1_R1.fastq.gz,/path/to/normal1_R2.fastq.gz
PATIENT_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz

In the example case above the following output directories will be generated:

PATIENT_42--NORMAL_1: Contains all NORMAL_1 sample specific files
PATIENT_42--TUMOR_1: Contains all TUMOR_1 specific files
PATIENT_42--TUMOR_1--NORMAL_1: Contains all TUMOR_1 by NORMAL_1 specific files

Additional information on outputs is provided below.

An example paired analysis with multiple lanes:

patient,sex,status,sampleID,lane,fastq_1,fastq_2
PATIENT_42,XX,0,NORMAL_1,L1,/path/to/normal1_LANE1_R1.fastq.gz,/path/to/normal1_LANE1_R2.fastq.gz
PATIENT_42,XX,0,NORMAL_1,L2,/path/to/normal1_LANE2_R1.fastq.gz,/path/to/normal1_LANE2_R2.fastq.gz
PATIENT_42,XX,0,NORMAL_1,L3,/path/to/normal1_LANE3_R1.fastq.gz,/path/to/normal1_LANE3_R2.fastq.gz
PATIENT_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz

In the example case above the three lanes provided for the normal sample will be concatenated and the concatenated reads will be passed forward for analysis. Samples with a single lane will be passed forward for analysis. A mix of samples with multiple lanes, and single lanes can be provided.
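
As a rough illustration of the lane handling described above, the sketch below groups sample sheet rows by sampleID and reports which samples would need concatenation. The function `lanes_per_sample` and the file paths are hypothetical; the pipeline's own concatenation step is not shown here.

```python
from collections import defaultdict


def lanes_per_sample(rows):
    """Group (lane, fastq_1, fastq_2) tuples by sampleID, in sheet order."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["sampleID"]].append((row["lane"], row["fastq_1"], row["fastq_2"]))
    return dict(grouped)


# Hypothetical rows: three lanes for the normal, one lane for the tumor.
rows = [
    {"sampleID": "NORMAL_1", "lane": "L1", "fastq_1": "/path/to/normal1_LANE1_R1.fastq.gz", "fastq_2": "/path/to/normal1_LANE1_R2.fastq.gz"},
    {"sampleID": "NORMAL_1", "lane": "L2", "fastq_1": "/path/to/normal1_LANE2_R1.fastq.gz", "fastq_2": "/path/to/normal1_LANE2_R2.fastq.gz"},
    {"sampleID": "NORMAL_1", "lane": "L3", "fastq_1": "/path/to/normal1_LANE3_R1.fastq.gz", "fastq_2": "/path/to/normal1_LANE3_R2.fastq.gz"},
    {"sampleID": "TUMOR_1", "lane": "L1", "fastq_1": "/path/to/tumor1_R1.fastq.gz", "fastq_2": "/path/to/tumor1_R2.fastq.gz"},
]

for sample_id, lanes in lanes_per_sample(rows).items():
    # Samples with more than one lane are concatenated; others pass through.
    action = "concatenate" if len(lanes) > 1 else "pass through"
    print(sample_id, len(lanes), action)
```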

The following output directories will be generated:

PATIENT_42--NORMAL_1: Contains all NORMAL_1 sample specific files
PATIENT_42--TUMOR_1: Contains all TUMOR_1 specific files
PATIENT_42--TUMOR_1--NORMAL_1: Contains all TUMOR_1 by NORMAL_1 specific files

Additional information on outputs is provided below.

Many samples for one patient:

The workflow supports one-to-many, many-to-one, and many-to-many pairing of normal and tumor samples.

An example one to many analysis:

patient,sex,status,sampleID,lane,fastq_1,fastq_2
PATIENT_42,XX,0,NORMAL_1,L1,/path/to/normal1_R1.fastq.gz,/path/to/normal1_R2.fastq.gz
PATIENT_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz
PATIENT_42,XX,1,TUMOR_2,L1,/path/to/tumor2_R1.fastq.gz,/path/to/tumor2_R2.fastq.gz
PATIENT_42,XX,1,TUMOR_3,L1,/path/to/tumor3_R1.fastq.gz,/path/to/tumor3_R2.fastq.gz

In one-to-many, many-to-one, and many-to-many cases, all combinations of samples will be processed against one another.

In the example case above the following output directories will be generated:

PATIENT_42--NORMAL_1: Contains all NORMAL_1 sample specific files
PATIENT_42--TUMOR_1: Contains all TUMOR_1 specific files
PATIENT_42--TUMOR_2: Contains all TUMOR_2 specific files
PATIENT_42--TUMOR_3: Contains all TUMOR_3 specific files
PATIENT_42--TUMOR_1--NORMAL_1: Contains all TUMOR_1 by NORMAL_1 specific files
PATIENT_42--TUMOR_2--NORMAL_1: Contains all TUMOR_2 by NORMAL_1 specific files
PATIENT_42--TUMOR_3--NORMAL_1: Contains all TUMOR_3 by NORMAL_1 specific files
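
The pairing and directory naming shown above can be sketched in a few lines of Python. This is an illustration of the naming pattern only, assuming every tumor is paired with every normal that shares its patient ID; the function `output_directories` is hypothetical and is not pipeline code.

```python
from itertools import product


def output_directories(samples):
    """samples: iterable of (patient, status, sampleID); 0 = normal, 1 = tumor."""
    by_patient = {}
    for patient, status, sample_id in samples:
        group = by_patient.setdefault(patient, {"normal": [], "tumor": []})
        key = "tumor" if status == 1 else "normal"
        # Several lanes can share one sampleID, so record each ID once.
        if sample_id not in group[key]:
            group[key].append(sample_id)
    dirs = []
    for patient, group in by_patient.items():
        # One directory per individual sample.
        dirs += [f"{patient}--{s}" for s in group["normal"] + group["tumor"]]
        # One directory per tumor-by-normal combination.
        dirs += [f"{patient}--{t}--{n}" for t, n in product(group["tumor"], group["normal"])]
    return dirs


samples = [
    ("PATIENT_42", 0, "NORMAL_1"),
    ("PATIENT_42", 1, "TUMOR_1"),
    ("PATIENT_42", 1, "TUMOR_2"),
    ("PATIENT_42", 1, "TUMOR_3"),
]
for name in output_directories(samples):
    print(name)
```

For the one-to-many example, this prints the same seven directory names listed above.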

Pipeline Default Outputs

NOTE: * Represents a wild card that is a placeholder for values that will be filled by input file names and/or parameters when the pipeline is run.

NOTE: All files contained in 'stats' directories are captured by MultiQC reports.

The pipeline writes several output directories, each holding the files that apply to an individual sample or to a tumor-normal pair.

Following the example naming in the csv section above for "an example paired analysis":

Normal specific results:

`PATIENT_42--NORMAL_1: Contains all NORMAL_1 sample specific files`

Naming Convention	Description
`*ancestry.tsv`	Genetic ancestry report. See https://www.biorxiv.org/content/10.1101/2022.10.24.513591v1 for details on report and methods.
`bam/*_realigned_BQSR.bam`	Final duplicate marked, BQSR realigned bam file used in calling.
`bam/*_realigned_BQSR.bai`	Bam index file.
`stats/*fastp_report.html`	FASTP trimming report
`stats/*_stat`	BWA alignment metrics.
`stats/*_AlignmentMetrics.txt`	GATK Alignment metrics.
`stats/*_CollectWgsMetrics.txt`	GATK collect WGS metrics output.
`stats/*_recal_data.table`	GATK Baserecalibration table.
`stats/*_dup_metrics.txt`	Picard mark duplicates output.
`stats/*_insert_size.txt`	Estimated library insert size.
`stats/*_R1.fastq.gz_filtered_trimmed_fastqc.html`	FastQC report.
`stats/*_R2.fastq.gz_filtered_trimmed_fastqc.html`	FastQC report.
`stats/*R1.fastq.gz_filtered_trimmed_fastqc.zip`	FastQC report.
`stats/*R2.fastq.gz_filtered_trimmed_fastqc.zip`	FastQC report.

Tumor specific results:

`PATIENT_42--TUMOR_1: Contains all TUMOR_1 specific files`

Naming Convention	Description
`bam/*_realigned_BQSR.bam`	Final duplicate marked, BQSR realigned bam file used in calling.
`bam/*_realigned_BQSR.bai`	Bam index file.
`stats/*fastp_report.html`	FASTP trimming report
`stats/*xengsort_log.txt`	Xengsort metrics (present only when `--pdx` is used)
`stats/*_stat`	BWA alignment metrics.
`stats/*_AlignmentMetrics.txt`	GATK Alignment metrics.
`stats/*_CollectWgsMetrics.txt`	GATK collect WGS metrics output.
`stats/*_recal_data.table`	GATK Baserecalibration table.
`stats/*_dup_metrics.txt`	Picard mark duplicates output.
`stats/*_insert_size.txt`	Estimated library insert size.
`stats/*_R1.fastq.gz_filtered_trimmed_fastqc.html`	FastQC report.
`stats/*_R2.fastq.gz_filtered_trimmed_fastqc.html`	FastQC report.
`stats/*R1.fastq.gz_filtered_trimmed_fastqc.zip`	FastQC report.
`stats/*R2.fastq.gz_filtered_trimmed_fastqc.zip`	FastQC report.
`msi/*msisensor`	MSI Status. "The recommended msi score cutoff value is 20% (msi high: msi score >= 20%)"

Paired sample results:

`PATIENT_42--TUMOR_1--NORMAL_1: Contains all TUMOR_1 by NORMAL_1 specific files`

Naming Convention	Description
`*_mutect2_somatic.filtered.vcf.gz`	VCF from GATK_MUTECT2 and GATK_FILTERMUTECTCALLS
`*_INDEL_filtered_dbsnpID.vcf`	Filtered unannotated INDELs only
`*_SNP_filtered_dbsnpID.vcf`	Filtered unannotated SNPs only
`*_SNP_INDEL_filtered_unannotated_final.vcf`	Final VCF file, with filtered unannotated INDEL and SNP calls
`*_SNP_INDEL_filtered_annotated_final.vcf`	Final VCF file, with filtered SNPeff annotated INDEL and SNP calls. See SNPEff notes below
`*_snpsift_finalTable.txt`	Extracted fields from final VCF, in tabular format. From SNPSIFT_EXTRACTFIELDS
`*_HRD_score.txt`	Homologous recombination deficiency score from scarHRD. The scarHRD authors suggest a common threshold for HRD positive sample as `HRD-sum > 55`. See the scarHRD manual for more information.
`*_somatic_TMB_Score.txt`	Tumor mutation burden score. High TMB is defined as 22 mutations/Mb. See note below on computation.
`/sequenza_cnv/*_segments.txt`	Table listing the detected segments, with estimated copy number states at each segment. See notes at https://bitbucket.org/sequenzatools/sequenza/src/master/ for table header information.
`/sequenza_cnv/*_segments.enstranscript_cnvbreak.txt`	Raw segments annotated with Ensembl transcript information.
`/sequenza_cnv/*_segments.ensgene_cnvbreak.txt`	Raw segments annotated with Ensembl gene information.
`/sequenza_cnv/*_segments_naWindowFiltered.txt`	Segments with `NA` windows filtered.
`/sequenza_cnv/*_segments_naWindowFiltered.enstranscript_cnvbreak.txt`	Segments with `NA` windows filtered annotated with Ensembl transcript information.
`/sequenza_cnv/*_segments_naWindowFiltered.ensgene_cnvbreak.txt`	Segments with `NA` windows filtered annotated with Ensembl gene information.
`/sequenza_cnv/pdfs/*alternative_fit.pdf`	Alternative solution fits to the segments. One solution per slide
`/sequenza_cnv/pdfs/*chromosome_depths.pdf`	Visualization of sequencing coverage in the normal and in the tumor samples, before and after normalization
`/sequenza_cnv/pdfs/*chromosome_view.pdf`	Visualization per chromosome of depth.ratio, B-allele frequency and mutations, using the selected or estimated solution. One chromosome per slide
`/sequenza_cnv/pdfs/*CN_bars.pdf`	Bar plot representing the percentage of genome in the detected copy number states
`/sequenza_cnv/pdfs/*CP_contours.pdf`	Visualization of the likelihood density for each pair of cellularity/ploidy solution. The local maximum-likelihood points and confidence interval of the best estimate are also visualized
`/sequenza_cnv/pdfs/*gc_plots.pdf`	Visualization of the GC correction in the normal and in the tumor sample
`/sequenza_cnv/pdfs/*genome_view.pdf`	Genome-wide visualization of the allele-specific and absolute copy number results, and raw profile of the depth ratio and allele frequency
`/sequenza_cnv/pdfs/*model_fit.pdf`	Summary plot of model fit
`/sequenza_cnv/Rdata/*sequenza_cp_table.RData`	RData object dump of the maxima a posteriori computation
`/sequenza_cnv/Rdata/*sequenza_extract.RData`	RData object dump of all the sample information
`/sequenza_cnv/txt/*_alternative_solutions.txt`	List of all alternative ploidy/cellularity solutions
`/sequenza_cnv/txt/*_confints_CP.txt`	Table of the confidence interval of the best solution from the model
`/sequenza_cnv/txt/*_mutations.txt`	Table with mutation and estimated number of mutated alleles (Mt)
`/sequenza_cnv/txt/*_segments.txt`	Table listing the detected segments, with estimated copy number state at each segment
`/sequenza_cnv/txt/*_sequenza_log.txt`	Log with Sequenza version and time information
`/sequenza_cnv/txt/*_sequenza_ploidy.txt`	Estimated sample ploidy from Sequenza
`/sequenza_cnv/txt/*_sequenza_purity.txt`	Estimated sample purity from Sequenza
`/stats/somatic.filtered.vcf.gz.filteringStats.tsv`	QC metrics from FilterMutectCalls

Additional result output:

Naming Convention	Description
`pta_report.html`	Nextflow autogenerated report.
`trace`	Nextflow autogenerated trace report for resource usage in tabular text format.
`multiqc`	MultiQC report summarizing quality metrics across samples in the analysis run.

NOTE: In the final VCF file *SNP_INDEL_filtered_annotated_final.vcf, the number of variants will not match the unannotated variant count (e.g., in *SNP_INDEL_filtered_unannotated_final.vcf). This difference is a function of SNPeff annotation.

From the SNPeff documentation:

Counting variants / annotations

It is important to remember that the VCF format specification allows having multiple variants in a single line. Also, a single variant can have more than one annotation, due to:
  * Multiple transcripts (isoforms) of a gene (e.g. the human genome has on average 8.8 transcripts per gene)
  * Multiple (overlapping) genes in the genomic location of the variant.
  * A variant spanning multiple genes (e.g. a translocation, large deletion, etc.)
When you count the number of variants, you must keep all these in mind to count them properly. Obviously, SnpEff does take all this into account when counting the variants for the summary HTML.

Typical counting mistake

Many people who claim that there is a mismatch between the number of variants in the summary (HTML) file and the number of variants in the VCF file, are just making mistakes when counting the variants because they forget one or more of these previous items.

A typical scenario is, for example, that people are "counting missense variants" using something like this:

grep missense file.vcf | wc -l

This is counting "lines in a VCF file that have at least one missense variants", as opposed to counting "missense annotations" and, as mentioned previously, the number of lines in a VCF file is not the same as the number of annotations or the number of variants.
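
To make the distinction concrete, the sketch below counts both quantities from SnpEff-annotated VCF lines. It assumes annotations are stored in the standard `ANN` INFO field, where entries are comma-separated and the second pipe-separated subfield holds the effect. The function `count_missense` and the abbreviated records are made up for illustration.

```python
def count_missense(vcf_lines):
    """Return (lines_with_missense, missense_annotations)."""
    lines_with_missense = 0
    missense_annotations = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        ann = ""
        for item in fields[7].split(";"):
            if item.startswith("ANN="):
                ann = item[len("ANN="):]
        hits = 0
        for entry in ann.split(","):
            parts = entry.split("|")
            # Combined effects are joined with '&' in the effect subfield.
            if len(parts) > 1 and "missense_variant" in parts[1].split("&"):
                hits += 1
        missense_annotations += hits
        if hits:
            lines_with_missense += 1
    return lines_with_missense, missense_annotations


# Abbreviated, hypothetical records: one variant annotated on two transcripts.
records = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tG\tA\t.\tPASS\tANN=A|missense_variant|MODERATE|GENE1|tx1,A|missense_variant|MODERATE|GENE1|tx2",
    "chr1\t200\t.\tC\tT\t.\tPASS\tANN=T|synonymous_variant|LOW|GENE2|tx3",
]
print(count_missense(records))  # (1, 2)
```

Here a single VCF line carries two missense annotations, which is why line counts and annotation counts diverge.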

Pipeline Optional Outputs

Additional outputs will only be saved when --keep_intermediate true is specified. However, this option should generally not be used.

Tumor mutation burden (TMB) estimation:

TMB was calculated using variants that

(i) met all quality criteria (coverage, mapping quality etc.),
(ii) are likely somatic mutations, and
(iii) have a high or moderate functional impact (i.e., non-synonymous changes, frame-shifts, stop losses/gains, and splice-site acceptor/donor changes).

TMB was estimated by dividing the number of variants that met the criteria listed above by the length (in Mb) of the target panel defined by the BED file parameter --target_gatk.
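
The division described above can be sketched as follows. This is a simplified illustration rather than the pipeline's TMB script: it assumes the BED intervals do not overlap, and the function names and example intervals are hypothetical.

```python
def target_size_mb(bed_lines):
    """Sum BED interval lengths and return the total in megabases."""
    total_bp = 0
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and BED header lines
        _chrom, start, end = line.split()[:3]
        # BED coordinates are 0-based and half-open, so length = end - start.
        total_bp += int(end) - int(start)
    return total_bp / 1_000_000


def tmb_score(n_qualifying_variants, bed_lines):
    """Mutations per Mb over the target panel."""
    return n_qualifying_variants / target_size_mb(bed_lines)


# Hypothetical 2 Mb panel with 50 qualifying variants.
bed = ["chr1\t0\t1500000", "chr2\t100\t500100"]
print(tmb_score(50, bed))  # 25.0
```

If target intervals overlap, they would need to be merged first so that no base is counted twice.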

We defined high TMB as 22 mutations/Mb, which was calculated based on the TMB distribution of all Jackson Laboratory PDX models analyzed as follows: Q3 (third quartile of TMB) + 1.5 x inter-quartile range of TMB.
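
The threshold rule above (Q3 + 1.5 x IQR) can be written as a short function. The sketch uses Python's `statistics.quantiles` with the "inclusive" method; quartile definitions differ between tools, so the method is an assumption here, and the input values are made up rather than taken from the PDX cohort.

```python
import statistics


def high_tmb_threshold(tmb_values):
    """Upper outlier fence: Q3 + 1.5 * (Q3 - Q1)."""
    q1, _median, q3 = statistics.quantiles(tmb_values, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


# Made-up TMB values, for illustration only.
print(high_tmb_threshold([1, 2, 3, 4, 5]))  # 7.0
```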
