-
Notifications
You must be signed in to change notification settings - Fork 10
Somatic WES Pipeline ReadMe
• Step 1: FastP read and adapter trimming
• Step 2: Get Read Group Information
• Step 3 (optional, run for PDX): Xengsort human / mouse read disambiguation
• Step 4: BWA-MEM Alignment
• Step 5: Variant Preprocessing - Part 1 (Picard sortsam/mark duplicates)
• Step 6: Variant Pre-Processing – Part 2 (GATK Base Recalibrator Apply BQSR)
• Step 7: Sample contamination analysis & if FFPE: read orientation modeling
• Step 8: Variant Pre-Processing – Part 2 (GATK Base Recalibrator Apply BQSR)
• Step 9: Microsatellite Instability analysis with MSIsensor2
• Step 10: Variant Calling (GATK Mutect2)
• Step 11: Variant Filtration (GATK FilterMutectCalls)
• Step 12: Post Variant Calling Annotation - Part 1 (Cosmic, SnpEff, SnpSift)
• Step 13: Tumor mutation burden calling
• Step 14: Picard Collect HS Metrics
• Step 15: MultiQC report generation
flowchart TD
p1((Sample))
p2[FASTP]
p3[FASTQC]
p4[XENGSORT_CLASSIFY]
p6[BWA_MEM]
o1([Genomic BAM]):::output
p7[PICARD_SORTSAM]
p8[PICARD_MARKDUPLICATES]
p9[GATK_BASERECALIBRATOR]
p10[GATK_APPLYBQSR]
pn1[GATK_GETPILEUPSUMMARIES]
pn2[GATK_CALCULATECONTAMINATION]
pn3[GATK_LEARNREADORIENTATIONMODEL]
p11[PICARD_COLLECTHSMETRICS]
p12[MSISENSOR2_MSI]
o2([MSI Status]):::output
p14[GATK_MUTECT2]
p15[GATK_FILTERMUECTCALLS]
o3([Raw Variant Calls]):::output
p16[GATK_SELECTVARIANTS_SNP]
p17[GATK_VARIANTFILTRATION_SNP]
p18[GATK_SELECTVARIANTS_INDEL]
p19[GATK_VARIANTFILTRATION_INDEL]
p20[SNPSIFT_ANNOTATE_SNP_DBSNP]
p21[SNPSIFT_ANNOTATE_SNP_COSMIC]
p22[SNPEFF_SNP]
p23[SNPSIFT_DBNSFP_SNP]
p24[SNPEFF_ONEPERLINE_SNP]
p25[SNPSIFT_ANNOTATE_INDEL_DBSNP]
p26[SNPSIFT_ANNOTATE_INDEL_COSMIC]
p27[SNPEFF_INDEL]
p28[SNPSIFT_DBNSFP_INDEL]
p29[SNPEFF_ONEPERLINE_INDEL]
p30[GATK_MERGEVCF_UNANNOTATED]
o4([Filtered Unannoated VCF]):::output
p31[GATK_MERGEVCF_ANNOTATED]
o5([Filtered Annotated VCF]):::output
p32[SNPSIFT_EXTRACTFIELDS]
o6([Variant Table]):::output
tmb1[TMB_SCORE]
o8([Tumor Mutation\nBurden Score]):::output
p33[MULTIQC]
o7([MultiQC Report]):::output
p1 --> |Raw Reads| p2
subgraph alignment [ ]
p2 -.PDX.-> p4
p2 --Non PDX--> p6
%% p2 --> p5
p4 -..-> |PDX:\nHuman Reads| p6
%% p5 --> p6
p6 --> p7
p7 --> p8
p8 --> p9
p9 --> p10
p10 --> o1
end
subgraph msi [ ]
o1 --> p12
p12 --> o2
end
subgraph calling [ ]
o1 --> p14
o1 --> pn1
pn1 --> pn2
o1 -.FFPE.-> pn3
p14 --> p15
pn2 --> p15
pn3 -.FFPE.-> p15
p15 --> o3
o3 --> p16
o3 --> p18
p16 --> p17
p18 --> p19
p17 --> p20
p20 --> p21
p21 --> p22
p22 --> p23
p23 --> p24
p19 --> p25
p25 --> p26
p26 --> p27
p27 --> p28
p28 --> p29
p17 --> p30
p19 --> p30
p30 --> o4
p24 --> p31
p29 --> p31
p31 --> o5
o4 --> p32
p32 --> o6
o4 --> tmb1
tmb1 --> o8
end
subgraph qc [ ]
p2 --> p3
o1 --> p11
p2 --> p33
p3 --> p33
p4 -.PDX.-> p33
p8 --> p33
p9 --> p33
p11 --> p33
p15 --> p33
p33 --> o7
end
an3[BCFTOOLS_MPILEUP]
an4[BCFTOOLS_CALL]
an5[BCFTOOLS_FILTER]
an6[BCFTOOLS_ANNOTATE]
an7[VCF2EIGENSTRAT]
an8[SNPWEIGHTS_INFERANC]
ano1([Genetic Ancestry Estimation]):::output
subgraph ancestry [ ]
o1 --> an3
an3 --> an4
an4 --> an5
an5 --> an6
an6 --> an7
an7 --> an8
an8 --> ano1
end
classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000
style alignment stroke:#333,stroke-width:2px
style calling stroke:#333,stroke-width:2px
style ancestry stroke:#333,stroke-width:2px
style qc stroke:#333,stroke-width:2px
style msi stroke:#333,stroke-width:2px
-
--pubdir
- Default:
/<PATH>
- Comment: The directory that the saved outputs will be stored.
- Default:
-
--organize_by
- Default:
sample
- Comment: How to organize the output folder structure. Options: sample or analysis.
- Default:
-
--cacheDir
- Default:
/projects/omics_share/meta/containers
- Comment: This is directory that contains cached Singularity containers. JAX users should not change this parameter.
- Default:
-
-w
- Default:
/<PATH>
- Comment: The directory that all intermediary files and nextflow processes utilize. This directory can become quite large. This should be a location on /fastscratch or other directory with ample storage.
- Default:
-
--sample_folder
- Default:
/<PATH>
- Comment: The path to the folder that contains all the samples to be run by the pipeline. The files in this path can also be symbolic links.
- Default:
-
--extension
- Default:
.fastq.gz
- Comment: The expected extension for the input read files.
- Default:
-
--pattern
- Default:
"*_R{1,2}*"
- Comment: The expected R1 / R2 matching pattern. The default value will match reads with names like this
READ_NAME_R1_MoreText.fastq.gz
orREAD_NAME_R1.fastq.gz
- Default:
-
--read_type
- Default:
PE
- Comment: Options:
PE
andSE
. Default:PE
. Type of reads: paired end (PE) or single end (SE).
- Default:
-
--concat_lanes
- Default:
false
- Comment: Options:
false
andtrue
. Default:false
. If this boolean is specified, FASTQ files will be concatenated by sample. Used in cases where samples are divided across individual sequencing lanes.
- Default:
-
--csv_input
- Default: null
- Comment: Provide a CSV manifest file with the header: "sampleID,lane,fastq_1,fastq_2". See below for an example file. Fastq_2 is optional and used only in PE data. Fastq files can either be absolute paths to local files, or URLs to remote files. If remote URLs are provided, *
--download_data
can be specified.
-
--download_data
- Default: null
- Comment: Requires *
--csv_input
. When specified, read data in the CSV manifest will be downloaded from provided URLs with Aria2.
-
--gen_org
- Default:
human
- Comment: Options:
human
.
- Default:
-
--ref_fa
- Default:
'/projects/omics_share/human/GRCh38/genome/sequence/gatk/Homo_sapiens_assembly38.fasta'
- Comment: The reference fasta to be used throughout the process for alignment as well as any downstream analysis. JAX users should not change this parameter.
- Default:
-
--ref_fa_indices
- Default:
'/projects/omics_share/human/GRCh38/genome/indices/gatk/bwa/Homo_sapiens_assembly38.fasta'
- Comment: Pre-compiled BWA index files. JAX users should not change this parameter.
- Default:
-
--quality_phred
- Default:
15
- Comment: The quality value that is required for a base to pass. Default: 15 which is a phred quality score of >=Q15.
- Default:
-
--unqualified_perc
- Default:
40
- Comment: Percent of bases that are allowed to be unqualified (0~100). Default: 40 which is 40%.
- Default:
-
--detect_adapter_for_pe
- Default:
false
- Comment: If true, adapter auto-detection is used for paired end data. By default, paired-end data adapter sequence auto-detection is disabled as the adapters can be trimmed by overlap analysis. However, --detect_adapter_for_pe will enable it. Fastp will run a little slower if you specify the sequence adapters or enable adapter auto-detection, but usually result in a slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
- Default:
-
--pdx
- Default:
false
- Comment: Options: false, true. If specified, 'Xengsort' is run on reads to deconvolute human and mouse reads. Human only reads are used in analysis.
- Default:
-
--xengsort_host_fasta
- Default:
'/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/imputed/rel_2112_v8/NOD_ShiLtJ.39.fa'
- Comment: Xengsort host fasta file. Used by Xengsort Index when
--pdx
is run, and xengsort_idx_path isnull
or false.
- Default:
-
--xengsort_idx_path
- Default:
'/projects/compsci/omics_share/human/GRCh38/supporting_files/xengsort'
- Comment: Xengsort index for deconvolution of human and mouse reads. Used when
--pdx
is run. Ifnull
, Xengsort Index is run using ref_fa and host_fa.
- Default:
-
--xengsort_idx_name
- Default:
'hg38_GRCm39-NOD_ShiLtJ'
- Comment: Xengsort index name associated with files located in
xengsort_idx_path
or name given to outputs produced by Xengsort Index.
- Default:
-
--ffpe
- Default:
false
- Comment: Options: false, true. If specified for FFPE derived samples, GATK LearnReadOrientationModel is run (per GATK best practices) and used as an additional filter of somatic calls.
- Default:
-
--hg38_windows
- Default:
/projects/compsci/omics_share//human/GRCh38/genome/annotation/intervals/hg38_chrom_sizes.window.1000000.bed
- Comment: GRCh38 broken into 1000000bp windows. This file is used in tumor mutation burden calculation.
- Default:
-
--genotype_targets
- Default:
'/projects/compsci/omics_share/human/GRCh38/supporting_files/ancestry_panel/snp_panel_v2_targets_annotations.snpwt.bed.gz'
- Comment: Target SNP bed file for the ancestry panel. Can contain annotation information.
- Default:
-
--snpID_list
- Default:
'/projects/compsci/omics_share/human/GRCh38/supporting_files/ancestry_panel/snp_panel_v2.list'
- Comment: Target SNPs in list used in BCFtools filtering step.
- Default:
-
--snp_annotations
- Default:
'/projects/compsci/omics_share/human/GRCh38/supporting_files/ancestry_panel/snp_panel_v2_targets_annotations.snpwt.bed.gz'
- Comment: Target SNP bed file with annotations for the ancestry panel.
- Default:
-
--snpweights_panel
- Default:
'/projects/compsci/omics_share/human/GRCh38/supporting_files/ancestry_panel/ancestry_panel_v2.snpwt'
- Comment: SNP weights panel in the appropriate format.
- Default:
-
--target_gatk
- Default:
'/projects/omics_share/human/GRCh38/supporting_files/capture_kit_files/agilent/v7/S31285117_MergedProbes_no_gene_names.bed'
- Comment: A bed file with WES target intervals as defined in the capture array used in the data. NOTE: This file MUST reflect the capture array used to generate your data.
- Default:
-
--target_picard
- Default:
'/projects/omics_share/human/GRCh38/supporting_files/capture_kit_files/agilent/v7/S31285117_MergedProbes_no_gene_names.picard.interval_list'
- Comment: A GATK interval file covering WES target intervals. Used in calculating coverage metrics. NOTE: This file MUST reflect the capture array used to generate your data.
- Default:
-
--bait_picard
- Default:
'/projects/omics_share/human/GRCh38/supporting_files/capture_kit_files/agilent/v7/S31285117_MergedProbes_no_gene_names.picard.interval_list'
- Comment: A GATK interval file covering WES target intervals. Used in calculating coverage metrics. This file can be the same as the interval file, NOTE: This file MUST reflect the capture array used to generate your data.
- Default:
-
--mismatch_penalty
- Default:
-B 8
- Comment: The BWA penalty for a mismatch.
- Default:
-
--gnomad_ref
- Default:
'/projects/compsci/omics_share/human/GRCh38/genome/annotation/snps_indels/af-only-gnomad.hg38.vcf.gz'
- Comment: GnomAD germline reference from GATK resource pack.
- Default:
-
--pon_ref
- Default:
'/projects/compsci/omics_share/human/GRCh38/genome/annotation/snps_indels/1000g_pon.hg38.vcf.gz'
- Comment: 1000 genome germline panel of normals from GATK resource pack.
- Default:
-
--genotype_pon
- Default:
true
- Comment: Call sites in the PoN even though they will ultimately be filtered.
- Default:
-
--genotype_germline
- Default:
true
- Comment: Call all apparent germline site even though they will ultimately be filtered.
- Default:
-
--contam_ref
- Default:
'/projects/compsci/omics_share/human/GRCh38/genome/annotation/snps_indels/small_exac_common_3.hg38.vcf.gz'
- Comment: File used in GetPileupSummaries and CalculateContaminationcommon. A germline variant sites VCF, e.g. derived from the gnomAD resource, with population allele frequencies (AF) in the INFO field is used from GATK resource bundle.
- Default:
-
--dbSNP
- Default:
'/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/dbsnp_151.vcf.gz'
- Comment: The dbSNP database contains known single nucleotide polymorphisms, and is used in the annotation of known variants. JAX users should not change this parameter.
- Default:
-
--dbSNP_index
- Default:
'/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/dbsnp_151.vcf.gz.tbi'
- Comment: dbDNP index file.
- Default:
-
--gen_ver
- Default:
"hg38"
- Comment: snpEff genome version.
- Default:
-
--snpEff_config
- Default:
'/projects/omics_share/human/GRCh38/genome/indices/snpEff_5_1/snpEff.config'
- Comment: The configuration file used while running snpEff. JAX users should not change this parameter.
- Default:
-
--gold_std_indels
- Default:
'/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz’
- Comment: Used in GATK BaseRecalibrator. JAX users should not change this parameter.
- Default:
-
--phase1_1000G
- Default:
'/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/1000G_phase1.snps.high_confidence.hg38.vcf.gz'
- Comment: Used in GATK BaseRecalibrator. JAX users should not change this parameter.
- Default:
-
--dbNSFP
- Default:
'/projects/omics_share/human/GRCh38/genome/annotation/function/dbNSFP4.2a.gatk_formatted.txt.gz'
- Comment: Used in variant annotation.
- Default:
-
--cosmic
- Default:
'/projects/omics_share/human/GRCh38/genome/annotation/function/COSMICv95_Coding_Noncoding.gatk_formatted.vcf'
- Comment: COSMIC annotations.
- Default:
-
--cosmic_index
- Default:
'/projects/omics_share/human/GRCh38/genome/annotation/function/COSMICv95_Coding_Noncoding.gatk_formatted.vcf.gz.tbi'
- Comment: COSMIC annotation index file.
- Default:
-
--msisensor_model
- Default:
'/projects/compsci/omics_share/human/GRCh38/supporting_files/msisensor2/models_hg38'
- Comment: MSIsensor2 model files
- Default:
-
--multiqc_config
- Default:
${projectDir}/bin/shared/multiqc/somatic_wes_multiqc.yaml
- Comment: The path to the configuration file used by MultiQC
- Default:
NOTE: *
Represents a wild card that is a placeholder for values that will be filled by input file names and/or parameters when the pipeline is run.
Naming Convention | Description |
---|---|
wes_report.html |
Nextflow autogenerated report |
trace.txt |
Nextflow trace of processes |
multiqc |
MultiQC report summarizing quality metrics across samples in the analysis run. |
*/bam/*_realigned_BQSR.bam |
Final BAM alignment file |
*/bam/*_realigned_BQSR.bai |
Final BAM index file |
*/*_mutect2_somatic.filtered.vcf.gz |
VCF from GATK_MUTECT2 |
*/*_INDEL_filtered_dbsnpID.vcf |
Filtered unannotated INDELs only |
*/*_SNP_filtered_dbsnpID.vcf |
Filtered unannotated SNPs only |
*/*SNP_INDEL_filtered_unannotated_final.vcf |
Final VCF file, with filtered unannotated INDEL and SNP calls |
*/*SNP_INDEL_filtered_annotated_final.vcf |
Final VCF file, with filtered SNPeff annotated INDEL and SNP calls. See SNPEff notes below |
*/*_somatic_TMB_Score.txt |
Tumor mutation burden score. High TMB is defined as 22 mutations/Mb. See note below on computation. |
*/*_snpsift_finalTable.txt |
Extracted fields from final VCF, in tabular format. From SNPSIFT_EXTRACTFIELDS |
*/*_somatic_TMB_Score.txt |
Tumor mutation burden score. High TMB is defined as 22 mutations/Mb. See note below on computation. |
*/*ancestry.tsv |
Genetic ancestry report. See https://www.biorxiv.org/content/10.1101/2022.10.24.513591v1 for details on report and methods |
msi/*msisensor |
MSI Status. "The recommended msi score cutoff value is 20% (msi high: msi score >= 20%)" |
*/stats/*xengsort_log.txt |
Xengsort statistics file, when --pdx is run. |
*/stats/*fastp_report.html |
FastP trimming metrics |
*/stats/*_dup_metrics.txt |
Duplicate metrics from PICARD_MARKDUPLICATES |
*/stats/*_recal_data.table |
Recalibrated table for BQSR from GATK_BASERECALIBRATOR |
*/stats/*_CoverageMetrics.txt |
Coverage metrics from PICARD_COLLECTHSMETRICS |
*/stats/*filteringStats.tsv |
QC metrics from FilterMutectCalls |
NOTE: In the final VCF file *SNP_INDEL_filtered_unannotated_final.vcf
, the number of variants, will not match un-annotated variant counts (e.g., *SNP_INDEL_filtered_annotated_final.vcf
). This difference in variant count is a function of SNPeff annotation.
From the SNPeff documentation:
It is important to remember that the VCF format specification allows having multiple variants in a single line. Also, a single variant can have more than one annotation, due to:
* Multiple transcripts (isoforms) of a gene (e.g. the human genome has on average 8.8 transcripts per gene) * Multiple (overlapping) genes in the genomic location of the variant. * A variant spanning multiple genes (e.g. a translocation, large deletion, etc.)
When you count the number of variants, you must keep all these in mind to count them properly. Obviously, SnpEff does take all this into account when counting the variants for the summary HTML.
Many people who claim that there is a mismatch between the number of variants in the summary (HTML) file and the number of variants in the VCF file, are just making mistakes when counting the variants because they forget one or more of these previous items.
A typical scenario is, for example, that people are "counting missense variants" using something like this:
grep missense file.vcf | wc -l
This is counting "lines in a VCF file that have at least one missense variants", as opposed to counting "missense annotations" and, as mentioned previously, the number of lines in a VCF file is not the same as the number of annotations or the number of variants.
Additional outputs will only be saved when --keep_intermediate true
is specified. However, this option should generally not be used.
TMB was calculated using variants that
(i) met all quality criteria (coverage, mapping quality etc.),
(ii) are likely somatic mutations, and
(iii) have a high or moderate functional impact (i.e., non-synonymous changes, frame-shifts, stop losses/gains, and splice-site acceptor/donor changes).
TMB was estimated by dividing the number of variants that met the criteria list above by the length (in Mb) of the target panel defined by the bed file parameter --target_gatk
.
We defined high TMB as 22 mutations/Mb, which was calculated based on the TMB distribution of all Jackson Laboratory PDX models analyzed as follows: Q3 (third quartile of TMB) + 1.5 x inter-quartile range of TMB.
The required input header is: sampleID,lane,fastq_1,fastq_2
. Samples can be provided either paired or un-paired.
- The
sampleID
column is a unique identifies for each individual sample, which is associated with other samples based on status and patient ID. - The
lane
column contains lane information for individual samples. If a single sample ID is provided with multiple lanes, the sequences from each lane will be concatenated prior to analysis. - The
fastq_1
andfastq_2
columns must contain absolute paths or URLs to read 1 and read 2 from an Illumina paired-end sequencing run.
sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz,/path/to/sample_101_001_R2.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz,/path/to/sample_10191_001_R2.fastq.gz
sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz