Skip to content

WGS Pipeline ReadMe

MikeWLloyd edited this page Sep 13, 2024 · 21 revisions

Whole Genome Sequencing (WGS) Documentation

Whole Genome Sequencing Pipeline (--workflow wgs)

•	Step 1: FastP read and adapter trimming  
•	Step 2: Get Read Group Information  
•	Step 3: BWA-MEM Alignment  
•	Step 4: Picard SortSam and Mark Duplicates  

If Mouse:

•	Step 5: Collect Alignment Summary Metrics  
•	Step 6: Create Chromosome Channel  
•	Step 7: GATK Haplotype Caller on Each Chromosome  
•	Step 7a: GATK Haplotype Caller on Each Chromosome for GVCF  
•	Step 8: Merge VCFs  

If Human:

•	Step 5: Base Recalibrator and Apply BQSR  
•	Step 6: Create Chromosome Channel  
•	Step 7: GATK Haplotype Caller on Each Chromosome  
•	Step 7a: GATK Haplotype Caller on Each Chromosome for GVCF   
•	Step 8: Merge VCFs  

Both Mouse and Human:

•	Step 9: GATK Select Variants SNP and INDEL (separately)  
•	Step 10: VCF Annotate SNP and INDEL  

If Mouse:

•	Step 11: SnpEff and dbSFP  
•	Step 12: SnpSift Extract Fields  

If Human:

•	Step 11: Cosmic Annotation SNP and INDEL (separately)  
•	Step 12: SnpEff and DBNSFP  
•	Step 13: SnpSift Extract Fields  

Both Mouse and Human:

•	Step 14: Aggregate Stats  
•	Step 15: MultiQC report generation 

WGS Mouse Flowchart

flowchart TD
	p1((Sample))
	p2[FASTP]
	p3[FASTQC]
	%% p5[READ_GROUPS]
	p6[BWA_MEM]
    o1([Genomic BAM]):::output
    p7[PICARD_SORTSAM]
	p8[PICARD_MARKDUPLICATES]
	p11[PICARD_COLLECTWGSMETRICS]
	p12[PICARD_COLLECTALIGNMENTSUMMARYMETRICS]
	p14[GATK_HAPLOTYPECALLER]
    o3([Raw Variant Calls]):::output
    p18[GATK_SELECTVARIANTS_SNP]
    p18.5[GATK_VARIANTFILTRATION_SNP]
    p18.6[SNPSIFT_ANNOTATE_DBSNP_SNP]
	p19[GATK_SELECTVARIANTS_INDEL]
    p19.5[GATK_VARIANTFILTRATION_INDEL]
    p19.6[SNPSIFT_ANNOTATE_DBSNP_INDEL]
    p19.7[GATK_MERGEVCF]
	p20[SNPEFF]
	p21[SNPEFF_ONEPERLINE]
	p22[SNPSIFT_EXTRACTFIELDS]
    o4([Filtered SNPs]):::output
    o5([Filtered InDELs]):::output
    o6([Filtered Annotated VCF]):::output
    o7([Variant Table]):::output
	p33[MULTIQC]
    o8([MultiQC Report]):::output
    p1 --> |Raw Reads| p2
    subgraph alignment [  ]
    p2 --> p6
    p6 --> p7
    p7 --> p8
    p8 --> o1
    end

    subgraph calling [  ]
    o1 --> p14
    p14 --> o3
    o3 --> p18
    o3 --> p19
    p18 --> p18.5
    p18.5 --> p18.6
    p18.6 --> o4
    p19 --> p19.5
    p19.5 --> p19.6
    p19.6 --> o5
    o4 --> p19.7
    o5 --> p19.7
    p19.7 --> p20
    p20 --> p21
    p21 --> p22
    p20 --> o6
    p22 --> o7

    end
    subgraph qc [  ]
    p2 --> p3
    o1 --> p11
    o1 --> p12
    p2 --> p33
    p3 --> p33
    p8 --> p33
    p11 --> p33
    p12 --> p33
    p33 --> o8
    end

classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000
style alignment stroke:#333,stroke-width:2px
style calling stroke:#333,stroke-width:2px
style qc stroke:#333,stroke-width:2px
Loading

WGS Human Flowchart

flowchart TD
	p1((Sample))
	p2[FASTP]
	p3[FASTQC]
	%% p5[READ_GROUPS]
	p6[BWA_MEM]
    o1([Genomic BAM]):::output
    p7[PICARD_SORTSAM]
	p8[PICARD_MARKDUPLICATES]
	p11[PICARD_COLLECTWGSMETRICS]
	p12[PICARD_COLLECTALIGNMENTSUMMARYMETRICS]
	p14[GATK_HAPLOTYPECALLER]
    o3([Raw Variant Calls]):::output
    p18[GATK_SELECTVARIANTS_SNP]
    p18.5[GATK_VARIANTFILTRATION_SNP]
    p18.6[SNPSIFT_ANNOTATE_DBSNP_SNP]
    p18.65[SNPSIFT_ANNOTATE_COSMIC_INDEL]
	p19[GATK_SELECTVARIANTS_INDEL]
    p19.5[GATK_VARIANTFILTRATION_INDEL]
    p19.6[SNPSIFT_ANNOTATE_DBSNP_INDEL]
    p19.65[SNPSIFT_ANNOTATE_COSMIC_INDEL]
    p19.8[GATK_MERGEVCF]
	p20[SNPEFF_SNP]
	p21[SNPEFF_ONEPERLINE_SNP]
    p20.5[SNPEFF_INDEL]
	p21.5[SNPEFF_ONEPERLINE_INDEL]
    p19.7[GATK_MERGEVCF]
	p22[SNPSIFT_EXTRACTFIELDS]
    
    o4([Filtered SNV VCF]):::output
    o5([Filtered InDELs]):::output
    o6([Filtered Annotated VCF]):::output
    o7([Variant Table]):::output
	p33[MULTIQC]
    o8([MultiQC Report]):::output
    p1 --> |Raw Reads| p2
    subgraph alignment [  ]
    p2 --> p6
    p6 --> p7
    p7 --> p8
    p8 --> o1
    end

    subgraph calling [  ]
    o1 --> p14
    p14 --> o3
    o3 --> p18
    o3 --> p19
    p18 --> p18.5
    p18.5 --> p18.6
    p18.6 --> p18.65
    p18.65 --> p19.8
    p19 --> p19.5
    p19.5 --> p19.6
    p19.6 --> p19.65
    p19.65 --> p19.8
    p19.8 --> o4
    p18.65 --> p20
    p20 --> p21
    p19.65 --> p20.5
    p20.5 --> p21.5
    p21 --> p19.7
    p21.5 --> p19.7
    %% o4 --> p19.7
    %% o5 --> p19.7

    p19.7 --> p22
    p19.7 --> o6
    p22 --> o7

    end
    subgraph qc [  ]
    p2 --> p3
    o1 --> p11
    o1 --> p12
    p2 --> p33
    p3 --> p33
    p8 --> p33
    p11 --> p33
    p12 --> p33
    p33 --> o8
    end

classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000
style alignment stroke:#333,stroke-width:2px
style calling stroke:#333,stroke-width:2px
style qc stroke:#333,stroke-width:2px
Loading

Parameters for WGS Pipeline

  • --pubdir

    • Default: /<PATH>
    • Comment: The directory that the saved outputs will be stored.
  • --organize_by

    • Default: sample
    • Comment: How to organize the output folder structure. Options: sample or analysis.
  • --cacheDir

    • Default: /projects/omics_share/meta/containers
    • Comment: This is directory that contains cached Singularity containers. JAX users should not change this parameter.
  • -w

    • Default: /<PATH>
    • Comment: The directory that all intermediary files and nextflow processes utilize. This directory can become quite large. This should be a location on /fastscratch or other directory with ample storage.
  • --sample_folder

    • Default: /<PATH>
    • Comment: The path to the folder that contains all the samples to be run by the pipeline. The files in this path can also be symbolic links.
  • --extension

    • Default: .fastq.gz
    • Comment: The expected extension for the input read files.
  • --pattern

    • Default: "*_R{1,2}*"
    • Comment: The expected R1 / R2 matching pattern. The default value will match reads with names like this READ_NAME_R1_MoreText.fastq.gz or READ_NAME_R1.fastq.gz
  • --read_type

    • Default: PE
    • Comment: Options: PE and SE. Default: PE. Type of reads: paired end (PE) or single end (SE).
  • --concat_lanes

    • Default: false
    • Comment: Options: false and true. Default: false. If this boolean is specified, FASTQ files will be concatenated by sample. Used in cases where samples are divided across individual sequencing lanes.
  • --csv_input

    • Default: null
    • Comment: Provide a CSV manifest file with the header: "sampleID,lane,fastq_1,fastq_2". See below for an example file. Fastq_2 is optional and used only in PE data. Fastq files can either be absolute paths to local files, or URLs to remote files. If remote URLs are provided, * --download_data can be specified.
  • --download_data

    • Default: null
    • Comment: Requires * --csv_input. When specified, read data in the CSV manifest will be downloaded from provided URLs with Aria2.
  • --gen_org

    • Default: mouse
    • Comment: Options: mouse and human.
  • --genome_build

    • Default: GRCm38
    • Comment: Mouse specific. Options: GRCm38 or GRCm39. If gen_org == human, build defaults to GRCh38.
  • --deduplicate_reads

    • Default: false
    • Comment: Options: false, true. If specified, run bbmap clumpify on input reads. Clumpify will deduplicate reads prior to trimming. This can help with mapping and downstream steps when analyzing high coverage WGS data.
  • --coverage_cap

  • --primary_chrom_bed

    • Default: '/projects/compsci/omics_share/human/GRCh38/genome/annotation/intervals/Homo_sapiens_assembly38.primary_chrom.bed'
    • Comment: A bed file containing the primary chromsomes with positions. Used in limiting jvarkit 'Biostar154220' to those regions with expected coverage.
  • --split_fastq

    • Default: false
    • Comment: If specified, FASTQ files will be split into chunks sized based on split_fastq_bin_size prior to mapping. This option is recommended for high coverage data.
  • --split_fastq_bin_size

    • Default: 10000000
    • Comment: If split_fastq is specified, FASTQ files will splint into chunks of this size prior to mapping.
  • --quality_phred

    • Default: 15
    • Comment: The quality value that is required for a base to pass. Default: 15 which is a phred quality score of >=Q15.
  • --unqualified_perc

    • Default: 40
    • Comment: Percent of bases that are allowed to be unqualified (0~100). Default: 40 which is 40%.
  • --detect_adapter_for_pe

    • Default: false
    • Comment: If true, adapter auto-detection is used for paired end data. By default, paired-end data adapter sequence auto-detection is disabled as the adapters can be trimmed by overlap analysis. However, --detect_adapter_for_pe will enable it. Fastp will run a little slower if you specify the sequence adapters or enable adapter auto-detection, but usually result in a slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
  • --ref_fa

    • Default: '/projects/omics_share/mouse/GRCm38/genome/sequence/ensembl/v102/Mus_musculus.GRCm38.dna.toplevel.fa'
    • Comment: The reference fasta to be used throughout the process for alignment as well as any downstream analysis, points to human reference when * --gen_org human. JAX users should not change this parameter.
  • --ref_fa_indices

    • Default: '/projects/omics_share/mouse/GRCm38/genome/indices/ensembl/v102/bwa/Mus_musculus.GRCm38.dna.toplevel.fa'
    • Comment: Pre-compiled BWA index files, points to human reference when * --gen_org human. JAX users should not change this parameter.
  • --chrom_contigs

    • Default: '/projects/omics_share/mouse/GRCm38/genome/sequence/ensembl/v102/Mus_musculus.GRCm38.dna.toplevel.primaryChr.contig_list'
    • Comment: A list of primary chromosomes, present in the reference file, points to human reference when * --gen_org human. Used to scatter variant calling by chromosome. JAX users should not change this parameter.
  • --mismatch_penalty

    • Default: -B 8
    • Comment: The BWA penalty for a mismatch.
  • --call_val

    • Default: 50
    • Comment: The minimum phred-scaled confidence threshold at which variants should be called.
  • --ploidy_val

    • Default: "-ploidy 2"
    • Comment: Sample ploidy
  • --dbSNP

    • Default: '/projects/omics_share/mouse/GRCm38/genome/annotation/snps_indels/GCA_000001635.6_current_ids.vcf.gz'
    • Comment: The dbSNP database contains known single nucleotide polymorphisms, and is used in the annotation of known variants. Points to human dbSNP when * --gen_org human.
  • --gen_ver

    • Default: "GRCm38.99"
    • Comment: snpEff genome version. Sets to 'hg38' when * --gen_org human
  • --snpEff_config

    • Default: "/projects/omics_share/mouse/GRCm38/genome/indices/snpEff_5_1/snpEff.config"
    • Comment: The configuration file used while running snpEff, points to human snpEff file when * --gen_org human. JAX users should not change this parameter.
  • --gold_std_indels

    • Default: '/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz’
    • Comment: Human Only - Used in GATK BaseRecalibrator. JAX users should not change this parameter.
  • --phase1_1000G

    • Default: '/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/1000G_phase1.snps.high_confidence.hg38.vcf.gz'
    • Comment: Human Only - Used in GATK BaseRecalibrator. JAX users should not change this parameter.
  • --dbNSFP

    • Default: '/projects/omics_share/human/GRCh38/genome/annotation/function/dbNSFP4.2a.gatk_formatted.txt.gz'
    • Comment: Used in variant annotation.
  • --dbSNP_index

    • Default: '/projects/omics_share/mouse/GRCm38/genome/annotation/snps_indels/GCA_000001635.6_current_ids.vcf.gz.tbi'
    • Comment: Index file for dbSNP
  • --cosmic

    • Default: '/projects/omics_share/human/GRCh38/genome/annotation/function/COSMICv95_Coding_Noncoding.gatk_formatted.vcf'
    • Comment: Human Only - Used in variant annotation.
  • --cosmic_index

    • Default: '/projects/omics_share/human/GRCh38/genome/annotation/function/COSMICv95_Coding_Noncoding.gatk_formatted.vcf.gz.tbi'
    • Comment: Human Only - Index file for cosmic annotation file.

Pipeline Default Outputs

NOTE: * Represents a wild card that is a placeholder for values that will be filled by input file names and/or parameters when the pipeline is run.

Naming Convention Description
wgs_report.html Nextflow autogenerated report
multiqc MultiQC report summarizing quality metrics across samples in the analysis run.
*/bam/*.bam Final BAM alignment file
*/bam/*.bai Final BAM index file
*/*_GATKcombined_raw.vcf VCF from GATK_HAPLOTYPECALLER
*/*_GATKcombined_raw.gvcf GVCF output from GATK_HAPLOTYPECALLER. When run_gvcf = true.
*/*_INDEL_filtered_dbsnpID.vcf Filtered unannotated INDELs only
*/*_SNP_filtered_dbsnpID.vcf Filtered unannotated SNPs only
*/*SNP_INDEL_filtered_unannotated_final.vcf Final VCF file, with filtered unannotated INDEL and SNP calls
*/*SNP_INDEL_filtered_annotated_final.vcf Final VCF file, with filtered SNPeff annotated INDEL and SNP calls. See SNPEff notes below
*/*_snpsift_finalTable.txt Extracted fields from final VCF, in tabular format. From SNPSIFT_EXTRACTFIELDS
*/stats/*fastp_report.html FastP trimming metrics
*/stats/*_dup_metrics.txt Duplicate metrics from PICARD_MARKDUPLICATES
*/stats/*_recal_data.table (human only) recalibrated table for BQSR from GATK_BASERECALIBRATOR
*/stats/*_AlignmentMetrics.txt Alignment metrics from PICARD_COLLECTALIGNMENTSUMARYMETRICS

NOTE: In the final VCF file *SNP_INDEL_filtered_unannotated_final.vcf, the number of variants, will not match un-annotated variant counts (e.g., *SNP_INDEL_filtered_annotated_final.vcf). This difference in variant count is a function of SNPeff annotation.

From the SNPeff documentation:

Counting variants / annotations

It is important to remember that the VCF format specification allows having multiple variants in a single line. Also, a single variant can have more than one annotation, due to:

  * Multiple transcripts (isoforms) of a gene (e.g. the human genome has on average 8.8 transcripts per gene)
  * Multiple (overlapping) genes in the genomic location of the variant.
  * A variant spanning multiple genes (e.g. a translocation, large deletion, etc.)

When you count the number of variants, you must keep all these in mind to count them properly. Obviously, SnpEff does take all this into account when counting the variants for the summary HTML.

Typical counting mistake

Many people who claim that there is a mismatch between the number of variants in the summary (HTML) file and the number of variants in the VCF file, are just making mistakes when counting the variants because they forget one or more of these previous items.

A typical scenario is, for example, that people are "counting missense variants" using something like this:

grep missense file.vcf | wc -l

This is counting "lines in a VCF file that have at least one missense variants", as opposed to counting "missense annotations" and, as mentioned previously, the number of lines in a VCF file is not the same as the number of annotations or the number of variants.

Pipeline Options Outputs

These output will only be saved when --keep_intermediate true is specified.

Naming Convention Description (--keep_intermediate true)
*/*_read_group.txt Read groups for fastq files from READ_GROUPS
*/*.sam Sam alignment file from BWA_MEM
*/*.aligner.intervals Aligner intervals from GATK_REALIGNERTARGETCREATOR
*/bam/*_sortsam.bam Sorted bam from PICARD_SORTSAM
*/bam/*_realigned_BQSR.bam (human only) output from GATK_APPLYBQSR

CSV Input Sample Sheet

The required input header is: sampleID,lane,fastq_1,fastq_2. Samples can be provided either paired or un-paired.

  • The sampleID column is a unique identifies for each individual sample, which is associated with other samples based on status and patient ID.
  • The lane column contains lane information for individual samples. If a single sample ID is provided with multiple lanes, the sequences from each lane will be concatenated prior to analysis.
  • The fastq_1 and fastq_2 columns must contain absolute paths or URLs to read 1 and read 2 from an Illumina paired-end sequencing run.

Basic examples:

An example PE csv file:

sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz,/path/to/sample_101_001_R2.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz,/path/to/sample_10191_001_R2.fastq.gz

An example SE csv file:

sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz
Clone this wiki locally