Human PTA ReadMe
For all input samples:
• Fastp read quality and adapter trimming
• Get Read Group Information
• BWA-MEM Alignment
• Picard SortSam and Mark Duplicates
• Base Recalibrator and Apply BQSR
• Collect Alignment Summary Metrics
If paired sample:
• Conpair for sample contamination analysis
• Germline variant calling
• Germline variant filtering
• Germline variant annotation
For all samples (NA12878 used if unpaired):
• GATK Mutect2 variant caller
• Filter Mutect2 calls
• Lancet variant caller
• Manta SV caller
• Strelka2 SNV and InDEL caller
• Gridss SV caller
• Gripss SV filter of Gridss calls
• Bicseq2 sample normalization for normal and tumor
• Bicseq2 segmentation CNV calling (paired for paired, or on single sample for unpaired)
• MSIsensor2 MSI calling
• VCF Merge prep steps. A number of scripts are applied to ensure samples within VCF files are properly oriented prior to merging, and calls are properly formatted
• Intergenic variant rescue and confirmation via Lancet. A number of scripts are applied to determine if intergenic variants are supported by the caller Lancet
• SNV VCF files are merged across all SNV callers
• Panel of normal filtering is applied
• VEP annotation of merged SNVs
• COSMIC cancer resistance mutations and COSMIC cancer gene annotations are added to merged SNV calls
• Bicseq2 CNV calls are annotated
• SV calls are merged across callers
• SV calls are annotated
• SV calls are annotated with CNV regions
• A final filter is applied to CNV annotated SV calls
• MultiQC report generation
flowchart TD
p00((CSV Sample Sheet))
p01[PARSE_SAMPLE_SHEET:CONCATENATE_READS]
p00 --> p01
p01 --> |Tumor Sample| t02
p01 --> |"`Normal Sample:
**If Normal Sample
Provided**`"| n02
subgraph tumor [ ]
t02[FASTP]
opt1[XENGSORT_CLASSIFY]
t05[BWA_MEM]
t06[PICARD_SORTSAM]
t07[SHORT_ALIGNMENT_MARKING]
t08[PICARD_CLEANSAM]
t09[PICARD_FIX_MATE_INFORMATION]
t10[PICARD_MARKDUPLICATES]
t11[GATK_BASERECALIBRATOR]
t12[GATK_APPLYBQSR]
%% t16[GATK_GETSAMPLENAME_TUMOR]
t01([Tumor Genomic Bam]):::output
t02 -..-> |PDX Sample| opt1
t02 --> |Human Sample| t05
opt1 -..-> |Human Reads| t05
t05 --> t06
t06 --> t07
t07 --> t08
t08 --> t09
t09 --> t10
t10 --> t11
t11 --> t12
t12 --> t01
end
subgraph normal [ ]
n02[FASTP]
n05[BWA_MEM]
n06[PICARD_SORTSAM]
n07[SHORT_ALIGNMENT_MARKING]
n08[PICARD_CLEANSAM]
n09[PICARD_FIX_MATE_INFORMATION]
n10[PICARD_MARKDUPLICATES]
n11[GATK_BASERECALIBRATOR]
n12[GATK_APPLYBQSR]
%% n15[GATK_GETSAMPLENAME_NORMAL]
n01([Normal Genomic Bam]):::output
n02 --> n05
n05 --> n06
n06 --> n07
n07 --> n08
n08 --> n09
n09 --> n10
n10 --> n11
n11 --> n12
n12 --> n01
end
subgraph germline [ ]
n01 -..-> |If Normal\nSample Provided|p20
p20[GATK_HAPLOTYPECALLER_SV_GERMLINE]
p21[GATK_SORTVCF_GERMLINE]
p22[GATK_GENOTYPE_GVCF]
p23[GATK_CNNSCORE_VARIANTS]
p24[GATK_SORTVCF_GENOTYPE]
p25[GATK_FILTER_VARIANT_TRANCHES]
p26[GATK_VARIANTFILTRATION_AF]
p27[BCFTOOLS_GERMLINE_FILTER]
%% p28[BCFTOOLS_SPLITMULTIALLELIC_REGIONS]
%% p29[VEP_GERMLINE]
%% p30[BCFTOOLS_REMOVESPANNING]
%% p31[COSMIC_ANNOTATION]
%% p32[COSMIC_CANCER_RESISTANCE_MUTATION_GERMLINE]
%% p33[SNPSIFT_ANNOTATE_DBSNP_GERMLINE]
%% p34[GERMLINE_VCF_FINALIZATION]
annot_summary1{{Germline Annotation via VEP and COSMIC\nSteps truncated for figure clarity}}
o3([Germline Variants]):::output
o4([Annotated Germline Variants]):::output
p20 --> p21
p21 --> p22
p22 --> p23
p23 --> p24
p24 --> p25
p25 --> p26
p26 --> p27
p27 --> o3
o3 --> annot_summary1
subgraph germline_annot [ ]
%% p27 --> p28
%% p28 --> p29
%% p29 --> p30
%% p30 --> p31
%% p31 --> p32
%% p32 --> p33
%% p33 --> p34
%% p34 --> o4
annot_summary1 --> o4
end
end
n01 -..-> |If Normal\nSample Provided| m1
altBAM -..-> |If No Normal\nSample Provided| m1
t01 --> m1
altBAM[(NA12878\nBAM)]
m1 --> p35
m1 --> p39
m1 --> p41
m1 --> p42
m1 --> p43
%% m1 --> p52
%% m1 --> p53
m1 --> summary1
m1((Join:\nTumor & Normal))
%% m1 -..-> |If Normal\nSample Provided| p17
%% m1 -..-> |If Normal\nSample Provided| p18
m1 -..-> |If Normal\nSample Provided| summary2
subgraph conpair [ ]
%% p17[CONPAIR_NORMAL_PILEUP]
%% p18[CONPAIR_TUMOR_PILEUP]
summary2{{CONPAIR: Tumor and Normal Pileups}}
p19[CONPAIR]
o2([Conpair sample contam. results]):::output
%% p17 --> p19
%% p18 --> p19
summary2 --> p19
p19 --> o2
end
subgraph somatic_variant [ ]
p35[GATK_MUTECT2]
p36[GATK_SORTVCF_MUTECT]
p37[GATK_MERGEMUTECTSTATS]
p38[GATK_FILTERMUECTCALLS]
o5([Mutect2 SNV Calls]):::output
p39[LANCET]
p40[GATK_SORTVCF_LANCET]
o6([Lancet SNV Calls]):::output
p41[MANTA]
o7([Manta SV and SNV Calls]):::output
p42[STRELKA2]
o8([Strelka2 SNV Calls]):::output
p43[GRIDSS_PREPROCESS]
p44[GRIDSS_ASSEMBLE]
p45[GRIDSS_CALLING]
p46[GRIDSS_CHROM_FILTER]
p47[GRIPSS_SOMATIC_FILTER]
o9([Gridss SV Calls]):::output
%% p48[SAMTOOLS_STATS_INSERTSIZE_NORMAL]
%% p49[SAMTOOLS_STATS_INSERTSIZE_TUMOR]
%% p50[SAMTOOLS_FILTER_UNIQUE_NORMAL]
%% p51[SAMTOOLS_FILTER_UNIQUE_TUMOR]
%% p52[BICSEQ2_NORMALIZE_NORMAL]
%% p53[BICSEQ2_NORMALIZE_TUMOR]
summary1{{BICSEQ2\nPreprocessing Steps Truncated for Clarity}}
p54[BICSEQ2_SEG]
p55[BICSEQ2_SEG_UNPAIRED]
o10([BICSEQ2 CNV Calls]):::output
p35 --> p36
p36 --> p37
p37 --> p38
p38 --> o5
p39 --> p40
p40 --> o6
p41 --> o7
o7 --> p42
p42 --> o8
p43 --> p44
p44 --> p45
p45 --> p46
p46 --> p47
p47 --> o9
summary1 --> p54
summary1 -..-> |If No Normal\nSample Provided| p55
p54 --> o10
p55 --> o10
note1{{Merge prep.\nSteps truncated}}
note2{{Merge prep.\nSteps truncated}}
note3{{Merge prep.\nSteps truncated}}
notea4{{Merge prep.\nSteps truncated}}
%% p57[RENAME_METADATA]
%% p58[MERGE_PREP]
%% p59[RENAME_VCF]
%% p60[COMPRESS_INDEX_VCF]
%% p61[BCFTOOLS_SPLITMULTIALLELIC]
%% p62[SPLIT_MNV]
%% p63[GATK_SORTVCF_TOOLS]
o5 --> note1
o6 --> note2
o7 --> notea4
o8 --> note3
end
note1 --> p64
note2 --> p64
note3 --> p64
notea4 --> p64
p64[BCFTOOLS_MERGECALLERS]
%% p65[COMPRESS_INDEX_VCF_ALL_CALLERS]
p64 --> note4
subgraph lancet_confirm [ ]
%% NOTE: There are many lancet confirm steps. First prep, then confirm, then re-merge.
%% p66[BEDTOOLS_STARTCANDIDATES]
%% p67[GET_CANDIDATES]
%% p68[COMPRESS_INDEX_VCF_REGION]
%% p69[VCF_TO_BED]
%% p70[LANCET_CONFIRM]
%% p71[COMPRESS_INDEX_VCF_REGION_LANCET]
%% p72[BCFTOOLS_INTERSECTVCFS]
%% p73[RENAME_METADATA_LANCET]
%% p74[MERGE_PREP_LANCET]
%% p75[RENAME_VCF_LANCET]
%% p76[COMPRESS_INDEX_VCF_LANCET]
%% p77[BCFTOOLS_SPLITMULTIALLELIC_LANCET]
%% p78[SPLIT_MNV_LANCET]
%% p79[REMOVE_CONTIG]
%% p80[GATK_SORTVCF_TOOLS_LANCET]
%% p81[BCFTOOLS_MERGECALLERS_FINAL]
%% p82[COMPRESS_INDEX_VCF_MERGED]
%% p83[MERGE_COLUMNS]
%% p84[ADD_NYGC_ALLELE_COUNTS]
%% p85[ADD_FINAL_ALLELE_COUNTS]
%% p86[FILTER_PON]
%% p87[FILTER_VCF]
%% p88[SNV_TO_MNV_FINAL_FILTER]
%% p89[GATK_SORTVCF_SOMATIC]
%% p90[REORDER_VCF_COLUMNS]
%% p91[COMPRESS_INDEX_MERGED_VCF]
note0[[The steps in this subgraph are truncated for clarity]]
note4{{Extract non exonic variant calls}}
note5{{Confirm non exonic variant calls with Lancet}}
note6{{Merge Lancet confirmed to exonic calls}}
note4 --> note5
note5 --> note6
%% note6 --> note7
end
subgraph snv_annotate [ ]
p92[VEP_SOMATIC]
p93[COSMIC_ANNOTATION_SOMATIC]
p94[COSMIC_CANCER_RESISTANCE_MUTATION_SOMATIC]
p95[SNPSIFT_ANNOTATE_DBSNP_SOMATIC]
p96[SOMATIC_VCF_FINALIZATION]
o11([Annotated filtered\nSomatic SNV and InDELs Calls]):::output
%% p105[:FILTER_BEDPE]
%% p106[:FILTER_BEDPE_SUPPLEMENTAL]
note6 --> p92
p92 --> p93
p93 --> p94
p94 --> p95
p95 --> p96
p96 --> o11
%% note2{{}}
end
subgraph cnv_sv_annotate [ ]
o10 --> p97
p97[ANNOTATE_BICSEQ2_CNV]
o12([Annotated CNV Regions])
o7 --> p98
o9 --> p98
p98[MERGE_SV]
p99[ANNOTATE_SV]
p100[ANNOTATE_SV_SUPPLEMENTAL]
p101[ANNOTATE_GENES_SV]
p102[ANNOTATE_GENES_SV_SUPPLEMENTAL]
p103[ANNOTATE_SV_WITH_CNV]
p104[ANNOTATE_SV_WITH_CNV_SUPPLEMENTAL]
o13([Annotated SV Calls])
o14([Annotated SV])
p97 --> o12
p98 --> p99
p99 --> p100
p100 --> p101
p101 --> p102
p102 --> o13
p102 --> p103
p97 --> p103
p103 --> p104
p104 --> o14
end
o11 ~~~ note42
subgraph qc [ ]
temp0((Fastq Files\nFrom Above))
temp1((Genomic BAMs\nFrom Above))
temp2((Logs from:\nBWA\nTrimming\nMark Duplicates))
p03[FASTQC]
p13[PICARD_COLLECTALIGNMENTSUMMARYMETRICS]
p14[PICARD_COLLECTWGSMETRICS]
p142[MULTIQC]
o15([MultiQC Report]):::output
note42[[For clarity\nQC steps not connected to main graph]]
temp0 --> p03
temp1 --> p13
temp1 --> p14
temp2 --> p142
p03 --> p142
p13 --> p142
p14 --> p142
p142 --> o15
%% n02 --> p03
%% t02 --> p03
%% to1 --> p13
%% to1 --> p14
%% no1 --> p13
%% no1 --> p14
end
subgraph msi [ ]
t01 --> p56[MSISENSOR2_MSI] --> msi_output([MSI Status])
end
classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000
style tumor stroke:#333,stroke-width:2px
style normal stroke:#333,stroke-width:2px
style germline stroke:#333,stroke-width:2px
style conpair stroke:#333,stroke-width:2px
style somatic_variant stroke:#333,stroke-width:2px
style lancet_confirm stroke:#333,stroke-width:2px
style snv_annotate stroke:#333,stroke-width:2px
style cnv_sv_annotate stroke:#333,stroke-width:2px
style msi stroke:#333,stroke-width:2px
style qc stroke:#333,stroke-width:2px
- --pubdir
  - Default: /<PATH>
  - Comment: The directory where saved outputs will be stored.
- --organize_by
  - Default: sample
  - Comment: How to organize the output folder structure. Options: sample or analysis.
- --cacheDir
  - Default: /projects/omics_share/meta/containers
  - Comment: The directory that contains cached Singularity containers. JAX users should not change this parameter.
- -w
  - Default: /<PATH>
  - Comment: The directory used for all intermediary files and Nextflow processes. This directory can become quite large, so it should be a location on /fastscratch or another directory with ample storage.
- --csv_input
  - Default: /<FILE_PATH>
  - Comment: CSV sample sheet that controls how samples are processed. The required input header is: patient,sex,status,sampleID,lane,fastq_1,fastq_2. See the note below on this page for additional information on the file format.
- --pdx
  - Default: false
  - Comment: Options: false, true. If specified, 'Xengsort' is run on reads to deconvolute human and mouse reads. Human only reads are used in analysis.
- --xengsort_host_fasta
  - Default: '/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/imputed/rel_2112_v8/NOD_ShiLtJ.39.fa'
  - Comment: Xengsort host fasta file. Used by Xengsort Index when --pdx is run and xengsort_idx_path is null or false.
- --xengsort_idx_path
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/xengsort'
  - Comment: Xengsort index for deconvolution of human and mouse reads. Used when --pdx is run. If null, Xengsort Index is run using ref_fa and host_fa.
- --xengsort_idx_name
  - Default: 'hg38_GRCm39-NOD_ShiLtJ'
  - Comment: Xengsort index name associated with files located in xengsort_idx_path, or the name given to outputs produced by Xengsort Index.
- --deduplicate_reads
  - Default: false
  - Comment: Options: false, true. If specified, run bbmap clumpify on input reads. Clumpify will deduplicate reads prior to trimming. This can help with mapping and downstream steps when analyzing high coverage WGS data.
- --coverage_cap
  - Default: null
  - Comment: If an integer value is specified, jvarkit 'Biostar154220' is used to cap coverage at that value. See: http://lindenb.github.io/jvarkit/Biostar154220.html
- --primary_chrom_bed
  - Default: '/projects/compsci/omics_share/human/GRCh38/genome/annotation/intervals/Homo_sapiens_assembly38.primary_chrom.bed'
  - Comment: A bed file containing the primary chromosomes with positions. Used in limiting jvarkit 'Biostar154220' to those regions with expected coverage.
- --split_fastq
  - Default: false
  - Comment: If specified, FASTQ files will be split into chunks sized based on split_fastq_bin_size prior to mapping. This option is recommended for high coverage data.
- --split_fastq_bin_size
  - Default: 10000000
  - Comment: If split_fastq is specified, FASTQ files will be split into chunks of this size prior to mapping.
- --quality_phred
  - Default: 15
  - Comment: The quality value that is required for a base to pass. Default: 15, which is a phred quality score of >=Q15.
- --unqualified_perc
  - Default: 40
  - Comment: Percent of bases that are allowed to be unqualified (0~100). Default: 40, which is 40%.
- --detect_adapter_for_pe
  - Default: false
  - Comment: If true, adapter auto-detection is used for paired-end data. By default, adapter sequence auto-detection is disabled for paired-end data, as the adapters can be trimmed by overlap analysis; --detect_adapter_for_pe enables it. Fastp will run a little slower if you specify the adapter sequences or enable adapter auto-detection, but this usually results in slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
- --ref_fa
  - Default: '/projects/omics_share/human/GRCh38/genome/sequence/gatk/Homo_sapiens_assembly38.fasta'
  - Comment: The reference fasta used throughout the process for alignment as well as any downstream analysis. Points to the human reference when --gen_org human. JAX users should not change this parameter.
- --ref_fa_indices
  - Default: '/projects/omics_share/human/GRCh38/genome/indices/gatk/bwa/Homo_sapiens_assembly38.fasta'
  - Comment: Pre-compiled BWA index files. JAX users should not change this parameter.
- --ref_fa_dict
  - Default: '/projects/omics_share/human/GRCh38/genome/sequence/gatk/Homo_sapiens_assembly38.dict'
  - Comment: FASTA dictionary file. JAX users should not change this parameter.
- --combined_reference_set
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/combined_ref_set/Homo_sapiens_assembly38.fasta'
  - Comment: Several tools (GRIDSS, SVABA) require reference and bwa index files in the same directory. Links are used within this directory to avoid duplication of fasta and bwa indices. See note in directory.
- --mismatch_penalty
  - Default: -B 8
  - Comment: The BWA penalty for a mismatch.
- --gold_std_indels
  - Default: '/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz'
  - Comment: Used in GATK BaseRecalibrator and variant tranche recalibration. Derived from the GATK resource bundle. JAX users should not change this parameter.
- --phase1_1000G
  - Default: '/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/1000G_phase1.snps.high_confidence.hg38.vcf.gz'
  - Comment: Used in GATK BaseRecalibrator. Derived from the GATK resource bundle. JAX users should not change this parameter.
- --dbSNP
  - Default: '/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/dbsnp_151.vcf.gz'
  - Comment: Used in variant annotation, GATK BaseRecalibrator, variant tranche recalibration, and by SVABA. JAX users should not change this parameter.
- --dbSNP_index
  - Default: '/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/dbsnp_151.vcf.gz.tbi'
  - Comment: Index associated with the dbSNP file.
- --chrom_contigs
  - Default: '/projects/omics_share/human/GRCh38/genome/sequence/gatk/Homo_sapiens_assembly38.primaryChr.contig_list'
  - Comment: Contig list used for scatter / gather in calling and annotation.
- --chrom_intervals
  - Default: '/projects/omics_share/human/GRCh38/genome/annotation/intervals/hg38_calling_intervals/'
  - Comment: Chromosome intervals used for scatter / gather in calling.
- --call_val
  - Default: 50
  - Comment: The minimum phred-scaled confidence threshold at which variants should be called.
- --ploidy_val
  - Default: '-ploidy 2'
  - Comment: Sample ploidy used by HaplotypeCaller in germline small variant calling.
- --excludeIntervalList
  - Default: '/projects/compsci/omics_share/human/GRCh38/genome/annotation/intervals/hg38_haplotypeCaller_skip.interval_list'
  - Comment: Germline caller exclusion list.
- --hapmap
  - Default: '/projects/compsci/omics_share/human/GRCh38/genome/annotation/snps_indels/hapmap_3.3.hg38.vcf.gz'
  - Comment: Variant tranche recalibration requirement. Derived from the GATK resource bundle.
- --omni
  - Default: '/projects/compsci/omics_share/human/GRCh38/genome/annotation/snps_indels/1000G_omni2.5.hg38.vcf.gz'
  - Comment: Variant tranche recalibration requirement. Derived from the GATK resource bundle.
- --pon_bed
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/filtering/WGS_1000g_GRCh38.pon.bed'
  - Comment: Panel of normal samples used in SNP and InDEL filtering.
- --intervalListBed
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/filtering/SureSelect_V6plusCOSMIC.target.GRCh38_full_analysis_set_plus_decoy_hla.interval_list.bed'
  - Comment: This file is used to extract small variants in non-exonic regions. The pipeline then attempts to recover such calls via Lancet.
- --lancet_beds_directory
  - Default: '/projects/omics_share/human/GRCh38/genome/annotation/intervals/lancet_chr_beds/'
  - Comment: Lancet interval bed files used in calling by that tool.
- --mappability_directory
  - Default: '/projects/compsci/omics_share/human/GRCh38/genome/annotation/intervals/mappability'
  - Comment: Bicseq2 input requirement. Derived from the tool developer resource pack.
- --bicseq2_chromList
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/configs/sampleId.bicseq2.config'
  - Comment: Bicseq2 config requirement. Derived from the tool developer resource pack.
- --bicseq2_no_scaling
  - Default: false
  - Comment: false: estimate the 'lambda' smoothing factor from data for CNV profile calling. true: use the standard 'lambda 4' smoothing for CNV profile calling. If BicSeq2 fails with an error, set this parameter to 'true'.
- --germline_filtering_vcf
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/filtering/gnomad-and-ALL_GRCh38_sites.20170504.normalized.modified.PASS.vcf.gz'
  - Comment: Germline reference file used in Gridss SV call filtering. Provided by the tool developer resource pack.
- --gripss_pon
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/gripss_pon'
  - Comment: Panel of normal files for Gripss SV call filtering. Provided by the tool developer resource pack.
- --callRegions
  - Default: '/projects/compsci/omics_share/human/GRCh38/genome/annotation/intervals/GRCh38.callregions.bed.gz'
  - Comment: Manta calling regions. Provided by the tool developer resource pack.
- --strelka_config
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/configs/configureStrelkaSomaticWorkflow.py.ini'
  - Comment: Strelka input configuration. Provided by the tool developer resource pack.
- --msisensor_model
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/msisensor2/models_hg38'
  - Comment: Model files for MSI calling via MSIsensor2. Provided by the tool developer resource pack.
- --vep_cache_directory
  - Default: '/projects/compsci/omics_share/human/GRCh38/genome/annotation/vep_data'
  - Comment: VEP annotation cache. Cache provided is for Ensembl v109.
- --vep_fasta
  - Default: '/projects/compsci/omics_share/human/GRCh38/genome/sequence/ensembl/GRCh38.p13/Homo_sapiens.GRCh38.dna.primary_assembly.fa'
  - Comment: VEP requires an Ensembl based fasta. GRCh38.p13 is used for v97-v109.
- --cosmic_cgc
  - Default: '/projects/compsci/omics_share/human/GRCh38/genome/annotation/function/cancer_gene_census_v97.csv'
  - Comment: COSMIC Cancer Gene Census annotation file. Index for file required within same location.
- --cosmic_cancer_resistance_muts
  - Default: '/projects/compsci/omics_share/human/GRCh38/genome/annotation/function/CosmicResistanceMutations.tsv.gz'
  - Comment: COSMIC Resistance Mutations file. Index for file required within same location.
- --ensembl_entrez
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/GRCh39.p13_ensemblv109_entrez_id_map.csv'
  - Comment: Ensembl to Entrez gene ID to HGNC symbol mapping file. Used in somatic VCF finalization.
- --cytoband
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/GRCh38.cytoBand.UCSC.chr.sorted.txt'
  - Comment: File used in bicseq2 annotations.
- --dgv
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/DGV.GRCh38_hg38_variants_2020-02-25.bed'
  - Comment: File used in bicseq2 annotations.
- --thousandG
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/1KGP.CNV.GRCh38.canvas.merged.bed'
  - Comment: File used in bicseq2 annotations.
- --cosmicUniqueBed
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/CosmicCompleteCNA_uniqIntervals.bed'
  - Comment: File used in bicseq2 annotations.
- --cancerCensusBed
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/cancer_gene_census.GRCh38-v92.bed'
  - Comment: File used in bicseq2 annotations and SV annotation.
- --ensemblUniqueBed
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/ensembl_genes_unique_sorted.final.v93.chr.sorted.bed'
  - Comment: File used in bicseq2 annotations and SV annotation.
- --gap
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/GRCh38.gap.UCSC.annotated.chr.sorted.bed'
  - Comment: File used in SV annotation.
- --dgvBedpe
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/DGV.GRCh38_hg38_variants_2020-02-25.bedpe'
  - Comment: File used in SV annotation.
- --thousandGVcf
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/1KGP.pruned_wAFs.PASS_and_MULTIALLELIC_Mosaic.GRCh38.vcf'
  - Comment: File used in SV annotation.
- --svPon
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/1000G-SV-PON.survivor-merged.GRCh38.filtered.bedpe'
  - Comment: File used in SV annotation.
- --cosmicBedPe
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/annotations/cosmic-sv-GRCh38-v92.bedpe'
  - Comment: File used in SV annotation.
- --na12878_bam
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/NA12878/NA12878_realigned_BQSR.bam'
  - Comment: NA12878 BAM file. Used in un-paired sample analysis.
- --na12878_bai
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/NA12878/NA12878_realigned_BQSR.bai'
  - Comment: NA12878 BAM index file. Used in un-paired sample analysis.
- --na12878_sampleName
  - Default: 'ERR194147_1.fastq.gz_filtered_trimmed'
  - Comment: NA12878 sample name within the NA12878 BAM file.
- --read_type
  - Default: PE
  - Comment: Only 'PE' is accepted for this workflow.
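As a rough illustration of how the user-facing parameters above fit together, the sketch below assembles an argument list for a `nextflow run` call. It only uses `-w`, `--pubdir`, `--csv_input`, `--organize_by`, and `--pdx` as described in the list; the function name `build_pta_command` and the `<PIPELINE>` placeholder are hypothetical, and any profile or workflow-selection options a real run may need are not covered here.

```python
import shlex


def build_pta_command(pipeline, pubdir, workdir, csv_input,
                      pdx=False, organize_by="sample"):
    """Assemble an argv list for a `nextflow run` call (illustrative only).

    `pipeline` is a placeholder for the pipeline script or repository;
    its real value is not given in this document.
    """
    if organize_by not in ("sample", "analysis"):
        raise ValueError("organize_by must be 'sample' or 'analysis'")
    return [
        "nextflow", "run", pipeline,
        "-w", workdir,                        # work directory for intermediary files
        "--pubdir", pubdir,                   # where saved outputs are stored
        "--csv_input", csv_input,             # sample sheet
        "--organize_by", organize_by,
        "--pdx", "true" if pdx else "false",  # run Xengsort for PDX samples
    ]


cmd = build_pta_command("<PIPELINE>", "/<PATH>/results", "/<PATH>/work",
                        "/<FILE_PATH>", pdx=True)
print(shlex.join(cmd))
```

Building the command as a list keeps each value a single argument, so paths containing spaces are not split by the shell.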
The required input header is: patient,sex,status,sampleID,lane,fastq_1,fastq_2. Samples can be provided either paired or un-paired.
- The patient column defines how samples are paired. All combinations of normal and tumor samples that share the same patient ID will be paired.
- The sex column is unused in the workflow at this time.
- The status column defines whether each sample is 'normal': 0 or 'tumor': 1.
- The sampleID column is a unique identifier for each individual sample, which is associated with other samples based on status and patient ID.
- The lane column contains lane information for individual samples. If a single sample ID is provided with multiple lanes, the sequences from each lane will be concatenated prior to analysis.
- The fastq_1 and fastq_2 columns must contain absolute paths to read 1 and read 2 from an Illumina paired-end sequencing run.
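The column rules above can be checked before launching a run. The following is a minimal sketch, not part of the pipeline: the function name `validate_sample_sheet` is hypothetical, and it assumes the header must match the documented column order exactly and that paths are POSIX-style.

```python
import csv
import io
from pathlib import PurePosixPath

REQUIRED_HEADER = ["patient", "sex", "status", "sampleID", "lane", "fastq_1", "fastq_2"]


def validate_sample_sheet(text):
    """Return a list of problems found in a sample sheet given as CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != REQUIRED_HEADER:
        return ["header must be " + ",".join(REQUIRED_HEADER)]
    problems = []
    # Data rows start on line 2 of the file (line 1 is the header).
    for line_no, row in enumerate(reader, start=2):
        if row["status"] not in ("0", "1"):
            problems.append(f"line {line_no}: status must be 0 (normal) or 1 (tumor)")
        for column in ("fastq_1", "fastq_2"):
            value = row[column] or ""  # short rows yield None for missing fields
            if not PurePosixPath(value).is_absolute():
                problems.append(f"line {line_no}: {column} must be an absolute path")
    return problems


sheet = """patient,sex,status,sampleID,lane,fastq_1,fastq_2
PATIENT_42,XX,0,NORMAL_1,L1,/path/to/normal1_R1.fastq.gz,/path/to/normal1_R2.fastq.gz
PATIENT_42,XX,2,TUMOR_1,L1,tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz
"""
for problem in validate_sample_sheet(sheet):
    print(problem)
```

The second data row is deliberately wrong (status 2 and a relative fastq_1 path) to show both checks firing.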
patient,sex,status,sampleID,lane,fastq_1,fastq_2
PATIENT_42,XX,0,NORMAL_1,L1,/path/to/normal1_R1.fastq.gz,/path/to/normal1_R2.fastq.gz
PATIENT_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz
In the example case above the following output directories will be generated:
PATIENT_42--NORMAL_1: Contains all NORMAL_1 sample specific files
PATIENT_42--TUMOR_1: Contains all TUMOR_1 specific files
PATIENT_42--TUMOR_1--NORMAL_1: Contains all TUMOR_1 by NORMAL_1 specific files
Additional information on outputs is provided below.
patient,sex,status,sampleID,lane,fastq_1,fastq_2
PATIENT_42,XX,0,NORMAL_1,L1,/path/to/normal1_LANE1_R1.fastq.gz,/path/to/normal1_LANE1_R2.fastq.gz
PATIENT_42,XX,0,NORMAL_1,L2,/path/to/normal1_LANE2_R1.fastq.gz,/path/to/normal1_LANE2_R2.fastq.gz
PATIENT_42,XX,0,NORMAL_1,L3,/path/to/normal1_LANE3_R1.fastq.gz,/path/to/normal1_LANE3_R2.fastq.gz
PATIENT_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz
In the example case above, the three lanes provided for the normal sample will be concatenated and the concatenated reads will be passed forward for analysis. Samples with a single lane are passed forward as provided. A mix of multi-lane and single-lane samples can be provided.
The following output directories will be generated:
PATIENT_42--NORMAL_1: Contains all NORMAL_1 sample specific files
PATIENT_42--TUMOR_1: Contains all TUMOR_1 specific files
PATIENT_42--TUMOR_1--NORMAL_1: Contains all TUMOR_1 by NORMAL_1 specific files
Additional information on outputs is provided below.
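The lane handling described above amounts to grouping rows by sampleID and collecting the read 1 and read 2 files per sample. The sketch below only builds that grouping; it is not the pipeline's implementation. The name `concatenation_plan` is hypothetical, and sorting by lane label is an assumption: what matters is that read 1 and read 2 files end up in the same lane order.

```python
from collections import defaultdict


def concatenation_plan(rows):
    """Group FASTQ paths by sampleID.

    rows: iterable of dicts with sampleID, lane, fastq_1 and fastq_2 keys.
    Returns {sampleID: {"fastq_1": [...], "fastq_2": [...]}} with both lists
    in the same (sorted) lane order.
    """
    by_sample = defaultdict(list)
    for row in rows:
        by_sample[row["sampleID"]].append(row)

    plan = {}
    for sample_id, sample_rows in by_sample.items():
        ordered = sorted(sample_rows, key=lambda r: r["lane"])
        plan[sample_id] = {
            "fastq_1": [r["fastq_1"] for r in ordered],
            "fastq_2": [r["fastq_2"] for r in ordered],
        }
    return plan


# Hypothetical rows: one normal sample on two lanes, one single-lane tumor.
example_rows = [
    {"sampleID": "NORMAL_1", "lane": "L2", "fastq_1": "/path/to/n_L2_R1.fastq.gz", "fastq_2": "/path/to/n_L2_R2.fastq.gz"},
    {"sampleID": "NORMAL_1", "lane": "L1", "fastq_1": "/path/to/n_L1_R1.fastq.gz", "fastq_2": "/path/to/n_L1_R2.fastq.gz"},
    {"sampleID": "TUMOR_1", "lane": "L1", "fastq_1": "/path/to/t_R1.fastq.gz", "fastq_2": "/path/to/t_R2.fastq.gz"},
]
for sample_id, files in concatenation_plan(example_rows).items():
    print(sample_id, len(files["fastq_1"]), "lane(s)")
```

A sample with a single entry in each list needs no concatenation and can be passed forward unchanged.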
patient,sex,status,sampleID,lane,fastq_1,fastq_2
PATIENT_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz
Note: In cases when tumor is provided without matched normal, NA12878 is used as a proxy normal sample in somatic small variant and somatic structural variant calling. CNV calling is done with BicSeq2 on the tumor sample alone. Germline calling on NA12878 is not done. Sample contamination and concordance via Conpair is also not done against NA12878. A mix of samples with and without pairs can also be provided.
The output directory structure of tumor only samples will be as follows:
PATIENT_42--TUMOR_1: Contains all TUMOR_1 specific files
PATIENT_42--TUMOR_1--NA12878: Contains all TUMOR_1 by NA12878 specific files
Additional information on outputs is provided below.
patient,sex,status,sampleID,lane,fastq_1,fastq_2
PATIENT_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz
PATIENT_101,XX,0,NORMAL_1,L1,/path/to/normal1_R1.fastq.gz,/path/to/normal1_R2.fastq.gz
PATIENT_101,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz
The output directory structure of samples will be as follows:
PATIENT_42--TUMOR_1: Contains all TUMOR_1 specific files
PATIENT_42--TUMOR_1--NA12878: Contains all TUMOR_1 by NA12878 specific files
PATIENT_101--NORMAL_1: Contains all NORMAL_1 sample specific files
PATIENT_101--TUMOR_1: Contains all TUMOR_1 specific files
PATIENT_101--TUMOR_1--NORMAL_1: Contains all TUMOR_1 by NORMAL_1 specific files
Additional information on outputs is provided below.
The workflow supports one-to-many, many-to-one, and many-to-many pairing of normal and tumor samples.
An example one to many analysis:
patient,sex,status,sampleID,lane,fastq_1,fastq_2
PATIENT_42,XX,0,NORMAL_1,L1,/path/to/normal1_R1.fastq.gz,/path/to/normal1_R2.fastq.gz
PATIENT_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz
PATIENT_42,XX,1,TUMOR_2,L1,/path/to/tumor2_R1.fastq.gz,/path/to/tumor2_R2.fastq.gz
PATIENT_42,XX,1,TUMOR_3,L1,/path/to/tumor3_R1.fastq.gz,/path/to/tumor3_R2.fastq.gz
In one-to-many, many-to-one, and many-to-many cases, all combinations of samples will be processed against one another.
In the example case above the following output directories will be generated:
PATIENT_42--NORMAL_1: Contains all NORMAL_1 sample specific files
PATIENT_42--TUMOR_1: Contains all TUMOR_1 specific files
PATIENT_42--TUMOR_2: Contains all TUMOR_2 specific files
PATIENT_42--TUMOR_3: Contains all TUMOR_3 specific files
PATIENT_42--TUMOR_1--NORMAL_1: Contains all TUMOR_1 by NORMAL_1 specific files
PATIENT_42--TUMOR_2--NORMAL_1: Contains all TUMOR_2 by NORMAL_1 specific files
PATIENT_42--TUMOR_3--NORMAL_1: Contains all TUMOR_3 by NORMAL_1 specific files
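The directory naming shown in the examples above follows a simple pattern: one directory per sample, one per tumor-by-normal combination within a patient, and NA12878 standing in when a patient has no normal sample. The sketch below reproduces that pattern as read from the examples; the function name `output_directories` is hypothetical, it assumes `--organize_by sample`, and it does not claim to match the pipeline's internals for cases the examples do not show (such as a normal sample with no tumor).

```python
from itertools import product


def output_directories(rows):
    """List expected output directory names.

    rows: iterable of (patient, status, sampleID) tuples, where status is
    "0" for normal and "1" for tumor.
    """
    samples = {}  # patient -> {"0": [normal IDs], "1": [tumor IDs]}
    for patient, status, sample_id in rows:
        groups = samples.setdefault(patient, {"0": [], "1": []})
        if sample_id not in groups[status]:  # multiple lanes repeat a sampleID
            groups[status].append(sample_id)

    dirs = []
    for patient, groups in samples.items():
        normals, tumors = groups["0"], groups["1"]
        # One directory per individual sample.
        dirs += [f"{patient}--{s}" for s in normals + tumors]
        if normals:
            # Every tumor is paired with every normal of the same patient.
            dirs += [f"{patient}--{t}--{n}" for t, n in product(tumors, normals)]
        else:
            # Tumor-only: NA12878 is the proxy normal; no NA12878 sample directory.
            dirs += [f"{patient}--{t}--NA12878" for t in tumors]
    return dirs


one_to_many = [
    ("PATIENT_42", "0", "NORMAL_1"),
    ("PATIENT_42", "1", "TUMOR_1"),
    ("PATIENT_42", "1", "TUMOR_2"),
    ("PATIENT_42", "1", "TUMOR_3"),
]
for name in output_directories(one_to_many):
    print(name)
```

Run on the one-to-many rows, this prints the seven directory names listed above.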
NOTE: * represents a wildcard that is a placeholder for values that will be filled by input file names and/or parameters when the pipeline is run.
NOTE: All files contained in 'stats' directories are captured by MultiQC reports.
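The `*` wildcard in the naming conventions of this section can be read like a shell-style glob. As a small illustration, the sketch below uses Python's standard `fnmatch` module to test whether a file path fits a convention; the helper name `matches_convention` and the sample file name are hypothetical.

```python
from fnmatch import fnmatchcase


def matches_convention(relative_path, convention):
    """Return True if relative_path fits a naming convention containing '*'."""
    # fnmatchcase is case-sensitive on every platform; '*' matches any run of
    # characters, including '/'.
    return fnmatchcase(relative_path, convention)


print(matches_convention("stats/TUMOR_1_dup_metrics.txt", "stats/*_dup_metrics.txt"))
print(matches_convention("bam/TUMOR_1_realigned_BQSR.bai", "bam/*_realigned_BQSR.bam"))
```

The first call matches and the second does not, since the file extension differs.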
The pipeline outputs several directories, each containing files that apply to an individual sample or to a combination of samples.
Following the example naming in the csv section above for "an example paired analysis":
Naming Convention | Description |
---|---|
*_haplotypecaller.gatk.filtered.vcf.gz | Final filtered SNP and InDEL calls from haplotypecaller. |
*_germline_snv_indel_annotated_filtered_final.vcf | Final filtered SNP and InDEL calls from haplotypecaller with VEP annotations. |
bam/*_realigned_BQSR.bam | Final duplicate marked, BQSR realigned bam file used in calling. |
bam/*_realigned_BQSR.bai | Bam index file. |
stats/*_stat | BWA alignment metrics. |
stats/*_AlignmentMetrics.txt | GATK Alignment metrics. |
stats/*_CollectWgsMetrics.txt | GATK collect WGS metrics output. |
stats/*_recal_data.table | GATK Baserecalibration table. |
stats/*_dup_metrics.txt | Picard mark duplicates output. |
stats/*_insert_size.txt | Estimated library insert size. |
stats/*_R1.fastq.gz_filtered_trimmed_fastqc.html | FastQC report. |
stats/*_R2.fastq.gz_filtered_trimmed_fastqc.html | FastQC report. |
stats/*R1.fastq.gz_filtered_trimmed_fastqc.zip | FastQC report. |
stats/*R2.fastq.gz_filtered_trimmed_fastqc.zip | FastQC report. |
NOTE: When tumor-only samples are run, there will be no <PATIENT>--NA12878 directory output, as files associated specifically with NA12878 are not relevant.
Naming Convention | Description |
---|---|
bam/*_realigned_BQSR.bam | Final duplicate marked, BQSR realigned bam file used in calling. |
bam/*_realigned_BQSR.bai | Bam index file. |
stats/*_stat | BWA alignment metrics. |
stats/*_AlignmentMetrics.txt | GATK Alignment metrics. |
stats/*_CollectWgsMetrics.txt | GATK collect WGS metrics output. |
stats/*_recal_data.table | GATK Baserecalibration table. |
stats/*_dup_metrics.txt | Picard mark duplicates output. |
stats/*_insert_size.txt | Estimated library insert size. |
stats/*_R1.fastq.gz_filtered_trimmed_fastqc.html | FastQC report. |
stats/*_R2.fastq.gz_filtered_trimmed_fastqc.html | FastQC report. |
stats/*R1.fastq.gz_filtered_trimmed_fastqc.zip | FastQC report. |
stats/*R2.fastq.gz_filtered_trimmed_fastqc.zip | FastQC report. |
msi/*msisensor | MSI Status. "The recommended msi score cutoff value is 20% (msi high: msi score >= 20%)" |
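The recommended MSI cutoff quoted above can be applied as a simple threshold. This is a minimal sketch that takes the score as a percentage; reading the score out of the msisensor output file is not shown, and the function name and returned labels are hypothetical.

```python
MSI_HIGH_CUTOFF_PERCENT = 20.0  # recommended cutoff quoted in the table above


def msi_status(msi_score_percent):
    """Classify an MSI score given in percent (msi high: score >= 20%)."""
    if msi_score_percent >= MSI_HIGH_CUTOFF_PERCENT:
        return "msi-high"
    return "not msi-high"


for score in (3.5, 20.0, 41.2):
    print(score, msi_status(score))
```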
Naming Convention | Description |
---|---|
*_concordance.txt | Sample concordance from Conpair. |
*_contamination.txt | Sample contamination from Conpair. |
*_cnv_annotated_final.bed | Final CNV calls restricted to high confidence and provided with annotations. |
*_cnv_annotated_supplemental.bed | All CNV calls with annotations. |
*_somatic_snv_indel_annotated_filtered_final.vcf | Final filtered somatic SNVs and InDELs based on Mutect2, Strelka, and supported by Lancet. |
*_somatic_snv_indel_annotated_filtered_supplemental.vcf | Supplementary information from filtered somatic SNVs and InDELs based on Mutect2, Strelka, and supported by Lancet. |
*_somatic_snv_indel_annotated_filtered_final.txt | Text extraction from the VCF of filtered somatic SNVs and InDELs based on Mutect2, Strelka, and supported by Lancet. |
*_somatic_snv_indel_annotated_filtered_final.maf | MAF format converted from the somatic SNVs and InDELs VCF. |
*_sv_annotated_somatic_final.bedpe | Somatic structural variant calls with polished annotations. |
*_sv_annotated_somatic_high_confidence_final.bedpe | Somatic structural variant calls restricted to high confidence calls with polished annotations. |
*_sv_annotated_somatic_supplemental.bedpe | Somatic structural variant calls with all annotations. |
*_sv_annotated_somatic_high_confidence_supplemental.bedpe | Somatic structural variant calls restricted to high confidence calls with all annotations. |
callers/*.bicseq2.png | Bicseq2 output. |
callers/*.bicseq2.txt | Bicseq2 raw CNV calls. |
callers/*_lancet_merged.vcf.gz | Lancet raw SNP/InDEL calls. |
callers/*_manta_candidateSmallIndels.vcf.gz | Manta raw small indel calls. |
callers/*_manta_candidateSV.vcf.gz | Manta raw candidate SV calls. |
callers/*_manta_diploidSV.vcf.gz | Manta raw diploid SV calls. |
callers/*_manta_somaticSV.vcf.gz | Manta raw somatic SV calls. These are the calls that are merged with other SV callers. |
callers/*_mutect2_somatic.filtered.vcf.gz | Mutect2 calls filtered by GATK 'filtermutectcalls'. |
callers/*_strelka_somatic.indels.vcf.gz | Strelka raw InDEL calls. |
callers/*_strelka_somatic.snvs.vcf.gz | Strelka raw SNV calls. |
callers/*_.gripss.filtered.vcf.gz | Gridss SV calls as filtered by Gripss. |
stats/*_mutect2_somatic.filtered.vcf.gz.filteringStats.tsv | Filter Mutect2 Calls QC stats output. |
Naming Convention | Description |
---|---|
pta_report.html | Nextflow autogenerated report. |
trace | Nextflow autogenerated trace report for resource usage in tabular text format. |
multiqc | MultiQC report summarizing quality metrics across samples in the analysis run. |
If the workflow is run with --keep_intermediate true, additional outputs will be saved. This option is only recommended for debugging purposes.