Mouse PTA ReadMe
For all input samples:
• Fastp read quality and adapter trimming
• Get Read Group Information
• BWA-MEM Alignment
• Picard SortSam and Mark Duplicates
• Collect Alignment Summary Metrics
If paired sample:
• Germline variant calling
• Germline variant filtering
• Germline variant annotation
For all samples (C57L_J used if unpaired):
• GATK Mutect2 variant calling
• Filter Mutect2 calls
• Lancet variant calling
• Svaba SV calling
• Manta SV calling
• Strelka2 SNV and InDEL calling
• Lumpy SV calling
• Delly SV calling
• Delly CNV calling
• VCF Merge prep steps. A number of scripts are applied to ensure samples within VCF files are properly oriented prior to merging, and calls are properly formatted
• Intergenic variant rescue and confirmation via Lancet. A number of scripts are applied to determine if intergenic variants are supported by the caller Lancet
• SNV VCF files are merged across all SNV callers
• VEP annotation of merged SNVs
• Delly CNV calls are annotated
• SV calls are merged across callers
• SV calls are annotated with known insertions, deletions, inversions, and exclusion regions. Annotation is done at 80% overlap between the called SV event and the known event size.
• SV calls are annotated with CNV regions
• A final filter is applied to CNV annotated SV calls
• MultiQC report generation
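The 80% overlap rule used in SV annotation can be illustrated with a small sketch. It assumes half-open `(start, end)` intervals on the same chromosome and that the overlap fraction is measured against the known event's length. The function names are hypothetical, and this is not the pipeline's own annotation script.

```python
def overlap_fraction(call, known):
    """Fraction of the known event that is covered by the called SV.

    Both arguments are (start, end) tuples, half-open, on the same chromosome.
    Measuring against the known event's size is an assumption of this sketch.
    """
    call_start, call_end = call
    known_start, known_end = known
    overlap = min(call_end, known_end) - max(call_start, known_start)
    known_len = known_end - known_start
    if overlap <= 0 or known_len <= 0:
        return 0.0
    return overlap / known_len


def matches_known_event(call, known, min_fraction=0.8):
    """True when the call covers at least min_fraction of the known event."""
    return overlap_fraction(call, known) >= min_fraction
```

For instance, a call at `(100, 200)` covers 90% of a known event at `(110, 210)` and would be annotated under this rule, while a known event at `(150, 250)` is only half covered and would not.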
flowchart TD
p00((CSV Sample Sheet))
p01[PARSE_SAMPLE_SHEET:CONCATENATE_READS]
p00 --> p01
p01 --> |Tumor Sample| t02
p01 --> |"`Normal Sample:
**If Normal Sample
Provided**`"| n02
subgraph tumor [ ]
t02[FASTP]
t03[BWA_MEM]
t04[PICARD_SORTSAM]
t05[PICARD_MARKDUPLICATES]
%% NOTE: BaseRecalibrator and BQSR requires known sites to recal around.
to1([Tumor Genomic Bam]):::output
t02 --> t03
t03 --> t04
t04 --> t05
t05 --> to1
end
subgraph normal [ ]
n02[FASTP]
n03[BWA_MEM]
n04[PICARD_SORTSAM]
n05[PICARD_MARKDUPLICATES]
%% REMOVE DUPE
%% Indel realigner from GATK
%% NOTE: BaseRecalibrator and BQSR requires known sites to recal around.
no1([Normal Genomic Bam]):::output
n02 --> n03
n03 --> n04
n04 --> n05
n05 --> no1
end
no1 -..-> |If Normal\nSample Provided| m1
altBAM -..-> |If No Normal\nSample Provided| m1
to1 --> m1
altBAM[(ALT\nBAM)]
m1 --> p35
m1 --> p39
m1 --> p41
m1 --> p42
m1 --> p43
m1 --> p44
m1 --> p54
m1 --> p57.1
m1((Join:\nTumor & Normal))
subgraph germline [ ]
no1 -..-> |If Normal\nSample Provided|p20
p20[GATK_HAPLOTYPECALLER_SV_GERMLINE]
p21[GATK_SORTVCF_GERMLINE]
p27[BCFTOOLS_GERMLINE_FILTER]
%% p28[BCFTOOLS_SPLITMULTIALLELIC_REGIONS]
%% p29[VEP_GERMLINE]
%% p30[BCFTOOLS_REMOVESPANNING]
%% p33[SNPSIFT_ANNOTATE_DBSNP_GERMLINE]
%% p34[GERMLINE_VCF_FINALIZATION]
annot_summary1{{Germline Annotation via VEP\nSteps truncated for figure clarity}}
o3([Germline Variants]):::output
o4([Annotated Germline Variants]):::output
p20 --> p21
p21 --> p27
p27 --> o3
o3 --> annot_summary1
subgraph germline_annot [ ]
%% p27 --> p28
%% p28 --> p29
%% p29 --> p30
%% p30 --> p31
%% p31 --> p32
%% p32 --> p33
%% p33 --> p34
%% p34 --> o4
annot_summary1 --> o4
end
end
subgraph somatic_variant [ ]
p35[GATK_MUTECT2]
p36[GATK_SORTVCF_MUTECT]
p37[GATK_MERGEMUTECTSTATS]
p38[GATK_FILTERMUECTCALLS]
o5([Mutect2 SNV Calls]):::output
p44[LANCET]
p45[GATK_SORTVCF_LANCET]
oo6([Lancet SNV Calls]):::output
p39[DELLY_SOMATIC]
p40[DELLY_FILTER_SOMATIC]
o6([DELLY SV Calls]):::output
p41[MANTA]
o7([Manta SV and SNV Calls]):::output
p42[STRELKA2]
o8([Strelka2 SNV Calls]):::output
p43[SMOOVE]
o9([SMOOVE_Lumpy SV Calls]):::output
p54[DELLY_CNV_SOMATIC]
p55[BCFTOOLS_MERGE_DELLY_CNV]
p56[DELLY_CLASSIFY]
p57[BCFTOOLS_QUERY_DELLY_CNV]
o10([DELLY CNV Calls]):::output
p57.1[SVABA]
o10.1([SVABA SV and SNV Calls]):::output
p35 --> p36
p36 --> p37
p37 --> p38
p38 --> o5
p39 --> p40
p40 --> o6
p41 --> o7
o7 --> p42
p42 --> o8
p43 --> o9
p44 --> p45
p45 --> oo6
p54 --> p55
p55 --> p56
p56 --> p57
p57 --> o10
p57.1 --> o10.1
note1{{Merge prep.\nSteps truncated}}
note2{{Merge prep.\nSteps truncated}}
note3{{Merge prep.\nSteps truncated}}
note3.5{{Merge prep.\nSteps truncated}}
note3.6{{Merge prep.\nSteps truncated}}
%% p57[RENAME_METADATA]
%% p58[MERGE_PREP]
%% p59[RENAME_VCF]
%% p60[COMPRESS_INDEX_VCF]
%% p61[BCFTOOLS_SPLITMULTIALLELIC]
%% p62[SPLIT_MNV]
%% p63[GATK_SORTVCF_TOOLS]
o5 --> note1
oo6 --> note2
o8 --> note3
o7 --> note3.5
o10.1 --> note3.6
end
note1 --> p64
note2 --> p64
note3 --> p64
note3.5 --> p64
note3.6 --> p64
p64[BCFTOOLS_MERGECALLERS]
%% p65[COMPRESS_INDEX_VCF_ALL_CALLERS]
p64 --> note4
subgraph lancet_confirm [ ]
%% NOTE: There are many lancet confirm steps. First prep, then confirm, then re-merge.
%% p66[BEDTOOLS_STARTCANDIDATES]
%% p67[GET_CANDIDATES]
%% p68[COMPRESS_INDEX_VCF_REGION]
%% p69[VCF_TO_BED]
%% p70[LANCET_CONFIRM]
%% p71[COMPRESS_INDEX_VCF_REGION_LANCET]
%% p72[BCFTOOLS_INTERSECTVCFS]
%% p73[RENAME_METADATA_LANCET]
%% p74[MERGE_PREP_LANCET]
%% p75[RENAME_VCF_LANCET]
%% p76[COMPRESS_INDEX_VCF_LANCET]
%% p77[BCFTOOLS_SPLITMULTIALLELIC_LANCET]
%% p78[SPLIT_MNV_LANCET]
%% p79[REMOVE_CONTIG]
%% p80[GATK_SORTVCF_TOOLS_LANCET]
%% p81[BCFTOOLS_MERGECALLERS_FINAL]
%% p82[COMPRESS_INDEX_VCF_MERGED]
%% p83[MERGE_COLUMNS]
%% p84[ADD_NYGC_ALLELE_COUNTS]
%% p85[ADD_FINAL_ALLELE_COUNTS]
%% p86[FILTER_PON]
%% p87[FILTER_VCF]
%% p88[SNV_TO_MNV_FINAL_FILTER]
%% p89[GATK_SORTVCF_SOMATIC]
%% p90[REORDER_VCF_COLUMNS]
%% p91[COMPRESS_INDEX_MERGED_VCF]
note0[[The steps in this subgraph are truncated for clarity\nManta used as 'support' calls]]
note4{{Extract non exonic variant calls}}
note5{{Confirm non exonic variant calls with Lancet}}
note6{{Merge Lancet confirmed to exonic calls}}
note4 --> note5
note5 --> note6
%% note6 --> note7
end
subgraph snv_annotate [ ]
p92[VEP_SOMATIC]
p95[SNPSIFT_ANNOTATE_DBSNP_SOMATIC]
p96[SOMATIC_VCF_FINALIZATION]
o11([Annotated filtered\nSomatic SNV and InDELs Calls]):::output
%% p105[:FILTER_BEDPE]
%% p106[:FILTER_BEDPE_SUPPLEMENTAL]
note6 --> p92
p92 --> p95
p95 --> p96
p96 --> o11
%% note2{{}}
end
subgraph cnv_sv_annotate [ ]
o10 --> p97
p97[ANNOTATE_DELLY_CNV]
o12([Annotated CNV Regions]):::output
o7 --> p98
o9 --> p98
o6 --> p98
o10.1 --> p98
p98[MERGE_SV]
p99[ANNOTATE_SV]
p100[ANNOTATE_SV_SUPPLEMENTAL]
p101[ANNOTATE_GENES_SV]
p102[ANNOTATE_GENES_SV_SUPPLEMENTAL]
p103[ANNOTATE_SV_WITH_CNV]
p104[ANNOTATE_SV_WITH_CNV_SUPPLEMENTAL]
o13([Annotated SV Calls]):::output
o14([Annotated SV]):::output
p97 --> o12
p98 --> p99
p99 --> p100
p100 --> p101
p101 --> p102
p102 --> o13
p102 --> p103
p97 --> p103
p103 --> p104
p104 --> o14
end
o11 ~~~ note42
subgraph qc [ ]
temp0((Fastq Files\nFrom Above))
temp1((Genomic BAMs\nFrom Above))
temp2((Logs from:\nBWA\nTrimming\nMark Duplicates))
p03[FASTQC]
p13[PICARD_COLLECTALIGNMENTSUMMARYMETRICS]
p14[PICARD_COLLECTWGSMETRICS]
p142[MULTIQC]
o15([MultiQC Report]):::output
note42[[For clarity\nQC steps not connected to main graph]]
temp0 --> p03
temp1 --> p13
temp1 --> p14
temp2 --> p142
p03 --> p142
p13 --> p142
p14 --> p142
p142 --> o15
%% n02 --> p03
%% t02 --> p03
%% to1 --> p13
%% to1 --> p14
%% no1 --> p13
%% no1 --> p14
end
classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000
style tumor stroke:#333,stroke-width:2px
style normal stroke:#333,stroke-width:2px
style germline stroke:#333,stroke-width:2px
style somatic_variant stroke:#333,stroke-width:2px
style lancet_confirm stroke:#333,stroke-width:2px
style snv_annotate stroke:#333,stroke-width:2px
style cnv_sv_annotate stroke:#333,stroke-width:2px
style qc stroke:#333,stroke-width:2px
-
--pubdir
- Default:
/<PATH>
- Comment: The directory in which the saved outputs will be stored.
- Default:
-
--organize_by
- Default:
sample
- Comment: How to organize the output folder structure. Options: sample or analysis.
- Default:
-
--cacheDir
- Default:
/projects/omics_share/meta/containers
- Comment: This is the directory that contains cached Singularity containers. JAX users should not change this parameter.
- Default:
-
-w
- Default:
/<PATH>
- Comment: The working directory used for all intermediate files and Nextflow processes. This directory can become quite large. It should be a location on /fastscratch or another directory with ample storage.
- Default:
-
--csv_input
- Default:
/<FILE_PATH>
- Comment: Comma-delimited sample sheet that controls how samples are processed. The required input header is: patient,sex,status,sampleID,lane,fastq_1,fastq_2. See the note below on this page for additional information on the file format.
- Default:
-
--deduplicate_reads
- Default:
false
- Comment: Options: false, true. If specified, run bbmap clumpify on input reads. Clumpify will deduplicate reads prior to trimming. This can help with mapping and downstream steps when analyzing high coverage WGS data.
- Default:
-
--coverage_cap
- Default:
null
- Comment: If an integer value is specified, jvarkit 'Biostar154220' is used to cap coverage at that value. See: http://lindenb.github.io/jvarkit/Biostar154220.html
- Default:
-
--primary_chrom_bed
- Default:
'/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/Mus_musculus.GRCm39.dna.primary_assembly.bed'
- Comment: A BED file containing the primary chromosomes with positions. Used to limit jvarkit 'Biostar154220' to regions with expected coverage.
- Default:
-
--split_fastq
- Default:
false
- Comment: If specified, FASTQ files will be split into chunks sized based on split_fastq_bin_size prior to mapping. This option is recommended for high coverage data.
- Default:
-
--split_fastq_bin_size
- Default:
10000000
- Comment: If split_fastq is specified, FASTQ files will be split into chunks of this size prior to mapping.
- Default:
-
--quality_phred
- Default:
15
- Comment: The quality value that is required for a base to pass. Default: 15 which is a phred quality score of >=Q15.
- Default:
-
--unqualified_perc
- Default:
40
- Comment: Percent of bases that are allowed to be unqualified (0~100). Default: 40 which is 40%.
- Default:
-
--detect_adapter_for_pe
- Default:
false
- Comment: If true, adapter auto-detection is used for paired-end data. By default, adapter sequence auto-detection is disabled for paired-end data because the adapters can be trimmed by overlap analysis; --detect_adapter_for_pe enables it. Fastp runs a little slower when adapter sequences are specified or auto-detection is enabled, but this usually results in slightly cleaner output, since the overlap analysis may fail due to sequencing errors or adapter dimers.
- Default:
-
--ref_fa
- Default:
'/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/ensembl/GRCm39.p0/Mus_musculus.GRCm39.dna.primary_assembly.fa'
- Comment: The reference fasta to be used throughout the process for alignment as well as any downstream analysis, points to human reference when --gen_org human. JAX users should not change this parameter.
- Default:
-
--ref_fa_indices
- Default:
'/projects/compsci/omics_share/mouse/GRCm39/genome/indices/ensembl/v105/bwa/Mus_musculus.GRCm39.dna.primary_assembly.fa'
- Comment: Pre-compiled BWA index files. JAX users should not change this parameter.
- Default:
-
--ref_fa_dict
- Default:
'/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/ensembl/GRCm39.p0/Mus_musculus.GRCm39.dna.primary_assembly.dict'
- Comment: FASTA dictionary file. JAX users should not change this parameter.
- Default:
-
--combined_reference_set
- Default:
'/projects/compsci/omics_share/mouse/GRCm39/supporting_files/PTA_inputs/combined_ref_set/Mus_musculus.GRCm39.dna.primary_assembly.fa'
- Comment: Several tools (GRIDSS, SVABA) require the reference and BWA index files to be in the same directory. Links are used within this directory to avoid duplicating the FASTA and BWA indices. See the note in the directory.
- Default:
-
--mismatch_penalty
- Default:
-B 8
- Comment: The BWA penalty for a mismatch.
- Default:
-
--dbSNP
- Default:
'/projects/omics_share/mouse/GRCm39/genome/annotation/snps_indels/GCA_000001635.9_current_ids.vcf.g'
- Comment: Used in variant annotation and by SVABA. JAX users should not change this parameter.
- Default:
-
--dbSNP_index
- Default:
'/projects/omics_share/mouse/GRCm39/genome/annotation/snps_indels/GCA_000001635.9_current_ids.vcf.gz.tbi'
- Comment: Index associated with the dbsnp file.
- Default:
-
--chrom_contigs
- Default:
'/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/ensembl/GRCm39.p0/Mus_musculus.GRCm39.dna.primary_assembly.primaryChr.contig_list'
- Comment: Contig list used for scatter / gather in calling and annotation.
- Default:
-
--chrom_intervals
- Default:
'/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/GRCm39_calling_intervals/'
- Comment: Chromosome intervals used for scatter gather in calling.
- Default:
-
--call_val
- Default:
50
- Comment: The minimum phred-scaled confidence threshold at which variants should be called.
- Default:
-
--ploidy_val
- Default:
'-ploidy 2'
- Comment: Sample ploidy used by HaplotypeCaller in germline small variant calling.
- Default:
-
--excludeIntervalList
- Default:
'/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/mm39.excluderanges.interval_list'
- Comment: Germline caller exclusion list.
- Default:
-
--intervalListBed
- Default:
'/projects/compsci/omics_share/human/GRCh38/supporting_files/PTA_inputs/filtering/SureSelect_V6plusCOSMIC.target.GRCh38_full_analysis_set_plus_decoy_hla.interval_list.bed'
- Comment: This file is used to extract small variants in non-exonic regions. The pipeline then attempts to recover such calls via Lancet.
- Default:
-
--lancet_beds_directory
- Default:
'/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/lancet_chr_beds/'
- Comment: Lancet interval bed files used in calling by that tool.
- Default:
-
--delly_exclusion
- Default:
'/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/GRCm39_gap_delly_exclusion.txt'
- Comment: Delly CNV calling exclusion list.
- Default:
-
--delly_mappability
- Default:
'/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/mappability/GRCm39.p0.map.gz'
- Comment: Delly CNV calling mappability file.
- Default:
-
--cnv_window
- Default:
10000
- Comment: Delly CNV calling read depth window size. Default value is tool default. This parameter is included for testing purposes only.
- Default:
-
--cnv_min_size
- Default:
10000
- Comment: Delly CNV classification minimum CNV size. Default value is tool default. This parameter is included for testing purposes only.
- Default:
-
--cnv_germline_prob
- Default:
0.00100000005
- Comment: Delly CNV classification germline probability. Default value is tool default. This parameter is included for testing purposes only.
- Default:
-
--callRegions
- Default:
'/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/GRCm39.callregions.bed.gz'
- Comment: Manta calling regions. Provided by the tool developer resource pack.
- Default:
-
--strelka_config
- Default:
'/projects/compsci/omics_share/mouse/GRCm39/supporting_files/PTA_inputs/configs/configureStrelkaSomaticWorkflow.py.ini'
- Comment: Strelka input configuration. Provided by the tool developer resource pack.
- Default:
-
--vep_cache_directory
- Default:
'/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/vep_data'
- Comment: VEP annotation cache. Cache provided is for Ensembl v109.
- Default:
-
--vep_fasta
- Default:
'/projects/compsci/omics_share/mouse/GRCm39/genome/sequence/ensembl/GRCm39.p0/Mus_musculus.GRCm39.dna.primary_assembly.fa'
- Comment: VEP requires an ensembl based fasta. GRCh38.p13 is used for v97-v109.
- Default:
-
--cytoband
- Default:
'/projects/compsci/omics_share/mouse/GRCm39/supporting_files/PTA_inputs/annotations/GRCm38.liftedTo.GRCm39.cytoBand.UCSC.chr.sorted.bed'
- Comment: Cytoband file used in CNV annotations. Derived from UCSC table, lifted from GRCm38 to GRCm39.
- Default:
-
--known_del
- Default:
'/projects/omics_share/mouse/GRCm39/genome/annotation/struct_vars/ferraj_2023_inv_ins_del/variants_freeze5_sv_sym_DEL_mm39_sorted.bed'
- Comment: Used in SV annotation, and filtering. Deletion calls from: https://pubmed.ncbi.nlm.nih.gov/37228752/
- Default:
-
--known_ins
- Default:
'/projects/omics_share/mouse/GRCm39/genome/annotation/struct_vars/ferraj_2023_inv_ins_del/variants_freeze5_sv_sym_INS_mm39_sorted.bed'
- Comment: Used in SV annotation, and filtering. Insertion calls from: https://pubmed.ncbi.nlm.nih.gov/37228752/
- Default:
-
--known_inv
- Default:
'/projects/omics_share/mouse/GRCm39/genome/annotation/struct_vars/ferraj_2023_inv_ins_del/variants_freeze5_sv_sym_INV_mm39_sorted.bed'
- Comment: Used in SV annotation, and filtering. Inversion calls from: https://pubmed.ncbi.nlm.nih.gov/37228752/
- Default:
-
--ensemblUniqueBed
- Default:
'/projects/compsci/omics_share/mouse/GRCm39/supporting_files/PTA_inputs/annotations/ensembl_genes_unique_sorted.final.v110.chr.sorted.bed'
- Comment: File used in CNV and SV annotation.
- Default:
-
--gap
- Default:
'/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/GRCm39_gap.bed'
- Comment: File used in SV annotation. From UCSC table browser.
- Default:
-
--exclude_list
- Default:
'/projects/compsci/omics_share/mouse/GRCm39/genome/annotation/intervals/mm39.excluderanges_cleaned.bed'
- Comment: File used in SV annotation. From: https://dozmorovlab.github.io/excluderanges/.
- Default:
-
--proxy_normal_bam
- Default:
'/projects/omics_share/mouse/GRCm39/supporting_files/PTA_inputs/C57L_J/C57L_J_dedup.bam'
- Comment: Proxy BAM file. Used in un-paired sample analysis. C57L_J at 30x is used by default.
- Default:
-
--proxy_normal_bai
- Default:
'/projects/omics_share/mouse/GRCm39/supporting_files/PTA_inputs/C57L_J/C57L_J_dedup.bam.bai'
- Comment: Proxy BAM index file. Used in un-paired sample analysis. C57L_J at 30x is used by default.
- Default:
-
--proxy_normal_sampleName
- Default:
'C57L_J'
- Comment: Proxy sample name within the proxy BAM file. C57L_J used by default.
- Default:
-
--read_type
- Default:
PE
- Comment: Only 'PE' is accepted for this workflow.
- Default:
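To make the interaction of `--quality_phred` and `--unqualified_perc` concrete, the following minimal sketch applies the two thresholds to a single read as described in the comments above. The function name is hypothetical, and the handling of the exact boundary (a read with exactly 40% unqualified bases is kept) and of empty reads are assumptions, not a statement of how fastp is implemented.

```python
def read_passes_quality_filter(phred_scores, quality_phred=15, unqualified_perc=40):
    """Return True if a read would be kept under the two thresholds.

    A base is "unqualified" when its Phred score is below quality_phred.
    The read fails when unqualified bases exceed unqualified_perc percent
    of its length.
    """
    if not phred_scores:
        return False  # assumption: an empty read is treated as failing
    unqualified = sum(1 for q in phred_scores if q < quality_phred)
    # Integer comparison avoids floating point rounding at the boundary.
    return unqualified * 100 <= unqualified_perc * len(phred_scores)
```

With the defaults, a 10-base read with four bases below Q15 is kept, while a read with five such bases is discarded.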
The required input header is: `patient,sex,status,sampleID,lane,fastq_1,fastq_2`. Samples can be provided either paired or un-paired.
- The `patient` column defines how samples are paired. All combinations of normal and tumor samples that share the same `patient` ID will be paired.
- The `sex` column is unused in the workflow at this time.
- The `status` column defines whether each sample is `'normal': 0` or `'tumor': 1`.
- The `sampleID` column is a unique identifier for each individual sample, which is associated with other samples based on status and patient ID.
- The `lane` column contains lane information for individual samples. If a single sample ID is provided with multiple lanes, the sequences from each lane will be concatenated prior to analysis.
- The `fastq_1` and `fastq_2` columns must contain absolute paths to read 1 and read 2 from an Illumina paired-end sequencing run.
An example paired analysis:
patient,sex,status,sampleID,lane,fastq_1,fastq_2
SAMPLE_42,XX,0,NORMAL_1,L1,/path/to/normal1_R1.fastq.gz,/path/to/normal1_R2.fastq.gz
SAMPLE_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz
In the example case above the following output directories will be generated:
SAMPLE_42--NORMAL_1: Contains all NORMAL_1 sample specific files
SAMPLE_42--TUMOR_1: Contains all TUMOR_1 specific files
SAMPLE_42--TUMOR_1--NORMAL_1: Contains all TUMOR_1 by NORMAL_1 specific files
Additional information on outputs is provided below.
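The column rules described above can be checked before launching a run. The sketch below is one possible pre-flight check, not part of the pipeline: the function name and the error messages are hypothetical, and it only tests the header, the `status` values, and that the FASTQ paths are absolute POSIX paths.

```python
import csv
import io
import posixpath

REQUIRED_HEADER = ["patient", "sex", "status", "sampleID", "lane", "fastq_1", "fastq_2"]


def validate_sample_sheet(text):
    """Return a list of human-readable problems found in the sheet text."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != REQUIRED_HEADER:
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if row["status"] not in ("0", "1"):
            problems.append(f"line {line_no}: status must be 0 (normal) or 1 (tumor)")
        for column in ("fastq_1", "fastq_2"):
            if not posixpath.isabs(row[column] or ""):
                problems.append(f"line {line_no}: {column} must be an absolute path")
    return problems
```

An empty list means no problems were found by these checks; it does not guarantee that the files exist or that the pipeline will accept the sheet.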
patient,sex,status,sampleID,lane,fastq_1,fastq_2
SAMPLE_42,XX,0,NORMAL_1,L1,/path/to/normal1_LANE1_R1.fastq.gz,/path/to/normal1_LANE1_R2.fastq.gz
SAMPLE_42,XX,0,NORMAL_1,L2,/path/to/normal1_LANE2_R1.fastq.gz,/path/to/normal1_LANE2_R2.fastq.gz
SAMPLE_42,XX,0,NORMAL_1,L3,/path/to/normal1_LANE3_R1.fastq.gz,/path/to/normal1_LANE3_R2.fastq.gz
SAMPLE_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz
In the example case above, the three lanes provided for the normal sample will be concatenated, and the concatenated reads will be passed forward for analysis. Samples with a single lane will be passed forward for analysis as provided. A mix of multi-lane and single-lane samples can be provided.
The following output directories will be generated:
SAMPLE_42--NORMAL_1: Contains all NORMAL_1 sample specific files
SAMPLE_42--TUMOR_1: Contains all TUMOR_1 specific files
SAMPLE_42--TUMOR_1--NORMAL_1: Contains all TUMOR_1 by NORMAL_1 specific files
Additional information on outputs is provided below.
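The lane handling described above amounts to grouping sample sheet rows by sample and concatenating any group with more than one row. A minimal sketch of that grouping follows; the function names are hypothetical, and the pipeline's own concatenation step may differ in detail.

```python
import csv
import io
from collections import defaultdict


def group_lanes(sheet_text):
    """Map (patient, sampleID) to its list of (fastq_1, fastq_2) pairs, in sheet order."""
    lanes = defaultdict(list)
    for row in csv.DictReader(io.StringIO(sheet_text)):
        lanes[(row["patient"], row["sampleID"])].append((row["fastq_1"], row["fastq_2"]))
    return dict(lanes)


def needs_concatenation(lanes):
    """True for every sample that was provided with more than one lane."""
    return {key: len(pairs) > 1 for key, pairs in lanes.items()}
```

Applied to a sheet with three lanes for `NORMAL_1` and one lane for `TUMOR_1`, only the normal sample is flagged for concatenation.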
patient,sex,status,sampleID,lane,fastq_1,fastq_2
SAMPLE_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz
Note: When a tumor sample is provided without a matched normal, a proxy normal sample is used in somatic small variant calling, somatic structural variant calling, and CNV calling. Germline calling is not done on the proxy normal sample. A mix of samples with and without pairs can also be provided. By default, a C57L_J sample at 30x coverage is used, and the rest of this document assumes that C57L_J was used.
The output directory structure of tumor only samples will be as follows:
SAMPLE_42--TUMOR_1: Contains all TUMOR_1 specific files
SAMPLE_42--TUMOR_1--C57L_J: Contains all TUMOR_1 by proxy C57L_J specific files
Additional information on outputs is provided below.
patient,sex,status,sampleID,lane,fastq_1,fastq_2
SAMPLE_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz
PATIENT_101,XX,0,NORMAL_1,L1,/path/to/normal1_R1.fastq.gz,/path/to/normal1_R2.fastq.gz
PATIENT_101,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz
The output directory structure of samples will be as follows:
SAMPLE_42--TUMOR_1: Contains all TUMOR_1 specific files
SAMPLE_42--TUMOR_1--C57L_J: Contains all TUMOR_1 by C57L_J specific files
PATIENT_101--NORMAL_1: Contains all NORMAL_1 sample specific files
PATIENT_101--TUMOR_1: Contains all TUMOR_1 specific files
PATIENT_101--TUMOR_1--NORMAL_1: Contains all TUMOR_1 by NORMAL_1 specific files
Additional information on outputs is provided below.
The workflow supports one-to-many, many-to-one, and many-to-many pairings of normal and tumor samples.
An example one to many analysis:
patient,sex,status,sampleID,lane,fastq_1,fastq_2
SAMPLE_42,XX,0,NORMAL_1,L1,/path/to/normal1_R1.fastq.gz,/path/to/normal1_R2.fastq.gz
SAMPLE_42,XX,1,TUMOR_1,L1,/path/to/tumor1_R1.fastq.gz,/path/to/tumor1_R2.fastq.gz
SAMPLE_42,XX,1,TUMOR_2,L1,/path/to/tumor2_R1.fastq.gz,/path/to/tumor2_R2.fastq.gz
SAMPLE_42,XX,1,TUMOR_3,L1,/path/to/tumor3_R1.fastq.gz,/path/to/tumor3_R2.fastq.gz
In one-to-many, many-to-one, and many-to-many cases, all combinations of samples will be processed against one another.
In the example case above the following output directories will be generated:
SAMPLE_42--NORMAL_1: Contains all NORMAL_1 sample specific files
SAMPLE_42--TUMOR_1: Contains all TUMOR_1 specific files
SAMPLE_42--TUMOR_2: Contains all TUMOR_2 specific files
SAMPLE_42--TUMOR_3: Contains all TUMOR_3 specific files
SAMPLE_42--TUMOR_1--NORMAL_1: Contains all TUMOR_1 by NORMAL_1 specific files
SAMPLE_42--TUMOR_2--NORMAL_1: Contains all TUMOR_2 by NORMAL_1 specific files
SAMPLE_42--TUMOR_3--NORMAL_1: Contains all TUMOR_3 by NORMAL_1 specific files
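The directory listings above follow a regular pattern: one directory per sample, plus one per tumor and normal combination within a patient, with the proxy normal standing in when no normal is provided. The sketch below reproduces that naming under those assumptions. The function name is hypothetical, and its behavior for a patient with only normal samples is a guess rather than documented behavior.

```python
def expected_output_dirs(samples, proxy_normal="C57L_J"):
    """List expected output directory names.

    samples: iterable of (patient, status, sampleID), with status 0 for
    normal and 1 for tumor, as in the sample sheet.
    """
    by_patient = {}
    for patient, status, sample_id in samples:
        group = by_patient.setdefault(patient, {"normal": [], "tumor": []})
        key = "normal" if status == 0 else "tumor"
        if sample_id not in group[key]:
            group[key].append(sample_id)  # multiple lanes share one sampleID
    dirs = []
    for patient, group in by_patient.items():
        # Per-sample directories for the samples actually provided.
        for sample_id in group["normal"] + group["tumor"]:
            dirs.append(f"{patient}--{sample_id}")
        # Paired directories; fall back to the proxy normal if none was given.
        normals = group["normal"] or [proxy_normal]
        for tumor in group["tumor"]:
            for normal in normals:
                dirs.append(f"{patient}--{tumor}--{normal}")
    return dirs
```

Called with the one-to-many example, this returns the seven directory names listed above; called with a single tumor-only row, it returns `SAMPLE_42--TUMOR_1` and `SAMPLE_42--TUMOR_1--C57L_J`.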
NOTE: `*` represents a wildcard that is a placeholder for values that will be filled by input file names and/or parameters when the pipeline is run.
NOTE: All files contained in 'stats' directories are captured by MultiQC reports.
The pipeline writes several output directories, each holding files that apply to an individual sample or to a combination of samples.
Following the example naming in the CSV section above for "an example paired analysis":
Naming Convention | Description |
---|---|
`*_haplotypecaller.gatk.filtered.vcf.gz` | Final filtered SNP and InDEL calls from HaplotypeCaller. |
`*_germline_snv_indel_annotated_filtered_final.vcf` | Final filtered SNP and InDEL calls from HaplotypeCaller with VEP annotations. |
`bam/*_dedup.bam` | Final duplicate marked bam file used in calling. |
`bam/*_dedup.bai` | Bam index file. |
`stats/*_stat` | BWA alignment metrics. |
`stats/*_AlignmentMetrics.txt` | GATK Alignment metrics. |
`stats/*_CollectWgsMetrics.txt` | GATK collect WGS metrics output. |
`stats/*_dup_metrics.txt` | Picard mark duplicates output. |
`stats/*_R1.fastq.gz_filtered_trimmed_fastqc.html` | FastQC report. |
`stats/*_R2.fastq.gz_filtered_trimmed_fastqc.html` | FastQC report. |
`stats/*R1.fastq.gz_filtered_trimmed_fastqc.zip` | FastQC report. |
`stats/*R2.fastq.gz_filtered_trimmed_fastqc.zip` | FastQC report. |
NOTE: When tumor-only samples are run, there will be no `<PATIENT>--C57L_J` directory output, as files associated with C57L_J alone are not relevant.
Naming Convention | Description |
---|---|
`bam/*_dedup.bam` | Final duplicate marked bam file used in calling. |
`bam/*_dedup.bai` | Bam index file. |
`stats/*_stat` | BWA alignment metrics. |
`stats/*_AlignmentMetrics.txt` | GATK Alignment metrics. |
`stats/*_CollectWgsMetrics.txt` | GATK collect WGS metrics output. |
`stats/*_dup_metrics.txt` | Picard mark duplicates output. |
`stats/*_R1.fastq.gz_filtered_trimmed_fastqc.html` | FastQC report. |
`stats/*_R2.fastq.gz_filtered_trimmed_fastqc.html` | FastQC report. |
`stats/*R1.fastq.gz_filtered_trimmed_fastqc.zip` | FastQC report. |
`stats/*R2.fastq.gz_filtered_trimmed_fastqc.zip` | FastQC report. |
Naming Convention | Description |
---|---|
`*_cnv_annotated_final.bed` | Final CNV calls restricted to high confidence calls, with annotations. |
`*_cnv_annotated_supplemental.bed` | All CNV calls with annotations. |
`*_somatic_snv_indel_annotated_filtered_final.vcf` | Final filtered somatic SNVs and InDELs based on Mutect2, Strelka, Svaba, and supported by Lancet. |
`*_somatic_snv_indel_annotated_filtered_supplemental.vcf` | Supplementary information from filtered somatic SNVs and InDELs based on Mutect2, Strelka, Svaba, and supported by Lancet. |
`*_somatic_snv_indel_annotated_filtered_final.txt` | Text extraction from the VCF of filtered somatic SNVs and InDELs based on Mutect2, Strelka, Svaba, and supported by Lancet. |
`*_manta_lumpy_delly_svaba_sv_annotated_genes_cnv.bedpe` | Somatic structural variant calls pre-filtering. |
`*_manta_lumpy_delly_svaba_sv_annotated_genes_cnv_supplemental.bedpe` | Supplementary somatic structural variant information for pre-filtered calls. |
`*_sv_annotated_somatic_final.bedpe` | Somatic structural variant calls with polished annotations. |
`*_sv_annotated_somatic_high_confidence_final.bedpe` | Somatic structural variant calls restricted to high confidence calls with polished annotations. |
`*_sv_annotated_somatic_supplemental.bedpe` | Somatic structural variant calls with all annotations. |
`*_sv_annotated_somatic_high_confidence_supplemental.bedpe` | Somatic structural variant calls restricted to high confidence calls with all annotations. |
`callers/*_delly_somatic_cnv_classified.bcf` | Raw Delly CNV classification in BCF format. |
`callers/*_delly_somatic_cnv_segmentation.bed` | Delly CNV segmentation regions in BED format, converted from BCF. |
`cnv_plots/*.png` | Delly CNV plots by chromosome and genome wide. |
`callers/*_delly_filtered_somaticSV.vcf.gz` | Delly filtered somatic SV calls. |
`callers/*_lancet_merged.vcf.gz` | Lancet raw SNP/InDEL calls. |
`callers/*_manta_candidateSmallIndels.vcf.gz` | Manta raw small indel calls. |
`callers/*_manta_candidateSV.vcf.gz` | Manta raw candidate SV calls. |
`callers/*_manta_diploidSV.vcf.gz` | Manta raw diploid SV calls. |
`callers/*_manta_somaticSV.vcf.gz` | Manta raw somatic SV calls. These are the calls that are merged with other SV callers. |
`callers/*_mutect2_somatic.filtered.vcf.gz` | Mutect2 calls filtered by GATK 'filtermutectcalls'. |
`callers/*-smoove.genotyped.vcf.gz` | Smoove (Lumpy) raw SV calls. |
`callers/*_strelka_somatic.indels.vcf.gz` | Strelka raw InDEL calls. |
`callers/*_strelka_somatic.snvs.vcf.gz` | Strelka raw SNV calls. |
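Because the naming conventions in the table above are wildcard patterns, they can be matched directly against file names when collecting results. The sketch below copies three patterns from the table; the function name, the short descriptions, and the sample file name in the usage note are hypothetical.

```python
from fnmatch import fnmatchcase

# A few patterns copied from the naming-convention table above.
OUTPUT_PATTERNS = {
    "*_somatic_snv_indel_annotated_filtered_final.vcf": "final somatic SNVs and InDELs",
    "*_sv_annotated_somatic_high_confidence_final.bedpe": "high confidence somatic SVs",
    "*_cnv_annotated_final.bed": "final annotated CNV calls",
}


def describe_output(file_name):
    """Return a short description for a known output file, or None."""
    for pattern, description in OUTPUT_PATTERNS.items():
        # fnmatchcase keeps matching case-sensitive on every platform.
        if fnmatchcase(file_name, pattern):
            return description
    return None
```

For a hypothetical file named `TUMOR_1--NORMAL_1_cnv_annotated_final.bed`, `describe_output` returns the CNV description; unrecognized names return `None`.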
Naming Convention | Description |
---|---|
`pta_report.html` | Nextflow autogenerated report. |
`trace` | Nextflow autogenerated trace report for resource usage in tabular text format. |
`multiqc` | MultiQC report summarizing quality metrics across samples in the analysis run. |
If the workflow is run with `--keep_intermediate true`, additional outputs will be saved. This option is recommended only for debugging purposes.