Fingerprint Amplicon Pipeline ReadMe
NOTE: This pipeline is designed to work with the IDT xGen sample identification amplicon panel. It could also be used with other IDT amplicon panels, provided that IDT supplies the files required by the primer trimmer Primerclip. The pipeline is adapted from the IDT best practices described in the Primerclip application note.
• Step 1: Trim FASTQ reads
• Step 2: Map reads with BWA
• Step 3: Sort alignments and trim amplicon primers from them with Primerclip
• Step 4: Calculate target coverage
• Step 5: Base recalibration
• Step 6: Apply base recalibration
• Step 7: Call variants with GATK HaplotypeCaller
• Step 8: Annotate with dbSNP information
• Step 9: Generate fingerprint report
• Step 10: MultiQC
flowchart TD
p0((Sample))
p1[CUTADAPT]
p2[FASTQC]
p4[BWA_MEM]
p5[SAMTOOLS_SORT_PRIMERCLIP]
p6[PRIMERCLIP]
p7[SAMTOOLS_SORT_CALLING]
o1([Genomic BAM]):::output
p8[PICARD_COLLECTTARGETPCRMETRICS]
p9[TARGET_COVERAGE_METRICS]
p10[GATK_BASERECALIBRATOR]
p11[GATK_APPLYBQSR]
p12[GATK_HAPLOTYPECALLER]
o2([Raw Variant Calls]):::output
p13[SNPSIFT_ANNOTATE]
o3([Annotated Variant Calls]):::output
p14[GENERATE_FINGERPRINT_REPORT]
o4([Fingerprint Genotype Report]):::output
o5([Off Panel Variant Calls]):::output
p15[MULTIQC]
o6([MultiQC Report]):::output
p0 --> |Raw Reads| p1
subgraph alignment [ ]
p1 --> p4
p4 --> p5
p5 --> p6
p6 --> p7
p7 --> p10
p10 --> p11
p11 --> o1
end
subgraph calling [ ]
o1 --> p12
p12 --> o2
o2 --> p13
p13 --> o3
o3 --> p14
p14 --> o4
p14 --> o5
end
subgraph qc [ ]
p1 --> p2
o1 --> p8
o1 --> p9
p1 --> p15
p2 --> p15
p6 --> p15
p8 --> p15
p9 --> p15
p10 --> p15
p15 --> o6
end
classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000
style alignment stroke:#333,stroke-width:2px
style calling stroke:#333,stroke-width:2px
style qc stroke:#333,stroke-width:2px
- --pubdir
  - Default: /<PATH>
  - Comment: The directory where the saved outputs will be stored.
- --organize_by
  - Default: sample
  - Comment: How to organize the output folder structure. Options: sample or analysis.
- --cacheDir
  - Default: /projects/omics_share/meta/containers
  - Comment: The directory that contains cached Singularity containers. JAX users should not change this parameter.
- -w
  - Default: /<PATH>
  - Comment: The directory that all intermediary files and Nextflow processes use. This directory can become quite large, so it should be a location on /fastscratch or another directory with ample storage.
- --sample_folder
  - Default: /<PATH>
  - Comment: The path to the folder that contains all the samples to be run by the pipeline. The files in this path can also be symbolic links.
- --extension
  - Default: .fastq.gz
  - Comment: The expected extension for the input read files.
- --pattern
  - Default: "*_R{1,2}*"
  - Comment: The expected R1 / R2 matching pattern. The default value will match reads with names like READ_NAME_R1_MoreText.fastq.gz or READ_NAME_R1.fastq.gz.
- --read_type
  - Default: PE
  - Comment: Type of reads: paired end (PE) or single end (SE). Options: PE and SE.
- --concat_lanes
  - Default: false
  - Comment: Options: false and true. If this boolean is specified, FASTQ files will be concatenated by sample. Used in cases where samples are divided across individual sequencing lanes.
- --csv_input
  - Default: null
  - Comment: Provide a CSV manifest file with the header "sampleID,lane,fastq_1,fastq_2". See below for an example file. fastq_2 is optional and used only for PE data. FASTQ files can be either absolute paths to local files or URLs to remote files. If remote URLs are provided, --download_data can be specified.
- --download_data
  - Default: null
  - Comment: Requires --csv_input. When specified, read data in the CSV manifest will be downloaded from the provided URLs with Aria2.
- --gen_org
  - Default: human
  - Comment: Options: human.
- --multiqc_config
  - Default: /<PATH>
  - Comment: The path to amplicon.yaml, the configuration file used while running MultiQC.
- --cutadaptMinLength
  - Default: 20
  - Comment: The minimum read length. Processed reads shorter than this are discarded.
- --cutadaptQualCutoff
  - Default: 20
  - Comment: The quality cutoff used to trim low-quality ends from reads.
- --cutadaptAdapterR1
  - Default: 'AGATCGGAAGAG'
  - Comment: Illumina TruSeq adapter sequence for cutadapt to trim. Change to the sequence required by the sample libraries.
- --cutadaptAdapterR2
  - Default: 'AGATCGGAAGAG'
  - Comment: Illumina TruSeq adapter sequence for cutadapt to trim. Change to the sequence required by the sample libraries.
- --ref_fa
  - Default: '/projects/omics_share/human/GRCh38/genome/sequence/gatk/Homo_sapiens_assembly38.fasta'
  - Comment: The reference FASTA used throughout the pipeline for alignment and for any downstream analysis.
- --ref_fa_indices
  - Default: '/projects/omics_share/human/GRCh38/genome/indices/gatk/bwa/Homo_sapiens_assembly38.fasta'
  - Comment: Pre-compiled BWA index files.
- --mismatch_penalty
  - Default: -B 8
  - Comment: The BWA penalty for a mismatch.
- --masterfile
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/capture_kit_files/IDT/xGen_sampleID_amplicon/hg38Lifted_xGen_masterfile.txt'
  - Comment: Primerclip master file for amplicon primer trimming. This file is specific to each IDT xGen amplicon panel, and should be changed if samples are not derived from xGen sample ID.
- --amplicon_primer_intervals
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/capture_kit_files/IDT/xGen_sampleID_amplicon/hg38Lifted_xGen_SampleID_primers.interval_list'
  - Comment: GATK interval file with primer positions for the specific amplicon panel, used to calculate coverage metrics. This file is specific to each IDT xGen amplicon panel, and should be changed if samples are not derived from xGen sample ID. The file can be generated with [Picard BedToIntervalList](https://gatk.broadinstitute.org/hc/en-us/articles/13832706340763-BedToIntervalList-Picard).
- --amplicon_target_intervals
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/capture_kit_files/IDT/xGen_sampleID_amplicon/hg38Lifted_xGen_SampleID_merged_targets.interval_list'
  - Comment: GATK interval file with target positions for the specific amplicon panel, used to calculate coverage metrics. This file is specific to each IDT xGen amplicon panel, and should be changed if samples are not derived from xGen sample ID. The file can be generated with [Picard BedToIntervalList](https://gatk.broadinstitute.org/hc/en-us/articles/13832706340763-BedToIntervalList-Picard).
- --amplicon_rsid_targets
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/capture_kit_files/IDT/xGen_sampleID_amplicon/hg38Lifted_xGen_SampleID_merged_targets.txt'
  - Comment: Amplicon SNP target file containing rsID and gene information. Used to generate the final fingerprint report file.
- --gold_std_indels
  - Default: '/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz'
  - Comment: Used in GATK BaseRecalibrator.
- --phase1_1000G
  - Default: '/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/1000G_phase1.snps.high_confidence.hg38.vcf.gz'
  - Comment: Used in GATK BaseRecalibrator.
- --dbSNP
  - Default: '/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/dbsnp_151.vcf.gz'
  - Comment: The dbSNP database contains known single nucleotide polymorphisms, and is used in the annotation of known variants.
- --dbSNP_index
  - Default: '/projects/omics_share/human/GRCh38/genome/annotation/snps_indels/dbsnp_151.vcf.gz.tbi'
  - Comment: The dbSNP index file associated with the dbSNP VCF file.
- --call_val
  - Default: 50
  - Comment: The minimum phred-scaled confidence threshold at which variants should be called.
- --ploidy_val
  - Default: '-ploidy 2'
  - Comment: Variable passed to HaplotypeCaller. Not required for the amplicon workflow, but present in the module.
- --target_gatk
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/capture_kit_files/IDT/xGen_sampleID_amplicon/hg38Lifted_xGen_SampleID_merged_targets.bed'
  - Comment: A BED file with amplicon target intervals as defined in the amplicon array used to generate the data. NOTE: This file MUST reflect the amplicon array used to generate your data.
- --tmpdir
  - Default: '/fastscratch/${USER}'
  - Comment: Temporary directory to store intermediate files generated outside of the standard Nextflow cache location.
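To make the --pattern and --extension descriptions above concrete, here is a minimal Python sketch of how a brace glob such as "*_R{1,2}*" can be checked against file names. It is an illustration only: the pipeline itself resolves these globs through Nextflow, the helper names expand_braces and matches_read_pattern are hypothetical, and joining the pattern directly to the extension is an assumption about how the two parameters are combined.

```python
import re
from fnmatch import fnmatchcase


def expand_braces(pattern):
    """Expand {a,b,...} groups into one plain glob per alternative."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for option in match.group(1).split(","):
        # Recurse so that patterns with several brace groups are handled too.
        expanded.extend(expand_braces(head + option + tail))
    return expanded


def matches_read_pattern(file_name, pattern="*_R{1,2}*", extension=".fastq.gz"):
    """Return True if file_name fits the pattern followed by the extension.

    Assumption: the pattern and the extension are simply concatenated.
    """
    return any(
        fnmatchcase(file_name, glob + extension)
        for glob in expand_braces(pattern)
    )


if __name__ == "__main__":
    for name in ("READ_NAME_R1_MoreText.fastq.gz", "READ_NAME_R1.fastq.gz"):
        print(name, matches_read_pattern(name))
```

Running a check like this over a sample folder before launching a run is one way to spot files that the default pattern would skip.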
NOTE: * represents a wildcard that is a placeholder for values that will be filled in from input file names and/or parameters when the pipeline is run.
Naming Convention | Description |
---|---|
atac_report.html | Nextflow autogenerated report. |
trace.txt | Nextflow trace of processes. |
multiqc | MultiQC report summarizing quality metrics across samples in the analysis run. |
*variants_raw.vcf | Raw VCF calls from GATK HaplotypeCaller. |
*_variants_raw_dbsnpID.vcf | dbSNP-annotated variant calls. |
*_fingerprint.on_target_SNPs.tsv | Report containing a complete list of all amplicon targets and calls. |
*_fingerprint.off_target_SNPs.tsv | Report containing all off-target variant calls. |
bam | Directory containing alignments after base recalibration (i.e., after ApplyBQSR). |
stats | Directory containing all individual stats files output by the pipeline. |
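
The wildcard naming conventions in the table can be used to sort files from an output directory by type. The following is a small Python sketch under stated assumptions: the glob patterns are copied from the table, while the function name describe_output and the file name Sample_42_variants_raw.vcf are hypothetical examples, not names guaranteed by the pipeline.

```python
from fnmatch import fnmatchcase

# Glob patterns taken from the naming-convention table; the short
# descriptions paraphrase the table's Description column.
NAMING_CONVENTIONS = {
    "*variants_raw.vcf": "raw variant calls",
    "*_variants_raw_dbsnpID.vcf": "dbSNP-annotated variant calls",
    "*_fingerprint.on_target_SNPs.tsv": "on-target fingerprint report",
    "*_fingerprint.off_target_SNPs.tsv": "off-target variant calls",
}


def describe_output(file_name):
    """Return the description of the first convention that file_name matches."""
    for pattern, description in NAMING_CONVENTIONS.items():
        if fnmatchcase(file_name, pattern):
            return description
    return None


if __name__ == "__main__":
    # Hypothetical file name, used only to illustrate the matching.
    print(describe_output("Sample_42_variants_raw.vcf"))
```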
The required input header is: sampleID,lane,fastq_1,fastq_2. Samples can be provided either paired or unpaired.
- The sampleID column is a unique identifier for each individual sample, which is associated with other samples based on status and patient ID.
- The lane column contains lane information for individual samples. If a single sample ID is provided with multiple lanes, the sequences from each lane will be concatenated prior to analysis.
- The fastq_1 and fastq_2 columns must contain absolute paths or URLs to read 1 and read 2 from an Illumina paired-end sequencing run. For single-end data, fastq_2 is omitted.
sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz,/path/to/sample_101_001_R2.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz,/path/to/sample_10191_001_R2.fastq.gz
sampleID,lane,fastq_1,fastq_2
Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz
Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz
Sample_101,Lane_1,/path/to/sample_101_001_R1.fastq.gz
Sample_10191,Lane_1,/path/to/sample_10191_001_R1.fastq.gz
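
A manifest in either of the two layouts above, paired or single-end, can be sanity-checked before a run. The sketch below is one possible check in Python and not part of the pipeline: it verifies the header, accepts rows with or without fastq_2, and groups rows by sampleID so that multi-lane samples are easy to see. The function name group_manifest_by_sample is hypothetical.

```python
import csv
import io
from collections import defaultdict

REQUIRED_HEADER = ["sampleID", "lane", "fastq_1", "fastq_2"]


def group_manifest_by_sample(manifest_text):
    """Map each sampleID to a list of (lane, fastq_1, fastq_2) tuples.

    fastq_2 is None for single-end rows, where the field is missing or empty.
    """
    reader = csv.reader(io.StringIO(manifest_text.strip()))
    header = next(reader)
    if header != REQUIRED_HEADER:
        raise ValueError(f"unexpected header: {header}")
    samples = defaultdict(list)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) not in (3, 4):
            raise ValueError(f"expected 3 or 4 fields, got: {row}")
        sample_id, lane, fastq_1 = row[:3]
        fastq_2 = row[3] if len(row) == 4 and row[3] else None
        samples[sample_id].append((lane, fastq_1, fastq_2))
    return dict(samples)


if __name__ == "__main__":
    manifest = (
        "sampleID,lane,fastq_1,fastq_2\n"
        "Sample_42,Lane_1,/path/to/sample_42_001_R1.fastq.gz,/path/to/sample_42_001_R2.fastq.gz\n"
        "Sample_42,Lane_2,/path/to/sample_42_002_R1.fastq.gz,/path/to/sample_42_002_R2.fastq.gz\n"
    )
    for sample_id, lanes in group_manifest_by_sample(manifest).items():
        print(sample_id, len(lanes), "lane(s)")
```

A sample that appears with more than one lane, such as Sample_42 in the examples, is one whose reads would be concatenated prior to analysis.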