Ancestry Pipeline ReadMe
MikeWLloyd edited this page Jun 5, 2024 · 3 revisions
• Step 1: BAM Index
• Step 2: SNP Region Pileup
• Step 3: SNP Calling
• Step 4: SNP Filtering
• Step 5: SNP Annotation
• Step 6: VCF to Eigenstrat format
• Step 7: SNPweights Infer Ancestry
flowchart TD
p1((Sample\nAlignment File))
p2[SAMTOOLS_INDEX]
p3[BCFTOOLS_MPILEUP]
p4[BCFTOOLS_CALL]
p5[BCFTOOLS_FILTER]
p6[BCFTOOLS_ANNOTATE]
p7[VCF2EIGENSTRAT]
p8[SNPWEIGHTS_INFERANC]
o1([Genetic Ancestry Estimation]):::output
p1 --> p2
p2 --> p3
p3 --> p4
p4 --> p5
p5 --> p6
p6 --> p7
p7 --> p8
p8 --> o1
classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000
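The flowchart describes a strictly linear chain of processes. As a minimal sketch (not part of the pipeline itself), the same edges can be written as a dependency map and sorted to recover the execution order. The node names are copied from the flowchart; the `DEPENDS_ON` and `execution_order` names are illustrative only.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Edges copied from the flowchart: each key runs after the nodes in its set.
DEPENDS_ON = {
    "SAMTOOLS_INDEX": {"Sample Alignment File"},
    "BCFTOOLS_MPILEUP": {"SAMTOOLS_INDEX"},
    "BCFTOOLS_CALL": {"BCFTOOLS_MPILEUP"},
    "BCFTOOLS_FILTER": {"BCFTOOLS_CALL"},
    "BCFTOOLS_ANNOTATE": {"BCFTOOLS_FILTER"},
    "VCF2EIGENSTRAT": {"BCFTOOLS_ANNOTATE"},
    "SNPWEIGHTS_INFERANC": {"VCF2EIGENSTRAT"},
    "Genetic Ancestry Estimation": {"SNPWEIGHTS_INFERANC"},
}


def execution_order(depends_on):
    """Return the nodes in an order that respects every dependency."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for position, node in enumerate(execution_order(DEPENDS_ON)):
        print(position, node)
```

Because every node has exactly one predecessor, the sorted order is the same as reading the flowchart from top to bottom.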
- --pubdir
  - Default: /<PATH>
  - Comment: The directory where the saved outputs will be stored.
- --organize_by
  - Default: sample
  - Comment: How to organize the output folder structure. Options: sample or analysis.
- -w
  - Default: /<PATH>
  - Comment: The directory used for all intermediary files and Nextflow processes. This directory can become quite large, so it should be a location on /fastscratch or another directory with ample storage.
- --sample_folder
  - Default: /<PATH>
  - Comment: The path to the folder that contains all the samples to be run by the pipeline. The files in this path can also be symbolic links.
- --csv_input
  - Default: null
  - Comment: Provide a CSV manifest file with the header: "sampleID,bam". See below for an example file.
- --download_data can be specified.
- --gen_org
  - Default: human
  - Comment: Options: human.
- --ref_fa
  - Default: '/projects/omics_share/human/GRCh38/genome/sequence/gatk/Homo_sapiens_assembly38.fasta'
  - Comment: The reference fasta to be used throughout the process for alignment as well as any downstream analysis. JAX users should not change this parameter.
- --genotype_targets
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/ancestry_panel/snp_panel_v2_targets_annotations.snpwt.bed.gz'
  - Comment: Target SNP bed file for the ancestry panel. Can contain annotation information.
- --snpID_list
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/ancestry_panel/snp_panel_v2.list'
  - Comment: List of target SNP IDs used in the BCFtools filtering step.
- --snp_annotations
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/ancestry_panel/snp_panel_v2_targets_annotations.snpwt.bed.gz'
  - Comment: Target SNP bed file with annotations for the ancestry panel.
- --snpweights_panel
  - Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/ancestry_panel/ancestry_panel_v2.snpwt'
  - Comment: SNP weights panel in the appropriate format.
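As a rough illustration of how the parameters above fit together, the following sketch assembles a `nextflow run` argument list from them. It is not the pipeline's documented invocation: the script path passed as `script` is a hypothetical placeholder, any profile or workflow-selection options the pipeline may need are omitted because this page does not list them, and requiring at least one of `--csv_input` or `--sample_folder` is an assumption.

```python
import shlex

# Options for --organize_by as listed in the parameter description.
ORGANIZE_BY_OPTIONS = ("sample", "analysis")


def build_command(script, pubdir, workdir, csv_input=None, sample_folder=None,
                  organize_by="sample", gen_org="human"):
    """Build an argument list for `nextflow run` from the documented parameters.

    `script` is a placeholder for the pipeline entry point (an assumption).
    """
    if organize_by not in ORGANIZE_BY_OPTIONS:
        raise ValueError(f"organize_by must be one of {ORGANIZE_BY_OPTIONS}")
    if csv_input is None and sample_folder is None:
        # Assumption: the pipeline needs at least one source of input samples.
        raise ValueError("provide csv_input or sample_folder")

    cmd = [
        "nextflow", "run", script,
        "-w", workdir,                  # work directory for intermediary files
        "--pubdir", pubdir,             # where saved outputs are stored
        "--organize_by", organize_by,
        "--gen_org", gen_org,
    ]
    if csv_input is not None:
        cmd += ["--csv_input", csv_input]
    if sample_folder is not None:
        cmd += ["--sample_folder", sample_folder]
    return cmd


if __name__ == "__main__":
    # All paths below are made-up examples.
    example = build_command("main.nf", "/path/to/pubdir", "/path/to/work",
                            csv_input="/path/to/manifest.csv")
    print(shlex.join(example))
```

Returning a list rather than a single string keeps paths with spaces intact when the command is passed to `subprocess.run`.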
NOTE: * represents a wildcard: a placeholder for values that are filled in from input file names and/or parameters when the pipeline is run.
Naming Convention | Description |
---|---|
ancestry_report.html | Nextflow autogenerated report |
trace.txt | Nextflow trace of processes |
*.ancestry.tsv | Genetic ancestry report. See https://www.biorxiv.org/content/10.1101/2022.10.24.513591v1 for details on the report and methods |
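The naming conventions in the table can be used to check a finished run. The sketch below searches an output directory for files matching each pattern and reports which patterns have no match. The recursive search is an assumption, since this page does not describe where each file lands under `--pubdir`, and the function names are illustrative only.

```python
from fnmatch import fnmatch
from pathlib import Path

# Naming conventions copied from the output table.
EXPECTED_PATTERNS = ["ancestry_report.html", "trace.txt", "*.ancestry.tsv"]


def find_outputs(pubdir):
    """Map each naming pattern to the sorted list of matching files under pubdir."""
    files = [p for p in Path(pubdir).rglob("*") if p.is_file()]
    return {
        pattern: sorted(str(p) for p in files if fnmatch(p.name, pattern))
        for pattern in EXPECTED_PATTERNS
    }


def missing_patterns(pubdir):
    """Return the naming patterns that matched no file under pubdir."""
    return [pattern for pattern, hits in find_outputs(pubdir).items() if not hits]


if __name__ == "__main__":
    import sys

    # Usage: python check_outputs.py /path/to/pubdir
    for pattern in missing_patterns(sys.argv[1]):
        print(f"no file matching {pattern}")
```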
The required input header is: sampleID,bam.
- The sampleID column is a unique identifier for each individual sample.
- The bam column contains the path to that sample's alignment (BAM) file, as in the example below.
sampleID,bam
Sample_42,/path/to/sample_42_dedup_realigned.bam
Sample_101,/path/to/sample_101_dedup_realigned.bam
Sample_10191,/path/to/sample_10191_dedup_realigned.bam
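A manifest can be checked before launching a run. The sketch below assumes the two-column `sampleID,bam` header given for `--csv_input` and reports rows with the wrong number of fields, empty fields, or repeated sample IDs. It is an illustrative helper, not part of the pipeline, and it does not check that the BAM paths exist.

```python
import csv
import io

EXPECTED_HEADER = ["sampleID", "bam"]


def validate_manifest(text):
    """Return a list of problems found in the manifest text (empty if none)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None or [col.strip() for col in header] != EXPECTED_HEADER:
        return ["line 1: header must be exactly 'sampleID,bam'"]

    problems = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            problems.append(f"line {line_no}: expected 2 fields, found {len(row)}")
            continue
        sample_id, bam = (field.strip() for field in row)
        if not sample_id or not bam:
            problems.append(f"line {line_no}: empty sampleID or bam field")
            continue
        if sample_id in seen:
            problems.append(f"line {line_no}: duplicate sampleID '{sample_id}'")
        seen.add(sample_id)
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python validate_manifest.py manifest.csv
    with open(sys.argv[1], newline="") as handle:
        for problem in validate_manifest(handle.read()):
            print(problem)
```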