Ancestry Pipeline ReadMe

Genetic Ancestry Estimation Documentation

Ancestry Pipeline (--workflow ancestry)

•	Step 1: BAM Index  
•	Step 2: SNP Region Pileup  
•	Step 3: SNP Calling  
•	Step 4: SNP Filtering  
•	Step 5: SNP Annotation  
•	Step 6: VCF to Eigenstrat format  
•	Step 7: SNPweights Infer Ancestry

flowchart TD
    p1((Sample\nAlignment File))
    p2[SAMTOOLS_INDEX]
    p3[BCFTOOLS_MPILEUP]
    p4[BCFTOOLS_CALL]
    p5[BCFTOOLS_FILTER]
    p6[BCFTOOLS_ANNOTATE]
    p7[VCF2EIGENSTRAT]
    p8[SNPWEIGHTS_INFERANC]

    o1([Genetic Ancestry Estimation]):::output

    p1 --> p2
    p2 --> p3
    p3 --> p4
    p4 --> p5
    p5 --> p6
    p6 --> p7
    p7 --> p8
    p8 --> o1

    classDef output fill:#90aaff,stroke:#6c8eff,stroke-width:2px,color:#000000
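
To make the first three steps concrete, here is a rough Python sketch that only assembles the corresponding `samtools` and `bcftools` command lines and does not run them. The function name, the file names, and the specific flags are assumptions based on general samtools/bcftools usage; the options the pipeline actually passes are not stated in this document and may differ.

```python
def first_three_steps(bam, ref_fa, targets_bed, out_vcf):
    """Return argv lists for BAM indexing, SNP region pileup, and SNP calling.

    Illustrative only: nothing is executed, and the flags are assumptions,
    not the pipeline's confirmed settings.
    """
    # Step 1: index the alignment file.
    index_cmd = ["samtools", "index", bam]
    # Step 2: pileup restricted to the target SNP regions.
    # -f: reference FASTA; -R: regions file; -Ou: uncompressed BCF for piping.
    mpileup_cmd = ["bcftools", "mpileup", "-f", ref_fa, "-R", targets_bed, "-Ou", bam]
    # Step 3: call genotypes from the piped pileup.
    # -m: multiallelic caller; -Oz: compressed VCF; -o: output file.
    call_cmd = ["bcftools", "call", "-m", "-Oz", "-o", out_vcf]
    return [index_cmd, mpileup_cmd, call_cmd]


if __name__ == "__main__":
    for cmd in first_three_steps("sample.bam", "ref.fasta", "targets.bed.gz", "calls.vcf.gz"):
        print(" ".join(cmd))
```

In a shell, the second and third commands would typically be joined with a pipe. Steps 4 to 7 are left out because their tool-specific options cannot be inferred from this page.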

Parameters for Ancestry Pipeline

--pubdir
- Default: /<PATH>
- Comment: The directory where saved outputs will be stored.
--organize_by
- Default: sample
- Comment: How to organize the output folder structure. Options: sample or analysis.
-w
- Default: /<PATH>
- Comment: The directory used for all intermediary files and Nextflow processes. This directory can become quite large, so it should be on /fastscratch or another location with ample storage.
--sample_folder
- Default: /<PATH>
- Comment: The path to the folder that contains all the samples to be run by the pipeline. The files in this path can also be symbolic links.
--csv_input
- Default: null
- Comment: Provide a CSV manifest file with the header: "sampleID,bam". See below for an example file.
- --download_data can be specified.
--gen_org
- Default: human
- Comment: Options: human.
--ref_fa
- Default: '/projects/omics_share/human/GRCh38/genome/sequence/gatk/Homo_sapiens_assembly38.fasta'
- Comment: The reference fasta to be used throughout the process for alignment as well as any downstream analysis. JAX users should not change this parameter.
--genotype_targets
- Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/ancestry_panel/snp_panel_v2_targets_annotations.snpwt.bed.gz'
- Comment: Target SNP bed file for the ancestry panel. Can contain annotation information.
--snpID_list
- Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/ancestry_panel/snp_panel_v2.list'
- Comment: List of target SNPs used in the BCFtools filtering step.
--snp_annotations
- Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/ancestry_panel/snp_panel_v2_targets_annotations.snpwt.bed.gz'
- Comment: Target SNP bed file with annotations for the ancestry panel.
--snpweights_panel
- Default: '/projects/compsci/omics_share/human/GRCh38/supporting_files/ancestry_panel/ancestry_panel_v2.snpwt'
- Comment: SNP weights panel in the appropriate format.
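
As a minimal sketch of how these options fit together, the Python helper below assembles the ancestry-specific arguments into a list. The helper name `build_ancestry_args` is hypothetical, and the launcher that would precede these arguments (for example the Nextflow command and script path) is deliberately left out because it is not given in this document.

```python
def build_ancestry_args(csv_input, pubdir, work_dir, organize_by="sample"):
    """Assemble ancestry-workflow options as an argument list (nothing is executed)."""
    # The documented options for --organize_by are "sample" and "analysis".
    if organize_by not in ("sample", "analysis"):
        raise ValueError("organize_by must be 'sample' or 'analysis'")
    return [
        "--workflow", "ancestry",
        "--gen_org", "human",          # the only documented option
        "--csv_input", csv_input,      # manifest with header sampleID,bam
        "--pubdir", pubdir,            # where saved outputs are stored
        "--organize_by", organize_by,
        "-w", work_dir,                # work directory; can become large
    ]


if __name__ == "__main__":
    print(" ".join(build_ancestry_args("samples.csv", "/path/to/out", "/path/to/work")))
```

The reference and panel parameters are omitted here so that their defaults apply.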

Pipeline Default Outputs

NOTE: * is a wildcard, a placeholder for values that are filled in from input file names and/or parameters when the pipeline is run.

| Naming Convention | Description |
| --- | --- |
| `ancestry_report.html` | Nextflow autogenerated report |
| `trace.txt` | Nextflow trace of processes |
| `*.ancestry.tsv` | Genetic ancestry report. See https://www.biorxiv.org/content/10.1101/2022.10.24.513591v1 for details on the report and methods |
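
For downstream use, the ancestry reports can be collected with a few lines of Python. This is a sketch under assumptions: the function name is hypothetical, the files are assumed to be tab-delimited (as the `.tsv` extension suggests), and no particular column layout is assumed, so rows are returned as plain lists of fields. The search is recursive because the folder layout depends on `--organize_by`.

```python
import csv
from pathlib import Path


def load_ancestry_reports(pubdir):
    """Map each *.ancestry.tsv file name found under pubdir to its rows."""
    reports = {}
    # rglob searches all subfolders, whichever output layout was chosen.
    for path in sorted(Path(pubdir).rglob("*.ancestry.tsv")):
        with path.open(newline="") as handle:
            reports[path.name] = list(csv.reader(handle, delimiter="\t"))
    return reports
```

Calling `load_ancestry_reports("/path/to/pubdir")` returns a dictionary keyed by file name; if two samples could produce identically named files, keying by full path would be the safer choice.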

CSV Input Sample Sheet

The required input header is: sampleID,bam.

The sampleID column is a unique identifier for each individual sample.
The bam column contains the path to that sample's alignment (BAM) file.

Basic examples:

An example csv file:

sampleID,bam
Sample_42,/path/to/sample_42_dedup_realigned.bam
Sample_101,/path/to/sample_101_dedup_realigned.bam
Sample_10191,/path/to/sample_10191_dedup_realigned.bam
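
Before launching a run, a manifest can be sanity-checked with a short script. The sketch below is not part of the pipeline; the function name and the specific checks (exact header, two fields per row, non-empty fields, unique sampleID values) are assumptions drawn from the description on this page.

```python
import csv
import io

EXPECTED_HEADER = ["sampleID", "bam"]


def validate_manifest(text):
    """Return a list of problems found in a sampleID,bam manifest (empty if none)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0] != EXPECTED_HEADER:
        return ["header must be exactly: sampleID,bam"]
    problems = []
    seen = set()
    for line_no, row in enumerate(rows[1:], start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) != len(EXPECTED_HEADER):
            problems.append(f"line {line_no}: expected 2 columns, found {len(row)}")
            continue
        sample_id, bam = (field.strip() for field in row)
        if not sample_id or not bam:
            problems.append(f"line {line_no}: empty field")
        if sample_id in seen:
            problems.append(f"line {line_no}: duplicate sampleID {sample_id}")
        seen.add(sample_id)
    return problems


if __name__ == "__main__":
    manifest = "sampleID,bam\nSample_42,/path/to/sample_42_dedup_realigned.bam\n"
    print(validate_manifest(manifest))
```

The check works on the manifest text only; it does not confirm that the BAM files exist or are readable.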
