genomics_QC_pipeline

The genomic QC pipeline cleans and prepares imputed genotype data for pQTL analysis. All the rules are based on Alessia Mapelli’s work: https://github.com/ht-diva/pqtl-believe-interval/blob/main/Script_QC_INTERVAL_genomics.R

Requirements

Singularity

Getting started

git clone https://github.com/ht-diva/genomics_QC_pipeline.git
cd genomics_QC_pipeline
adapt the submit.sbatch and config/config.yaml files to your environment
sbatch submit.sbatch

The output is written to the path defined by the workspace_path variable in the config.yaml file. By default, this path is ./results.

Rules description

list_rs:
Purpose: Generate lists of all rsIDs and pseudo biallelic variants from the initial pgen file.
Output: Two files – one containing all rsIDs and the other containing pseudo biallelic variants.
recode_pgen:
Purpose: Replace the IDs in the imputed pgen file with a new format: chr:pos:ref:alt.
Output: An updated pgen file with the new ID format.
selected_sample:
Purpose: Select individuals who are present in the 2018 data and also have corresponding proteomic data.
Output: A filtered list of individuals.
filter_var:
Purpose: Perform several quality control steps: remove additional failed samples, identify and remove heterozygosity outliers, apply minor allele frequency (MAF) filtering, remove related samples, and filter variants on Hardy-Weinberg equilibrium (HWE).
Output: A cleaned dataset with high-quality variants and samples.
create_bgen:
Purpose: Convert the filtered data from the previous steps into bgen format, a commonly used format for storing large-scale genotype data.
Output: A bgen file containing the cleaned genotype data.
qctool:
Purpose: Compute SNP statistics using qctool, ensuring the quality of the variants.
Output: SNP statistics file.
get_hq_variants:
Purpose: Filter variants to retain only those with an info score greater than 0.7.
Output: A list of high-quality variants.
filter_hq_variants:
Purpose: Extract SNPs with an info score greater than 0.7 from the pgen file and create a new pgen file for each chromosome.
Output: pgen files for each chromosome containing only high-quality variants.
merge_filter_hq_variants:
Purpose: Merge the chromosome-specific pgen files from the previous step into a single pgen file.
Output: A combined pgen file containing high-quality variants from all chromosomes.
update_pgen_id:
Purpose: Update the variant IDs in the pgen file to the format chr:pos:A0:A1, with A0 and A1 in alphabetical order.
Output: An updated pgen file with harmonised IDs.
update_pgen_alleles:
Purpose: Harmonise the alleles in the pgen file to match the new IDs.
Output: A pgen file with harmonised alleles.
merge_filter_hq_variants_new_id_alleles_pgen:
Purpose: Merge all the pgen files from the previous step into a final single pgen file.
Output: A final combined pgen file with harmonised IDs and alleles, ready for pQTL analysis.
pgen2bed:
Purpose: Convert the pgen file to bed format, with hard-call-threshold set to 0.49999999.
Output: A bed file with harmonised alleles and as few missing genotype calls as possible.
merge_filter_hq_variants_new_id_alleles_bed:
Purpose: Merge all the bed files from the previous step into a final single bed file.
Output: A final combined bed file.
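
The three numeric and naming conventions used by the rules above (the info score cut-off of 0.7, the chr:pos:A0:A1 ID with alphabetically ordered alleles, and the hard-call threshold of 0.49999999) can be illustrated with a minimal Python sketch. This is not the pipeline's code: the function names are hypothetical, alleles are assumed to be upper-case strings, and the hard-call logic assumes a biallelic variant whose alternate-allele dosage lies between 0 and 2, with the call set to missing when the dosage is further from the nearest integer than the threshold.

```python
import math
from typing import Optional

# Values taken from the rule descriptions above.
INFO_THRESHOLD = 0.7              # get_hq_variants / filter_hq_variants
HARD_CALL_THRESHOLD = 0.49999999  # pgen2bed


def harmonised_id(chrom: str, pos: int, allele_a: str, allele_b: str) -> str:
    """Build a chr:pos:A0:A1 ID with the two alleles in alphabetical order.

    Hypothetical helper: assumes alleles are plain nucleotide strings.
    """
    a0, a1 = sorted([allele_a.upper(), allele_b.upper()])
    return f"{chrom}:{pos}:{a0}:{a1}"


def is_high_quality(info_score: float, threshold: float = INFO_THRESHOLD) -> bool:
    """Keep a variant only if its info score is strictly greater than the threshold."""
    return info_score > threshold


def hard_call(alt_dosage: float,
              threshold: float = HARD_CALL_THRESHOLD) -> Optional[int]:
    """Turn a biallelic alternate-allele dosage (0..2) into a hard call.

    Returns the nearest integer genotype (0, 1 or 2) when the dosage is
    within `threshold` of it, otherwise None (a missing call).
    """
    nearest = math.floor(alt_dosage + 0.5)
    if abs(alt_dosage - nearest) <= threshold:
        return nearest
    return None


if __name__ == "__main__":
    print(harmonised_id("1", 12345, "T", "C"))  # 1:12345:C:T
    print(is_high_quality(0.7))                 # False (not strictly greater)
    print(hard_call(1.2))                       # 1
    print(hard_call(1.2, threshold=0.1))        # None
```

With a threshold this close to 0.5, only dosages exactly halfway between two genotypes stay missing, which is why the bed output keeps almost every call.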

Output

pgen folder (contains raw pgen files with the new ID format: chr:pos:ref:alt)
- qc_recoded subfolder (contains pgen files that have been quality controlled and recoded but not yet harmonised)
- qc_recoded_harmonised subfolder (contains pgen files that have been quality controlled, recoded, and harmonised)
bed folder
- qc_recoded_harmonised subfolder (contains bed files that have been harmonised)
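
After a run, the layout above can be checked with a short Python sketch. It assumes the subfolders are nested directly under the pgen and bed folders inside workspace_path, as the listing suggests. The function name and the folder list are illustrative and not part of the pipeline.

```python
from pathlib import Path
from typing import List

# Folder layout inferred from the Output section; the nesting is an assumption.
EXPECTED_SUBFOLDERS = [
    "pgen",
    "pgen/qc_recoded",
    "pgen/qc_recoded_harmonised",
    "bed",
    "bed/qc_recoded_harmonised",
]


def missing_output_folders(workspace_path: str = "./results") -> List[str]:
    """Return the expected output folders that do not exist under workspace_path."""
    root = Path(workspace_path)
    return [sub for sub in EXPECTED_SUBFOLDERS if not (root / sub).is_dir()]


if __name__ == "__main__":
    missing = missing_output_folders()
    if missing:
        print("Missing output folders:", ", ".join(missing))
    else:
        print("All expected output folders are present.")
```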
