The genomic QC pipeline cleans and prepares imputed genotype data for pQTL analysis. All rules are based on Alessia Mapelli’s work: https://github.com/ht-diva/pqtl-believe-interval/blob/main/Script_QC_INTERVAL_genomics.R
- Requires Singularity (see also environment.yml and Makefile)
- git clone https://github.com/ht-diva/genomics_QC_pipeline.git
- cd genomics_QC_pipeline
- adapt the submit.sbatch and config/config.yaml files to your environment
- sbatch submit.sbatch
The output is written to the path defined by the workspace_path variable in the config.yaml file. By default, this path is ./results.
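As a minimal sketch of how the output location is resolved, the snippet below falls back to ./results when workspace_path is not set. It assumes workspace_path is a top-level key and that `config` is the already-parsed content of config/config.yaml; the function name `resolve_workspace` is hypothetical and not part of the pipeline.

```python
from pathlib import Path

# Default output location stated in this README.
DEFAULT_WORKSPACE = "./results"


def resolve_workspace(config: dict) -> Path:
    """Return the output directory for a parsed config mapping.

    Assumption: workspace_path is a top-level key of config/config.yaml.
    """
    # Use the configured path if present and non-empty, else the default.
    return Path(config.get("workspace_path") or DEFAULT_WORKSPACE)


if __name__ == "__main__":
    print(resolve_workspace({}))
    print(resolve_workspace({"workspace_path": "/scratch/qc_run"}))
```

The path /scratch/qc_run is only an illustration; any writable directory can be used.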
- list_rs:
Purpose: Generate lists of all rsIDs and pseudo biallelic variants from the initial pgen file.
Output: Two files – one containing all rsIDs and the other containing pseudo biallelic variants.
- recode_pgen:
Purpose: Replace the IDs in the imputed pgen file with a new format: chr:pos:ref:alt.
Output: An updated pgen file with the new ID format.
- selected_sample:
Purpose: Select individuals who are present in the 2018 data and have corresponding proteomic data.
Output: A filtered list of individuals.
- filter_var:
Purpose: Perform several quality control steps: remove additional failed samples, identify and remove heterozygosity outliers, remove related samples, and filter variants by minor allele frequency (MAF) and Hardy-Weinberg equilibrium (HWE).
Output: A cleaned dataset with high-quality variants and samples.
- create_bgen:
Purpose: Convert the filtered data from the previous steps into bgen format, a commonly used format for storing large-scale genotype data.
Output: A bgen file containing the cleaned genotype data.
- qctool:
Purpose: Compute SNP statistics using qctool, ensuring the quality of the variants.
Output: SNP statistics file.
- get_hq_variants:
Purpose: Filter variants to retain only those with an info score greater than 0.7.
Output: A list of high-quality variants.
- filter_hq_variants:
Purpose: Extract SNPs with an info score greater than 0.7 from the pgen file and create a new pgen file for each chromosome.
Output: pgen files for each chromosome containing only high-quality variants.
- merge_filter_hq_variants:
Purpose: Merge the chromosome-specific pgen files from the previous step into a single pgen file.
Output: A combined pgen file containing high-quality variants from all chromosomes.
- update_pgen_id:
Purpose: Update the variant IDs in the pgen file to the format chr:pos:A0:A1, with A0 and A1 in alphabetical order.
Output: An updated pgen file with harmonised IDs.
- update_pgen_alleles:
Purpose: Harmonise the alleles in the pgen file to match the new IDs.
Output: A pgen file with harmonised alleles.
- merge_filter_hq_variants_new_id_alleles_pgen:
Purpose: Merge all the pgen files from the previous step into a final single pgen file.
Output: A final combined pgen file with harmonised IDs and alleles, ready for pQTL analysis.
- pgen2bed:
Purpose: Convert the pgen files to bed format, with hard-call-threshold set to 0.49999999.
Output: A bed file with harmonised alleles and as few missing genotype calls as possible.
- merge_filter_hq_variants_new_id_alleles_bed:
Purpose: Merge all the bed files from the previous step into a final single bed file.
Output: A final combined bed file.
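The two ID formats used by the rules above (chr:pos:ref:alt in recode_pgen, and chr:pos:A0:A1 with alphabetically ordered alleles in update_pgen_id) can be illustrated with a small sketch. This is not the pipeline's own code: the function names are hypothetical, and it assumes that "alphabetical order" means plain string sorting of the two alleles and that the chromosome is written without a prefix.

```python
def recoded_id(chrom, pos, ref, alt):
    """Build an ID in the chr:pos:ref:alt format (allele order preserved)."""
    return f"{chrom}:{pos}:{ref}:{alt}"


def harmonised_id(chrom, pos, ref, alt):
    """Build an ID in the chr:pos:A0:A1 format.

    Assumption: A0 and A1 are the two alleles sorted as plain strings,
    so the ID no longer depends on which allele is ref or alt.
    """
    a0, a1 = sorted([ref, alt])
    return f"{chrom}:{pos}:{a0}:{a1}"


if __name__ == "__main__":
    # Hypothetical variant, used only for illustration.
    print(recoded_id("1", 12345, "T", "C"))
    print(harmonised_id("1", 12345, "T", "C"))
```

With this ordering, a variant gets the same harmonised ID regardless of which allele is recorded as the reference, which is what makes IDs comparable across datasets.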
Output folders:
- pgen folder (contains raw pgen files with the new ID format: chr:pos:ref:alt)
  - qc_recoded subfolder (contains pgen files that have been quality controlled and recoded but not yet harmonised)
  - qc_recoded_harmonised subfolder (contains pgen files that have been quality controlled, recoded, and harmonised)
- bed folder
  - qc_recoded_harmonised subfolder (contains bed files that have been harmonised)
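A quick way to check that a run produced the folders listed above is sketched below. It assumes the folders sit directly under the workspace path and that the subfolders are nested inside pgen and bed as listed; the helper names are hypothetical.

```python
from pathlib import Path

# Folder names taken from the list above; their nesting under the
# workspace path is an assumption.
SUBFOLDERS = [
    "pgen",
    "pgen/qc_recoded",
    "pgen/qc_recoded_harmonised",
    "bed",
    "bed/qc_recoded_harmonised",
]


def expected_layout(workspace="./results"):
    """Return the expected output directories under the workspace."""
    root = Path(workspace)
    return [root / sub for sub in SUBFOLDERS]


def missing_folders(workspace="./results"):
    """Return the expected directories that do not exist on disk."""
    return [p for p in expected_layout(workspace) if not p.is_dir()]


if __name__ == "__main__":
    for path in missing_folders():
        print(f"missing: {path}")
```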