Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

invariant vcf with ALT "*" instead of "." #105

Open
macurqcron opened this issue Apr 19, 2024 · 4 comments
Open

invariant vcf with ALT "*" instead of "." #105

macurqcron opened this issue Apr 19, 2024 · 4 comments
Labels
help wanted Extra attention is needed

Comments

@macurqcron
Copy link

macurqcron commented Apr 19, 2024

I have an allsites vcf, containing invariant and variant sites, that I have properly filtered (e.g., missing data, depth). Since I used vcftools at a certain stage [to update, I have tried without using vcftools and using only bcftools and the invariant sites still are transformed to "*"], the "." symbols in the ALT column (what bcftools recognizes as an invariant site) have been changed to "*". When running pixy, I got the warning/solution "ERROR: the provided VCF appears to contain no invariant sites (ALT = "."). This check can be bypassed via --bypass_invariant_check 'yes'."

My question is, if I move forward with --bypass_invariant_check 'yes' will that impact the algorithms ability to differentiate between variant and invariant sites? And if so, is the best way forward to figure out how to change instances of "*" to "." in my allsite vcf ALT column.

@macurqcron macurqcron added the help wanted Extra attention is needed label Apr 19, 2024
@macurqcron
Copy link
Author

macurqcron commented Apr 19, 2024

Just to follow up - I was unaware github also had "closed issues" before posting. Seems like similar questions have been asked and answered there about --bypass_invariant_check, but my question does still seem unique so I will leave it up

@AliBasuony2022
Copy link

Hi,

Sorry, I can't help with this issue. I'm just wondering how did you filtered invariants? Did you split the all sites vcf to variant and invariants, and then filtered each one separately followed by concatenate in one vcf file? Filtering of variants (SNPs) seems straitforward, but I'm just confused how to filter invariants. I have an invariants file with many missing data and when filtered it dropped more than 99% of the data ( from 116,000,000 to 1,000,000 lines).

Any help will be appreciable

thanks
Ali

@macurqcron
Copy link
Author

Hi Ali,

I did split the allsites vcf to variant and invariant flies. I already had a variant file I made previously, so I focussed on creating the invariant file from the allsites vcf. Once I had the two files, I did concatenate them to create the filtered allsites vcf to use with pixy.

First, I had to create and allsites vcf by calling allsites (variant and invariant SNPs) during the SNP calling step. This is the code I used:

module load bwa/0.7.17
module load samtools/1.16.1
module load gatk/4.2.4.0
module load vcftools/0.1.16

name=$( awk "NR==$SLURM_ARRAY_TASK_ID" ./ref_genome/chromosomelist_vcf.txt)

gatk GenotypeGVCFs \
	-R ./ref_genome/LF10g_v2.0.fa \
	-V gendb://db/ch3_"$name" \
	-all-sites -L "$name" -O ./vcf/allsites_"$name".vcf.gz

I parallelized my code on SLURM, the first line after loading the modules just refers to a text file that has the individual chromosome names from my reference genome. The important addition here is
-all-sites -L "$name"
without that the code would only keep variant sites (perhaps explaining why your invariant file was missing so much data?).

To make my invariant vcf, I ended up creating a series of lists and then filtering based on those lists since trying to directly filter invariant sites using bcftools or vcftools wasn't working. The key thing I used was invariant sites are denoted by "*" in the ALT column.

First, I filtered the allsites vcf based on quality (also tried to filter MAF = 0, but didn't work to remove variant sites)

module load bcftools/1.16

cd vcf

    bcftools view \
    -S ^filtered_indv_mean_depth.txt
    -i 'MAF = 0 && F_PASS(FORMAT/DP>0) > 0.8' \
    -m 2 \
    -M 2 \
    -O z \
    allsites_ch3_full_genome.vcf.gz > allsites_ch3_filtered_md.vcf.gz

Then, to create my lists, I used my already made variant vcf file as a guideline for different filtering thresholds

bcftools view -H allsites_ch3_filtered_md.vcf.gz  | wc -l 
# 338166
# count # of sites in variant vcf
bcftools view -H ch3_full_genome_filtered_ql.vcf.gz | wc -l
# 186746

# extracting chromosome, position, REF, ALT column data
bcftools query -f '%CHROM %POS %REF %ALT\n' allsites_ch3_filtered_md.vcf.gz -o allsites_ch3_filtered_md.txt

# creating text file filtered to only include invariant sites (defined by ALT = "*")
awk '$4 == "*"' allsites_ch3_filtered_md.txt > allsites_invariantsites_ch3_filtered_md.txt

# 
wc -l allsites_invariantsites_ch3_filtered_md.txt
# 98407 sites are invariant (after filtering for missing data)

# created a list with specific sites that are invariant to filter by 
awk '{print $1 "\t" $2}' allsites_invariantsites_ch3_filtered_md.txt > invariant_sites_md_filter.txt

then, I used bcftools to create my invariant vcf by filtering my allsites vcf for those sites specifically, plus removing individuals with missing data that I knew based on my previous work creating my variant vcf

bcftools view \
    -T invariant_sites_md_filter.txt \
    -S ^filtered_indv_mean_depth.txt \
    -O z \
     allsites_ch3_filtered_md.vcf.gz > invariant_ch3_filtered_md_indv.vcf.gz

now I have an invariant vcf that has been filtered for missing data per site (-T command), missing data and low depth per individual (-S command). I will examine what individual and site depth looks like, to further filter the data by site depth, and then I will concatenate the invariant file with my previously filtered variant file (that has been used for all other analyses)

module load StdEnv/2020
module load vcftools/0.1.16

## Calculate mean depth per individual 
vcftools --gzvcf invariant_ch3_filtered_md_indv --depth --out invariant_ch3_filtered_md_indv

## Calculate mean depth per site
vcftools --gzvcf invariant_ch3_filtered_md_indv --site-mean-depth --out invariant_ch3_filtered_md_indv

will generate a list of sites within a certain threshold (see what I used to filter variant data OR look at distributions in R) and will filter the invariant data (.ldepth.mean text file) based on that information to have my final invariant vcf (filtered by depth) that I will then concatenate to my filtered vcf (depth, quality, missing data and missing/low depth individuals)

filtering values by column 3, selecting only the first two columns to print, and removing the header. Here I am filtering by a minimum depth of 10 and a maximum depth of 55, and this was selected based on my mean depth (~25x)

awk 'NR==1 {print $1 "\t" $2}; NR>1 && $3 > 10 && $3 < 55 {print $1 "\t" $2}' <(tail -n +2 invariant_ch3_filtered_md_indv.ldepth.mean) > invariant_sites_depth_filter.txt

# filter invariant vcf to specific depth thresholds using list from above
bcftools view \
    -T invariant_sites_depth_filter.txt \
    -O z \
     invariant_ch3_filtered_md_indv.vcf.gz > invariant_ch3_filtered_md_indv_depth.vcf.gz
     
 # index vcf files
# invariant vcf
     bcftools index invariant_ch3_filtered_md_indv_depth.vcf.gz 
# variant vcf
     bcftools index ch3_full_genome_filtered_site-indv_dp.ql.vcf.gz

# final step, concatenating the two files for a filtered, allsites vcf! 

bcftools concat \
--allow-overlaps \
invariant_ch3_filtered_md_indv_depth.vcf.gz ch3_full_genome_filtered_site-indv_dp.ql.vcf.gz \
-O z -o pixy_invariant_variant_filtered.vcf.gz

# index your allsites filtered vcf
tabix pixy_invariant_variant_filtered.vcf.gz

now you have a file ready for pixy
here is my code for running pixy, note you will need the line --bypass_invariant_check 'yes' since the invariant sites are not coded the way pixy expects, but they exist.

pixy --stats pi dxy \
--vcf pixy_ch3_invariant_variant_filtered.vcf.gz \
--populations pixy_PopFile_ch3.txt \
--window_size 1000000 \
--n_cores 8 \
--bypass_invariant_check 'yes' \
--chromosomes 'LF10_chr1,LF10_chr2,LF10_chr3,LF10_chr4,LF10_chr5,LF10_chr6,LF10_chr7,LF10_chr8'

Hope this helps, I understand this might be overwhelming. But, it is how I accomplished generating an allsites vcf

@AliBasuony2022
Copy link

AliBasuony2022 commented Nov 3, 2024

Hi Mackenzie,

First of all, I'm sorry for late response and thanks so much for the detailed expalanition.

I'm still stucking in the first step and I think I'm not fully understand how to parallelize my code on SLURM. I tried alot of things, but all of them gave an errors. Please see some attached examples of error files.

The errors seem to be related to two main things:

  1. Using of awk, (I have googled about this but didn't find a solution, sorry).

error:
awk: cmd. line:2: NR==
awk: cmd. line:2: ^ unexpected newline or end of string

  1. linking the chromosome names to those in the database (my-database_12) directory.

error:
A- A USER ERROR has occurred: Badly formed genome unclippedLoc: Failed to parse Genome Location string: : intervalQueryString should not be empty.

When using "-V gendb:///mnt/ursus/GROUP-sbifh3/c1845371/whole_genome/data_dog/align_out/my_database12 "

B- A USER ERROR has occurred: GenomicsDB workspace drivingVariantFile:gendb:///mnt/ursus/GROUP-sbifh3/c1845371/whole_genome/data_dog/align_out/my_database12_ does not exist

When using " -V gendb:///mnt/ursus/GROUP-sbifh3/c1845371/whole_genome/data_dog/align_out/my_database12_"$name"

I think the most non-understandable for me is 'ch3_' in " V gendb://db/ch3_"$name" ". Is it the prefix? similar to NC_ in 'NC_051844.1$1$3937623' in my-database-12 file.

What is the correct .txt file format of the four attached files to be linked to those in "my-database_12"?

I have attached my bash script and other files for your reference.

Once agian, thanks for your help and feel free to refuse to help if you are busy.

Yours sincerely,
Ali

to macurqcron.zip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants