invariant vcf with ALT "*" instead of "." #105
Comments
Just to follow up: I was unaware GitHub also had "closed issues" before posting. It seems similar questions about --bypass_invariant_check have been asked and answered there, but my question still seems unique, so I will leave it up.
Hi, Sorry, I can't help with this issue. I'm just wondering how you filtered invariants. Did you split the allsites vcf into variant and invariant files, filter each one separately, and then concatenate them into one vcf file? Filtering variants (SNPs) seems straightforward, but I'm confused about how to filter invariants. I have an invariants file with a lot of missing data, and filtering dropped more than 99% of it (from 116,000,000 to 1,000,000 lines). Any help would be appreciated, thanks.
Hi Ali, I did split the allsites vcf into variant and invariant files. I already had a variant file I made previously, so I focused on creating the invariant file from the allsites vcf. Once I had the two files, I concatenated them to create the filtered allsites vcf to use with pixy. First, I had to create an allsites vcf by calling all sites (variant and invariant) during the SNP calling step. This is the code I used:
I parallelized my code on SLURM; the first line after loading the modules just refers to a text file that has the individual chromosome names from my reference genome. The important addition here is

To make my invariant vcf, I ended up creating a series of lists and then filtering based on those lists, since trying to directly filter invariant sites using bcftools or vcftools wasn't working. The key thing I used was that invariant sites are denoted by "*" in the ALT column. First, I filtered the allsites vcf based on quality (I also tried to filter on MAF = 0, but that didn't work to remove variant sites)
Then, to create my lists, I used my already made variant vcf file as a guideline for the different filtering thresholds
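For readers who want a concrete picture of a site list, here is a minimal Python sketch. It is not the commands used in this comment. It assumes an uncompressed, tab-separated VCF and simply collects the CHROM and POS of every record whose ALT field is "*" or "." (the two invariant markers mentioned in this thread). The function name `sites_with_alt` and the example records are hypothetical.

```python
def sites_with_alt(vcf_lines, alt_symbols=("*", ".")):
    """Return (chrom, pos) pairs for records whose ALT field is in alt_symbols.

    Assumes plain-text VCF lines with tab-separated columns, where
    column 1 is CHROM, column 2 is POS and column 5 is ALT.
    """
    sites = []
    for line in vcf_lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, alt = fields[0], fields[1], fields[4]
        if alt in alt_symbols:
            sites.append((chrom, int(pos)))
    return sites


# Hypothetical records, for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\t*\t50\tPASS\t.\n",
    "chr1\t101\t.\tC\tT\t60\tPASS\t.\n",
    "chr1\t102\t.\tG\t.\t40\tPASS\t.\n",
]

for chrom, pos in sites_with_alt(example_vcf):
    # One "CHROM<TAB>POS" pair per line, a layout that a
    # bcftools -T targets file can use.
    print(f"{chrom}\t{pos}")
```

A list like this still has to be narrowed with whatever missing-data thresholds were used for the variant file before it is passed to bcftools.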
Then, I used bcftools to create my invariant vcf by filtering my allsites vcf for those sites specifically, and by removing individuals with missing data that I had identified in my previous work creating my variant vcf
Now I have an invariant vcf that has been filtered for missing data per site (-T option) and for missing data and low depth per individual (-S option). I will examine what individual and site depth look like, to further filter the data by site depth, and then I will concatenate the invariant file with my previously filtered variant file (which has been used for all other analyses)
Next, I will generate a list of sites within a certain depth threshold (see what I used to filter the variant data OR look at the distributions in R) and filter the invariant data (the .ldepth.mean text file) based on that information. This gives my final invariant vcf (filtered by depth), which I will then concatenate to my filtered variant vcf (depth, quality, missing data and missing/low depth individuals). The step filters values by column 3, selects only the first two columns to print, and removes the header. Here I am filtering by a minimum depth of 10 and a maximum depth of 55, selected based on my mean depth (~25x)
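The depth step described here can be sketched in Python as follows. This is an illustration under stated assumptions, not the original command. It assumes the .ldepth.mean file has the column layout written by vcftools --site-mean-depth (CHROM, POS, MEAN_DEPTH, VAR_DEPTH) and that both bounds are inclusive. The function name and example rows are hypothetical.

```python
def sites_in_depth_range(ldepth_lines, min_depth=10.0, max_depth=55.0):
    """Return (chrom, pos) for sites whose mean depth lies in [min_depth, max_depth].

    Expects whitespace-separated rows with a one-line header, where
    column 3 holds the mean depth.
    """
    rows = iter(ldepth_lines)
    next(rows, None)  # drop the header line
    kept = []
    for line in rows:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        mean_depth = float(fields[2])
        # A "nan" depth fails both comparisons and is therefore dropped.
        if min_depth <= mean_depth <= max_depth:
            kept.append((fields[0], fields[1]))  # keep only CHROM and POS
    return kept


# Hypothetical rows, for illustration only.
example_ldepth = [
    "CHROM\tPOS\tMEAN_DEPTH\tVAR_DEPTH",
    "chr1\t100\t9.5\t1.0",
    "chr1\t101\t25\t2.0",
    "chr1\t102\t55\t3.0",
    "chr1\t103\t60.2\t1.5",
]

for chrom, pos in sites_in_depth_range(example_ldepth):
    print(f"{chrom}\t{pos}")
```

The printed two-column list has the same shape as the site lists used earlier, so it can be used to subset the invariant vcf in the same way.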
Now you have a file ready for pixy.
Hope this helps. I understand this might be overwhelming, but it is how I accomplished generating an allsites vcf.
Hi Mackenzie, First of all, I'm sorry for the late response, and thanks so much for the detailed explanation. I'm still stuck on the first step, and I think I don't fully understand how to parallelize my code on SLURM. I tried a lot of things, but all of them gave errors. Please see some attached examples of error files. The errors seem to be related to two main things:
error:
error: When using "-V gendb:///mnt/ursus/GROUP-sbifh3/c1845371/whole_genome/data_dog/align_out/my_database12"
B- A USER ERROR has occurred: GenomicsDB workspace drivingVariantFile:gendb:///mnt/ursus/GROUP-sbifh3/c1845371/whole_genome/data_dog/align_out/my_database12_ does not exist
When using "-V gendb:///mnt/ursus/GROUP-sbifh3/c1845371/whole_genome/data_dog/align_out/my_database12_"$name""

The part I understand least is 'ch3_' in "-V gendb://db/ch3_"$name"". Is it the prefix, similar to NC_ in 'NC_051844.1$1$3937623' in the my-database-12 file? What is the correct .txt file format for the four attached files so that they link to those in "my-database_12"? I have attached my bash script and other files for your reference. Once again, thanks for your help, and feel free to refuse to help if you are busy. Yours sincerely,
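One thing the second message shows is that the workspace path ends in `my_database12_`, which is what the string looks like if `$name` expanded to nothing. The Python sketch below illustrates that pattern. It is a guess at the setup, not the script from this thread. It assumes one chromosome name per line in the text file and a 1-based task index such as SLURM's `SLURM_ARRAY_TASK_ID`, and it reads 'ch3_' as a workspace-name prefix, which only the author of the script can confirm. The function name and the chromosome names are hypothetical.

```python
def workspace_for_task(chrom_names, task_id, prefix):
    """Build a gendb:// workspace URI for one array task.

    chrom_names: chromosome names, one per entry, in file order.
    task_id: 1-based index of the current task.
    prefix: path plus any workspace-name prefix, e.g. "db/ch3_".
    """
    if not 1 <= task_id <= len(chrom_names):
        raise IndexError(f"task {task_id} has no matching chromosome name")
    name = chrom_names[task_id - 1].strip()
    if not name:
        # An empty name would give a path ending in the bare prefix,
        # like the "my_database12_" in the error message.
        raise ValueError(f"chromosome name for task {task_id} is empty")
    return f"gendb://{prefix}{name}"


# Hypothetical chromosome names, for illustration only.
names = ["chrA", "chrB", ""]
print(workspace_for_task(names, 2, "db/ch3_"))
```

Under this reading, each chromosome has its own workspace directory, and the directory name must match prefix plus chromosome name exactly.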
I have an allsites vcf, containing invariant and variant sites, that I have properly filtered (e.g., missing data, depth). Since I used vcftools at a certain stage [to update, I have tried without using vcftools and using only bcftools and the invariant sites still are transformed to "*"], the "." symbols in the ALT column (what bcftools recognizes as an invariant site) have been changed to "*". When running pixy, I got the warning/solution
"ERROR: the provided VCF appears to contain no invariant sites (ALT = "."). This check can be bypassed via --bypass_invariant_check 'yes'."
My question is: if I move forward with --bypass_invariant_check 'yes', will that impact the algorithm's ability to differentiate between variant and invariant sites? And if so, is the best way forward to figure out how to change instances of "*" to "." in the ALT column of my allsites vcf?
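For reference, the "*" to "." replacement itself could be done with something like the Python sketch below. It assumes an uncompressed, tab-separated VCF and only rewrites records whose ALT field is exactly "*", so multi-allelic entries such as "T,*" are left alone. It does not settle whether the rewritten records are then treated correctly by pixy. Note also that the VCF specification uses "*" for an allele that is missing because of an overlapping deletion, so it is worth checking that the "*" records really are invariant before converting them. The function name and example records are hypothetical.

```python
def star_alt_to_dot(vcf_lines):
    """Return VCF lines with ALT rewritten from "*" to "." where ALT is exactly "*".

    Header lines (starting with "#") are passed through unchanged.
    Each returned data line ends with a newline.
    """
    out = []
    for line in vcf_lines:
        if line.startswith("#"):
            out.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 5 (index 4) is ALT; leave multi-allelic values untouched.
        if len(fields) > 4 and fields[4] == "*":
            fields[4] = "."
        out.append("\t".join(fields) + "\n")
    return out


# Hypothetical records, for illustration only.
example_records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\t*\t50\tPASS\t.\n",
    "chr1\t101\t.\tC\tT,*\t60\tPASS\t.\n",
]

print("".join(star_alt_to_dot(example_records)), end="")
```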