-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix known bugs in genotyping and look at confidence distribution #11
Comments
At this commit on branch fix_poisson: commit 85c17cc
Results are looking pretty interesting. I've taken a binary of NA12878 (this is a human sample, and have used Platinum genome data, cleaning with threshold 7), and intersected with 5000 random OMNI sites. I then compare with the 1000g calls at those sites as truth I then calculate confusion matrices of genotyping for v1.0.5.21 of Cortex and for this commit, and compare them. To clarify the signal, I produce confusion matrices for calls with genotype confidence between 0 and 9, and for those between 10 and 19, and... Notice that we now put more stuff in the low confidence bin, but by the time confidence>10, in the new commit, we lose much fewer hets, but we do call more as missing (which is bizarre). So apart from the extra missingness, I am v happy |
Oh no it's OK! There are a bunch of sites which we call as missing - in the old commit they have low confidence and now they have higher. I think the missingness is likely because these OMNI sites are actually adjacent to indels. There is a separate question as to what the confidence of a ./. call is - will look into this |
Merged into master here |
I just want to check the genotyping of longer alleles and then ok to close. All unit tests pass including with valgrind. |
'There is a separate question as to what the confidence of a ./. call is - will look into this' Would indeed be great if we could distinguish between truly missing (deletion) and uncertainty in calling. Back to my lectures now, just had to add this. |
OK, so in two parts. Why does a ./. call have a confidence? When Cortex genotypes a bubble, you pass it a model. That model is diploid or haploid at the moment, and allows you to be either hom-ref, het, or hom-alt (for diploid, haploid is either hom-ref or hom-alt). Now suppose one allele is longer than another. And suppose we have zero covg on both alleles. Well then, the most likely model is homozygous for the shorter allele, as more chance of having zero covg on two short alleles. So, Cortex types it thus. Then, a wrapper script converts the cortex output to a VCF, and forcibly sets genotype to ./. if there is zero coverage on both alleles, as that is what most consumers want, but leaves the confidence unchanged. Can we distinguish truly missing from uncertainty in calling |
To fix:
The text was updated successfully, but these errors were encountered: