Is it valid VCF not to 'squash' positions with more than one ALT allele? #52
Comments
Here is another confusing case:
Actually I don't know how to interpret this at all. At first it seems to call me CG, then it calls me TT? Something seems wrong (of course it could be my understanding ;-) |
Not sure how useful it is to keep sharing random examples:
I'm guessing this is calling CA? Here is my code:
from collections import defaultdict
from cyvcf2 import VCF
import logging

logging.basicConfig(level=logging.INFO)

imputed_vcf_files = []
for chr in [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,'X']:
    imputed_vcf_files.append("/home/dan/Downloads/90365083240/" +
        f"Sanger Imputation Server/hrc-eagle2.vcfs/{chr}.vcf.gz")

for f in imputed_vcf_files:
    logging.info(f"opening vcf '{f}'")
    # These should be global, but put here to save memory on my laptop.
    imputed_snps_by_pos = defaultdict(dict)
    imputed_snps_by_ids = dict()
    imputed_snps_to_kil = dict()
    what_to_call = defaultdict(int)
    for v in VCF(f):
        # Same position, different rsID?
        if v.POS in imputed_snps_by_pos[v.CHROM]:
            x = imputed_snps_by_ids[
                imputed_snps_by_pos[v.CHROM][v.POS]
            ]
            if v.ID != x.ID:
                if v.ID is None:
                    logging.debug(
                        f"We fixed an ID from the imputation file: {v.POS}: {v.ID} {x.ID}")
                    v.ID = x.ID
                else:
                    logging.info(f"Same positions, different id: {v.POS}: {v.ID} {x.ID}")
                    imputed_snps_to_kil[v.ID] = True
                    imputed_snps_to_kil[x.ID] = True
            else:
                # There are reasons for this (see below)
                if v.ALT == x.ALT:
                    logging.warning(f"WHAT?: {v.ID} {x.ID}")
        if v.ID is None:
            logging.debug(
                f"Missing ID in the imputation file: {v.POS}: {v.ID}")
            continue
        if v.ID in imputed_snps_by_ids:
            x = imputed_snps_by_ids[v.ID]
            imputed_snps_to_kil[v.ID] = True
            imputed_snps_to_kil[x.ID] = True
            # I'd expect anything at this point...
            assert v.CHROM == x.CHROM, v.ID
            # First weirdness
            if v.POS != x.POS:
                dist = v.POS - x.POS
                logging.info(
                    f"Same id, different positions: '{v.ID}': {v.POS}, {x.POS}, {dist}"
                )
            else:
                # Multiple ALT alleles are represented as bi-allelic
                # in the imputation output (each alt allele is put on
                # a new line).
                # This would be fine, but how to interpret the actual
                # genotype call?
                # First, check it's only ever ref or (single) alt (the
                # file would be invalid VCF otherwise)
                assert v.genotypes[0][0] in [0, 1], v.ID
                assert v.genotypes[0][1] in [0, 1], v.ID
                assert x.genotypes[0][0] in [0, 1], v.ID
                assert x.genotypes[0][1] in [0, 1], v.ID
                # I've given up trying to interpret the actual
                # genotype, let's just log them:
                what_to_call[ str(v.genotypes[0][0]) +
                              str(v.genotypes[0][1]) +
                              str(x.genotypes[0][0]) +
                              str(x.genotypes[0][1]) ] += 1
        imputed_snps_by_pos[v.CHROM][v.POS] = v.ID
        imputed_snps_by_ids[v.ID] = v
    for genotype, count in what_to_call.items():
        print(genotype, count)
|
Hello Dan. From the perspective of PBWT (and all other phasing/imputation tools I know), it is necessary for variants to be biallelic. So there is a choice: either drop multi-allelic sites, or split them as in your examples.
There is also a scientific interpretation of this. Each mutation is an atomic event that creates (at most) one new allele. ("At most" because it is possible that one mutation reverts a previous mutation at a site, or that two independent mutations create the same alternative allele.) This means that, in order to get a tri-allelic site, there must be (at least) two mutations. In principle we could think of the two VCF lines as indicating the separate states of those two mutations.
There are a couple of problems with this interpretation: first, the reference is not always the ancestral allele for both mutations; second, the genotypes at the two coincident sites are not independent: certain combinations are not permitted. This is a very similar issue to when a point (biallelic) variant sits within a deletion variant.
Anyway, this is a long way of saying that PBWT, and other programs like IMPUTE, MACS, EAGLE etc., can all handle having two biallelic variants at the same site, and cannot handle triallelic sites. In order to run them we either drop the multi-allelic sites, or split them into combinations of biallelic sites. |
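To make the splitting concrete, here is a minimal Python sketch of how one multi-allelic genotype could be turned into one biallelic line per ALT allele. It is an illustration only, not the code used by PBWT or the imputation server. The function name `split_multiallelic` and the example site are hypothetical, and the recoding rule (an allele becomes 1 if it is that line's ALT and 0 otherwise) is an assumption; other tools may mark the "other ALT" allele as missing instead.

```python
from typing import Dict, List, Tuple


def split_multiallelic(ref: str, alts: List[str], gt: Tuple[int, int]) -> List[Dict]:
    """Split one multi-allelic genotype into one biallelic record per ALT.

    Assumption (not from the thread): on each split line an allele is coded 1
    if it is that line's ALT and 0 otherwise, so 0 means "not this ALT"
    rather than strictly "REF".
    """
    records = []
    for index, alt in enumerate(alts, start=1):
        # Recode each haplotype's allele relative to this ALT only.
        recoded = tuple(1 if allele == index else 0 for allele in gt)
        records.append({"ref": ref, "alt": alt, "gt": recoded})
    return records


# Hypothetical tri-allelic site: REF=C, ALT=G,T, phased genotype 0|2 (C and T).
lines = split_multiallelic("C", ["G", "T"], (0, 2))
for line in lines:
    print(line["ref"], line["alt"], "|".join(map(str, line["gt"])))
```

Under this assumption the G line reads 0|0 and the T line reads 0|1, so the G line on its own looks like REF/REF even though one haplotype carries T.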
Thanks for this detailed explanation. However, a question remains: is it valid VCF? I take your point that this creates complex dependencies between states, but I simply (naively?) want to interpret the results as being in one state or another. Given that the scores are the same for the different lines, does this mean that either possible genotype at that position is equally likely in my genome (based on a given population)? Just to be absolutely clear (probably a sign that I'm confused ;-), what is my imputed genotype at this position:
Is it GC or TT? |
Here are the range and count of bi-allelic genotype calls I'm seeing on chromosome three from the code above:
If you could tell me how to interpret each of these, that would be great. Please note, these could be tri- / quad-allelic SNPs, but I'm only ever collecting data as a pair: first position = 1st allele, second position = 2nd allele. |
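One possible way to read the four-digit keys produced by the script is sketched below. It assumes the genotypes are phased, so that the first digit of each pair refers to the same haplotype on both lines, and that 0 on a line means "not this line's ALT". Both are assumptions rather than anything confirmed in this thread, and the name `interpret` and the placeholder allele labels are hypothetical.

```python
from itertools import product


def interpret(code: str, ref: str = "REF", alt_v: str = "ALT_v", alt_x: str = "ALT_x"):
    """Decode a four-digit tally key: (v hap1, v hap2, x hap1, x hap2).

    Returns the allele carried by each haplotype, or None for a haplotype
    that would carry both ALT alleles at once, which a single site cannot do.
    """
    v = (int(code[0]), int(code[1]))
    x = (int(code[2]), int(code[3]))
    haplotypes = []
    for v_allele, x_allele in zip(v, x):
        if v_allele and x_allele:
            haplotypes.append(None)      # inconsistent combination
        elif v_allele:
            haplotypes.append(alt_v)
        elif x_allele:
            haplotypes.append(alt_x)
        else:
            haplotypes.append(ref)
    return tuple(haplotypes)


# Walk through all sixteen possible keys.
for digits in product("01", repeat=4):
    code = "".join(digits)
    print(code, interpret(code))
```

Any key that decodes to a `None` haplotype would be one of the combinations that the split representation should not permit.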
Something like this?
|
Any follow up? Is the problem clear? |
If you want to model multiallelic and complex nested variation, you can
build a variation graph and apply the GBWT. This implicitly generalizes the
PBWT to multiallelic states. But you don't have a global order anymore.
On Mon, Mar 14, 2022, 15:30 Dan Bolser wrote:
Any follow up?
|
Many thanks for your comments Erik.
Is this what the complex alleles are trying to tell me? e.g.
Or are you saying the data is actually missing from the VCF (or indeed the
Thanks for your patience with these beginner questions; I find VCF mind-bending.
Would a call be better to discuss this?
Cheers, |
Coming back to this, is it just a bug and you don't want to say? |
I'm seeing output that looks like this:
The first line says that my genome is CC at this position, but the second line (for the second alt allele) says that my genome is CT at this position. OK, the first line couldn't say this, so it calls me as REF/REF, but this call has to be interpreted in the context of the second call, and can't be taken at face value. That's why I wonder if this is valid VCF?
I think the lines should be 'squashed' down to the following:
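The squashing step itself can be sketched in Python. This is only an illustration under assumptions that are not from the imputation output: the record layout (a dict with `chrom`, `pos`, `ref`, `alt`, `gt`), the function name `squash`, and the example coordinates and first ALT allele are all hypothetical, and the genotypes are assumed to be phased pairs of 0/1 codes.

```python
def squash(records):
    """Merge biallelic records at one position into one multi-allelic record.

    Each record is a hypothetical dict with keys chrom, pos, ref, alt and gt,
    where gt is a pair of 0/1 allele codes, one per phased haplotype.
    """
    first = records[0]
    site = (first["chrom"], first["pos"], first["ref"])
    if any((r["chrom"], r["pos"], r["ref"]) != site for r in records):
        raise ValueError("records must share CHROM, POS and REF")

    merged_gt = []
    for hap in range(2):
        # 1-based indexes of the ALT alleles this haplotype is said to carry.
        carried = [i for i, r in enumerate(records, start=1) if r["gt"][hap] == 1]
        if len(carried) > 1:
            raise ValueError(f"haplotype {hap + 1} carries more than one ALT")
        merged_gt.append(carried[0] if carried else 0)

    return {
        "chrom": first["chrom"],
        "pos": first["pos"],
        "ref": first["ref"],
        "alt": ",".join(r["alt"] for r in records),
        "gt": "|".join(map(str, merged_gt)),
    }


# Hypothetical pair of lines: REF/REF on the first, REF/ALT on the second.
pair = [
    {"chrom": "1", "pos": 1000, "ref": "C", "alt": "G", "gt": (0, 0)},
    {"chrom": "1", "pos": 1000, "ref": "C", "alt": "T", "gt": (0, 1)},
]
squashed = squash(pair)
print(squashed["ref"], squashed["alt"], squashed["gt"])
```

The sketch refuses to merge a haplotype that is marked ALT on more than one line, since that is one of the non-permitted combinations. It also assumes both lines use the same REF string, which may not hold for indels.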