Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is it valid VCF not to 'squash' positions with more than one ALT allele? #52

Open
CholoTook opened this issue Mar 9, 2022 · 10 comments

Comments

@CholoTook
Copy link

CholoTook commented Mar 9, 2022

I'm seeing output that looks like this:

3	112841	rs78923776	C	G	.	PASS	RefPanelAF=0.000153988;AN=2;AC=0;INFO=1	GT:ADS:DS:GP	0|0:0,0:0:1,0,0
3	112841	rs78923776	C	T	.	PASS	RefPanelAF=0.078534;AN=2;AC=1;INFO=1	GT:ADS:DS:GP	0|1:0,1:1:0,1,0

The first line says that my genome is CC at this position, but the second line (for the second alt allele) says that my genome is CT at this position. OK, the first line couldn't say this, so it calls me as REF/REF, but this call has to be interpreted in the context of the second call, and can't be taken at face value. That's why I wonder if this is valid VCF?

I think the line should be 'squashed' down to the following (I think):

3	112841	rs78923776	C	G,T	.	PASS	RefPanelAF=0.000153988,0.078534;AN=3;AC=0,1;INFO=1	GT:ADS:DS:GP	0|2:0,0:0:1,0,0
@CholoTook
Copy link
Author

Here is another confusing case:

3	255395	rs331869	C	G	.	PASS	RefPanelAF=0.385864;AN=2;AC=1;INFO=1	GT:ADS:DS:GP	1|0:1,0:1:0,1,0
3	255395	rs331869	C	T	.	PASS	RefPanelAF=0.980998;AN=2;AC=2;INFO=1	GT:ADS:DS:GP	1|1:1,1:2:0,0,1

Actually I don't know how to interpret this at all. At first it seems to call me CG, then it calls me TT?

Something seems wrong (of course it could be my understanding ;-)

@CholoTook
Copy link
Author

CholoTook commented Mar 9, 2022

Not sure how useful it is to keep sharing random examples:

3	6010536	rs2048924	T	C	.	PASS	RefPanelAF=0.962966;AN=2;AC=2;INFO=1	GT:ADS:DS:GP	1|1:1,1:2:0,0,1
3	6010536	rs2048924	T	A	.	PASS	RefPanelAF=0.955544;AN=2;AC=2;INFO=1	GT:ADS:DS:GP	1|1:1,1:2:0,0,1

I'm guessing this is calling CA?

here is my code:

from collections import defaultdict

from cyvcf2 import VCF

import logging
logging.basicConfig(level=logging.INFO)

imputed_vcf_files = []

for chr in [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,'X']:
    imputed_vcf_files.append("/home/dan/Downloads/90365083240/" +
    f"Sanger Imputation Server/hrc-eagle2.vcfs/{chr}.vcf.gz")

for f in imputed_vcf_files:
    logging.info(f"opening vcf '{f}'")

    # These should be global, but put here to save memory on my laptop.
    imputed_snps_by_pos = defaultdict(dict)
    imputed_snps_by_ids = dict()
    imputed_snps_to_kil = dict()

    what_to_call = defaultdict(int)

    for v in VCF(f):

        # Same position, different rsID?
        if v.POS in imputed_snps_by_pos[v.CHROM]:
            x = imputed_snps_by_ids[
                imputed_snps_by_pos[v.CHROM][v.POS]
            ]

            if v.ID != x.ID:
                if v.ID is None:
                    logging.debug(
                        f"We fixed an ID from the imputaion file: {v.POS}: {v.ID} {x.ID}")
                    v.ID = x.ID
                else:
                    logging.info(f"Same positions, different id: {v.POS}: {v.ID} {x.ID}")
                    imputed_snps_to_kil[v.ID] = True
                    imputed_snps_to_kil[x.ID] = True
            else:
                # There are reasons for this (see below)
                if v.ALT == x.ALT:
                    logging.warning(f"WHAT?: {v.ID} {x.ID}")

        if v.ID is None:
            logging.debug(
                f"Missing ID in the imputaion file: {v.POS}: {v.ID}")
            continue

        if v.ID in imputed_snps_by_ids:
            x = imputed_snps_by_ids[v.ID]
            imputed_snps_to_kil[v.ID] = True
            imputed_snps_to_kil[x.ID] = True

            # I'd expect anything at this point...
            assert v.CHROM == x.CHROM, v.ID

            # First weirdness
            if v.POS != x.POS:
                dist = v.POS - x.POS
                logging.info(
                    f"Same id, different positions: '{v.ID}': {v.POS}, {x.POS}, {dist}"
                )

            else:
                # Multiple ALT alleles are represented as bi-allelic
                # in the imputation output (each alt allele is put on
                # a new line).

                # This would be fine, but how to interpret the actual
                # genotype call?

                # First, check it's only ever ref or (single) alt (the
                # file would be invalid VCF otherwise)
                assert v.genotypes[0][0] in [0, 1], v.ID
                assert v.genotypes[0][1] in [0, 1], v.ID
                assert x.genotypes[0][0] in [0, 1], v.ID
                assert x.genotypes[0][1] in [0, 1], v.ID

                # I've given up trying to interpret the actual
                # genotype, lets just log them:

                what_to_call[ str(v.genotypes[0][0]) +
                              str(v.genotypes[0][1]) +
                              str(x.genotypes[0][0]) +
                              str(x.genotypes[0][1]) ] += 1

        imputed_snps_by_pos[v.CHROM][v.POS] = v.ID
        imputed_snps_by_ids[v.ID] = v

    for genotype, count in what_to_call.items():
        print(genotype, count)

@richarddurbin
Copy link
Owner

Hello Dan. From the perspective of PBWT (and all other phasing/imputation tools I know), it is necessary for variants to be biallelic. So there is a choice to either drop multi-allelic sites, or split them as in your examples. There is also a scientific interpretation of this. Each mutation is an atomic event that creates (at most) one new allele. ("At most" because it is possible that one mutation reverts a previous mutation at a site, or two independent mutations create the same alternative allele.) This means that, in order to get a tri-allelic site there must be (at least) two mutations. In principle we could think of the two VCF lines as indicating the separate state of those two mutations. There are a couple of problems with this interpretation: first, the reference is not always the ancestral allele for both mutations; second, the genotypes at the two coincident sites are not independent, certain combinations are not permitted. This is a very similar issue to when a point (biallelic) variant sits within a deletion variant.

Anyway, this is a long way of saying that PBWT, and other programs like IMPUTE, MACS, EAGLE etc. can all handle having two biallelic variants at the same site, and can not handle triallelic sites. In order to run them we either drop the multi-allelic sites, or split them into combinations of biallelic sites.

@CholoTook
Copy link
Author

CholoTook commented Mar 9, 2022

Thanks for this detailed explanation. However, a question remains, is it valid VCF?

I take your point that this creates complex dependencies between states, but I simply (naively?) want to interpret the results as being in one state or another.

Given the scores are the same for the different lines, does this mean that either possible genotype at that position is equally likely in my genome (based on a given population)?

Just to be absolutely clear (probably a sign that I'm confused ;-), what is my imputed genotype at this position:

3	255395	rs331869	C	G	.	PASS	RefPanelAF=0.385864;AN=2;AC=1;INFO=1	GT:ADS:DS:GP	1|0:1,0:1:0,1,0
3	255395	rs331869	C	T	.	PASS	RefPanelAF=0.980998;AN=2;AC=2;INFO=1	GT:ADS:DS:GP	1|1:1,1:2:0,0,1

Is it GC or TT?

@CholoTook
Copy link
Author

Here are the range and count of bi-allelic genotype calls I'm seeing on chromosome three from the code above:

|<----- Allele 1, 1st position (0 = REF, 1 = ALT)
||<---- Allele 1, 2nd position
|||<--- Allele 2, 1st position
||||<-- Allele 2, 2nd position 
0000 5040
0001 103
0010 103
0011 98
0100 87
0101 3
0110 1
0111 7
1000 69
1001 3
1010 1
1011 13
1100 88
1101 3
1110 6
1111 13

If you could tell me how to interpret each of these that would be great.

Please note, these could be tri- / quad-allelic SNPs, but I'm only ever collecting data as a pair, first position = 1st allele 2nd position = 2nd allele.

@CholoTook
Copy link
Author

Something like this?

          0000 = REF/REF
           
           1000 = ALT1/REF
           0100 = REF/ALT1
           
           0010 = ALT2/REF
           0001 = REF/ALT2

           1100 = ALT1/ALT1
           0011 = ALT2/ALT2

           0101 = ALT1/ALT2 ??
           1010 = ALT1/ALT2 ??

           0110 = ??
           1001 = ??

           0111 = ??

           1011 = ??
           1101 = ??
           1110 = ??

           1111 = ??

@CholoTook
Copy link
Author

CholoTook commented Mar 14, 2022

Any follow up?

Is the problem clear?

@ekg
Copy link

ekg commented Mar 15, 2022 via email

@CholoTook
Copy link
Author

Many thanks for your comments Erik.

... multiallelic and complex nested variation

Is this what the complex alleles are trying to tell me? e.g.

  • x y z G C ... 1|1 + x y z G T ... 1|1

Or are you saying the data is actually missing from the VCF (or indeed the
panel itself) and this mess is just a consequence of the format (data) not
being able to model the reality?

Thanks for patience with these beginner questions, I find VCF mind-bending
at the best of times ;-)

Would a call be better to discuss this?

Cheers,
Dan.

@CholoTook
Copy link
Author

Thanks for this detailed explanation. However, a question remains, is it valid VCF?

I take your point that this creates complex dependencies between states, but I simply (naively?) want to interpret the results as being in one state or another.

Given the scores are the same for the different lines, does this mean that either possible genotype at that position is equally likely in my genome (based on a given population)?

Just to be absolutely clear (probably a sign that I'm confused ;-), what is my imputed genotype at this position:

3	255395	rs331869	C	G	.	PASS	RefPanelAF=0.385864;AN=2;AC=1;INFO=1	GT:ADS:DS:GP	1|0:1,0:1:0,1,0
3	255395	rs331869	C	T	.	PASS	RefPanelAF=0.980998;AN=2;AC=2;INFO=1	GT:ADS:DS:GP	1|1:1,1:2:0,0,1

Is it GC or TT?

Coming back to this, is it just a bug and you don't want to say?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants