Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Local alleles #266

Merged
merged 13 commits into from
Jul 10, 2024
Merged

Local alleles #266

merged 13 commits into from
Jul 10, 2024

Conversation

Will-Tyler
Copy link
Contributor

@Will-Tyler Will-Tyler commented Jul 3, 2024

Description

In this pull request, vcf2zarr computes a genotype-level, local alleles field, LAA, during the explode step unless specifically told not to. The LAA field is a list of one-based indices into the variant-level ALT field that indicates which alleles are relevant (local) for the current sample. The source of truth for the LAA field is the GT, AD, and PL genotype fields.

vcf2zarr does not try to compute the LAA field if the LAA field is already present in a genotype entry.

With these changes, vcf2zarr computes the LAA field by default. To prevent vcf2zarr from introducing the LAA field, the user can specify --local-alleles false when running the vcf2zarr explode CLI.

This pull request makes progress on #185.

Testing

I updated and added several unit tests. I am happy to add more tests.

I was not able to use the bcftools tag2tag plugin as discussed in #185 because generating local-allele fields is not actually implemented in the plugin yet. (See here.)

Discussion

Format of the call_LAA field

To simplify the format of the call_LAA field, vcf2zarr always assigns three-dimensional values to the LAA field, filling space with the fill value, -2, where necessary. If none of the entries use alternate alleles, then each entry's LAA is [-2]. When an entry does use alternate alleles, the alternate allele indices come first in sorted order followed by any necessary fill values. For example, [2, 5, 6, -2, -2] is a valid LAA value.

Deciding when to localize

The issue (#185) seems to focus on the PL field. vcf2zarr only introduces the local-allele fields if the PL field is present in the VCF file, and the user has not specified to not introduce local-allele fields.

Inferring which alleles are used in the PL field

When the ploidy is one, the index of the PL corresponds to the index of the allele.

When the ploidy is two, then the index of the PL value corresponds to a diploid genotype. Given a positive PL value, we want to determine which alleles are in the genotype associated with the PL value. Given a genotype $a/b$, the index of the associated PL value is $$\frac{b(b+1)}{2} + a.$$

Suppose the index of the positive PL value is $n$. To determine $b$, we can momentarily set $a = b$ and solve $$n = \frac{b(b+1)}{2} + b.$$

We have $$2n = b^2 + 3b$$ $$\implies 2n + \frac{9}{4} = b^2 + 3b + \frac{9}{4}$$ $$\implies 2n + \frac{9}{4} = \left( b + \frac{3}{2} \right)^2$$ $$\implies \sqrt{2n + \frac{9}{4}} = b + \frac{3}{2}$$ $$\implies b = \sqrt{2n + \frac{9}{4}} - \frac{3}{2}.$$

Since $a$ can be less than $b$, we take the ceiling: $$b = \left\lceil \sqrt{2n + \frac{9}{4}} - \frac{3}{2} \right\rceil .$$

Determining $a$ is then simple: $$a = n - \frac{b(b+1)}{2}.$$

References

  • VCFv4.5 specification PDF
    • This VCF specification defines the LAA field as well as the other local-allele genotype fields. The latest version, v4.5, was apparently finalized and published last week.
  • The Scalable Variant Call Representation paper
    • This paper introduces the local-alleles approach for the Scalable Variant Call Representation and highlights the motivation for introducing local alleles.

@coveralls
Copy link
Collaborator

Coverage Status

coverage: 98.868% (-0.03%) from 98.895%
when pulling 8547ace on Will-Tyler:local-alleles
into 69ee0e8 on sgkit-dev:main.

@coveralls
Copy link
Collaborator

Coverage Status

coverage: 98.91% (+0.02%) from 98.895%
when pulling 8547ace on Will-Tyler:local-alleles
into 69ee0e8 on sgkit-dev:main.

Copy link
Contributor

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start, good job! Some questions:

  1. Why do we just count the 1-based ALT alleles? It's the same thing to count all alleles, zero based isn't it?
  2. See the question about AD
  3. I'd like to see some non-trivial test data. Would you mind adding an example file based on the SVCR paper example? (Note that the du test will break when you add a new file - this is a bit clunky, sorry)
  4. I think the application to PLs is the important thing here. Otherwise, we don't really care about creating LAA at all, so I think we should try adding in LPL pretty soon so we're sure we're going in the right direction.

laa_val = [set() for _ in range(sample_count)]

if "GT" in variant.FORMAT:
for sample_index, genotype in enumerate(variant.genotypes):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be too slow for production use I'm afraid - we can't have any Python loops over the samples, which is O(10^6). Can we do the same thing using numpy operations?

For example, would something like np.bincount(variant.genotypes[some_selector]) do the trick?

Also, might as well count the reference alleles as well, as we're at it?

bio2zarr/vcf2zarr/icf.py Show resolved Hide resolved
]
laa_val = np.array(laa_val)
tcw.append("FORMAT/LAA", laa_val)
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor style thing here, but I'd rather use an explicit else on the other case than a continue here. I tend to use these sparingly, and for "early exit" close to the top of the loop

bio2zarr/vcf2zarr/verification.py Show resolved Hide resolved
@@ -531,6 +545,36 @@ def test_call_AD(self, ds):
]
nt.assert_array_equal(ds.call_AD.values, call_AD)

def test_call_LAA(self, ds):
# The shape is (23, 3, 1).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well use np.full((23, 3, 1), -2, dtype=np.int32) here then

@Will-Tyler
Copy link
Contributor Author

Thanks for the feedback! Let me try to answer your questions here.

Why do we just count the 1-based ALT alleles? It's the same thing to count all alleles, zero based isn't it?

I assume we want the result of localizing genotype fields in vcf2zarr to be the same as if the genotype fields were already localized in the input VCF file according to the VCF specification. Therefore, the localized genotype fields produced by vcf2zarr should also follow the VCF specification.

The VCF specification dictates that if any local-allele genotype field is present, then the LAA field must be present. The LAA field is by definition a list of the one-based indices into the ALT variant field that indicates which alleles are observed in the sample at the variant.

I think staying consistent with the VCF specification helps keep the VCF Zarr representation simple. For example, if VCF Zarr data consumers encounter a call_LPL field, then they know to use the call_LAA field to determine which genotypes the values of the call_LPL field correspond to.

What's the reasoning for looking at AD as well?

The AD field might have positive values for alleles that are not called in the GT field. This is demonstrated in the LAA example in the v4.5 specification (page 14) reproduced here:

POS REF ALT FORMAT sample
1 G A,C,T,<*> GT:AD:PL 2/2:20,.,30,.,10:90,.,.,80,.,0,.,.,.,.,100,.,110,.,120

I think inferring which alleles are observed with the PL field as well as you suggested is a good idea. I will look into this.

Would you mind adding an example file based on the SVCR paper example?

I am happy to add more tests—thanks for the suggestion!

I think the application to PLs is the important thing here. Otherwise, we don't really care about creating LAA at all, so I think we should try adding in LPL pretty soon so we're sure we're going in the right direction.

Good to know—I wanted to create a pull request after implementing the LAA field to get feedback, make sure I was on the right track, and keep the pull request size manageable. I will modify these changes so that vcf2zarr does not introduce local-allele fields unless the PL field is present in the input VCF. Once the LAA field changes look good, I can add the LPL field either as part of this pull request or a new pull request.


As I was thinking about this, I realized that vcf2zarr should not compute and add the LAA field if the LAA field is already present in the input VCF. I need to make some changes for that.

@coveralls
Copy link
Collaborator

Coverage Status

coverage: 98.882% (-0.01%) from 98.895%
when pulling 46cf7bb on Will-Tyler:local-alleles
into 69ee0e8 on sgkit-dev:main.

@coveralls
Copy link
Collaborator

Coverage Status

coverage: 98.881% (-0.01%) from 98.895%
when pulling 819820a on Will-Tyler:local-alleles
into 69ee0e8 on sgkit-dev:main.

@coveralls
Copy link
Collaborator

Coverage Status

coverage: 98.922% (+0.03%) from 98.895%
when pulling 819820a on Will-Tyler:local-alleles
into 69ee0e8 on sgkit-dev:main.

@Will-Tyler
Copy link
Contributor Author

I made some changes, but I'm not quite ready for a new review. I want to try simplifying the compute_LAA_field method. I will use the request a new review GitHub feature when I am ready.

@coveralls
Copy link
Collaborator

coveralls commented Jul 8, 2024

Coverage Status

coverage: 98.885% (+0.03%) from 98.855%
when pulling 394b2ec on Will-Tyler:local-alleles
into 3e23768 on sgkit-dev:main.

a = (n - b * (b + 1) / 2).astype(int)
else:
# TODO: Handle all possible ploidy
raise ValueError(f"Cannot handle ploidy = {ploidy}")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is causing the coverage to fail. Let me know if you want me to add a unit test for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to cover this case all right. Can we generate a simple triploid one-line VCF with a PL and index on the fly, like we do in the simulated tests?

We can log an issue as something to do in a follow up, if it's too messy though.

Copy link
Contributor

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great @Will-Tyler, I think you've basically nailed it. Some minor comments.

Were you planning on adding LPL in for this PR, or doing in a follow up? I'm happy either way.

bio2zarr/cli.py Outdated
@@ -149,6 +149,14 @@ def list_commands(self, ctx):
help="An approximate bound on overall memory usage (e.g. 10G),",
)

local_alleles = click.option(
"--local-alleles",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe an explicit flag like @click.option('--local-alleles/--no-local-alleles', default=True) would be better? I think this is what click likes, and the style I've adopted elsewhere in the CLI.

https://click.palletsprojects.com/en/8.1.x/options/#boolean-flags

Comment on lines +500 to +515
# The last element of each sample's genotype indicates the phasing
# and is not an allele.
genotypes = variant.genotype.array()[:, :-1]
genotypes.clip(0, None, out=genotypes)
genotype_allele_counts = np.apply_along_axis(
np.bincount, axis=1, arr=genotypes, minlength=allele_count
)
allele_counts += genotype_allele_counts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

a = (n - b * (b + 1) / 2).astype(int)
else:
# TODO: Handle all possible ploidy
raise ValueError(f"Cannot handle ploidy = {ploidy}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to cover this case all right. Can we generate a simple triploid one-line VCF with a PL and index on the fly, like we do in the simulated tests?

We can log an issue as something to do in a follow up, if it's too messy though.

@@ -458,6 +490,87 @@ def sanitise_value_int_2d(buff, j, value):
buff[j, :, : value.shape[1]] = value


def compute_laa_field(variant) -> np.ndarray:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to add a quick summary of what we're doing here, as the process is somewhat involved (necessarily). So, we're talking the local alleles to be anything that's in the genotypes, or has an allele depth of > 0, or is referenced in the PL field?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. I suppose that ideally, the code would infer which alleles are observed based on all the fields that can be localized. If we wanted to start localizing another genotype field in the future, we would need to update compute_laa_field to make sure that the LAA value contains all the alleles used in the genotype field.

@Will-Tyler
Copy link
Contributor Author

Were you planning on adding LPL in for this PR, or doing in a follow up? I'm happy either way.

In that case, I would like to add LPL in a follow-up PR. I think this one is large enough as is.

@jeromekelleher
Copy link
Contributor

Great, let's merge. Can you rebase please?

@Will-Tyler
Copy link
Contributor Author

Should be rebased now

@jeromekelleher jeromekelleher merged commit d1e3e09 into sgkit-dev:main Jul 10, 2024
11 checks passed
@Will-Tyler
Copy link
Contributor Author

Will-Tyler commented Jul 10, 2024 via email

@jeromekelleher
Copy link
Contributor

Thanks for contributing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants