-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Local alleles #266
Local alleles #266
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great start, good job! Some questions:
- Why do we just count the 1-based ALT alleles? It's the same thing to count all alleles, zero based isn't it?
- See the question about AD
- I'd like to see some non-trivial test data. Would you mind adding an example file based on the SVCR paper example? (Note that the
du
test will break when you add a new file - this is a bit clunky, sorry) - I think the application to PLs is the important thing here. Otherwise, we don't really care about creating LAA at all, so I think we should try adding in LPL pretty soon so we're sure we're going in the right direction.
bio2zarr/vcf2zarr/icf.py
Outdated
laa_val = [set() for _ in range(sample_count)] | ||
|
||
if "GT" in variant.FORMAT: | ||
for sample_index, genotype in enumerate(variant.genotypes): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be too slow for production use I'm afraid - we can't have any Python loops over the samples, which is O(10^6). Can we do the same thing using numpy operations?
For example, would something like np.bincount(variant.genotypes[some_selector])
do the trick?
Also, might as well count the reference alleles as well, as we're at it?
bio2zarr/vcf2zarr/icf.py
Outdated
] | ||
laa_val = np.array(laa_val) | ||
tcw.append("FORMAT/LAA", laa_val) | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor style thing here, but I'd rather use an explicit else
on the other case than a continue
here. I tend to use these sparingly, and for "early exit" close to the top of the loop
tests/test_vcf_examples.py
Outdated
@@ -531,6 +545,36 @@ def test_call_AD(self, ds): | |||
] | |||
nt.assert_array_equal(ds.call_AD.values, call_AD) | |||
|
|||
def test_call_LAA(self, ds): | |||
# The shape is (23, 3, 1). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might as well use np.full((23, 3, 1), -2, dtype=np.int32)
here then
Thanks for the feedback! Let me try to answer your questions here.
I assume we want the result of localizing genotype fields in vcf2zarr to be the same as if the genotype fields were already localized in the input VCF file according to the VCF specification. Therefore, the localized genotype fields produced by vcf2zarr should also follow the VCF specification. The VCF specification dictates that if any local-allele genotype field is present, then the LAA field must be present. The LAA field is by definition a list of the one-based indices into the ALT variant field that indicates which alleles are observed in the sample at the variant. I think staying consistent with the VCF specification helps keep the VCF Zarr representation simple. For example, if VCF Zarr data consumers encounter a call_LPL field, then they know to use the call_LAA field to determine which genotypes the values of the call_LPL field correspond to.
The AD field might have positive values for alleles that are not called in the GT field. This is demonstrated in the LAA example in the v4.5 specification (page 14) reproduced here:
I think inferring which alleles are observed with the PL field as well as you suggested is a good idea. I will look into this.
I am happy to add more tests—thanks for the suggestion!
Good to know—I wanted to create a pull request after implementing the LAA field to get feedback, make sure I was on the right track, and keep the pull request size manageable. I will modify these changes so that vcf2zarr does not introduce local-allele fields unless the PL field is present in the input VCF. Once the LAA field changes look good, I can add the LPL field either as part of this pull request or a new pull request. As I was thinking about this, I realized that vcf2zarr should not compute and add the LAA field if the LAA field is already present in the input VCF. I need to make some changes for that. |
I made some changes, but I'm not quite ready for a new review. I want to try simplifying the |
a = (n - b * (b + 1) / 2).astype(int) | ||
else: | ||
# TODO: Handle all possible ploidy | ||
raise ValueError(f"Cannot handle ploidy = {ploidy}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is causing the coverage to fail. Let me know if you want me to add a unit test for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to cover this case all right. Can we generate a simple triploid one-line VCF with a PL and index on the fly, like we do in the simulated tests?
We can log an issue as something to do in a follow up, if it's too messy though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great @Will-Tyler, I think you've basically nailed it. Some minor comments.
Were you planning on adding LPL in for this PR, or doing in a follow up? I'm happy either way.
bio2zarr/cli.py
Outdated
@@ -149,6 +149,14 @@ def list_commands(self, ctx): | |||
help="An approximate bound on overall memory usage (e.g. 10G),", | |||
) | |||
|
|||
local_alleles = click.option( | |||
"--local-alleles", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe an explicit flag like @click.option('--local-alleles/--no-local-alleles', default=True)
would be better? I think this is what click likes, and the style I've adopted elsewhere in the CLI.
https://click.palletsprojects.com/en/8.1.x/options/#boolean-flags
# The last element of each sample's genotype indicates the phasing | ||
# and is not an allele. | ||
genotypes = variant.genotype.array()[:, :-1] | ||
genotypes.clip(0, None, out=genotypes) | ||
genotype_allele_counts = np.apply_along_axis( | ||
np.bincount, axis=1, arr=genotypes, minlength=allele_count | ||
) | ||
allele_counts += genotype_allele_counts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
a = (n - b * (b + 1) / 2).astype(int) | ||
else: | ||
# TODO: Handle all possible ploidy | ||
raise ValueError(f"Cannot handle ploidy = {ploidy}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to cover this case all right. Can we generate a simple triploid one-line VCF with a PL and index on the fly, like we do in the simulated tests?
We can log an issue as something to do in a follow up, if it's too messy though.
@@ -458,6 +490,87 @@ def sanitise_value_int_2d(buff, j, value): | |||
buff[j, :, : value.shape[1]] = value | |||
|
|||
|
|||
def compute_laa_field(variant) -> np.ndarray: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to add a quick summary of what we're doing here, as the process is somewhat involved (necessarily). So, we're talking the local alleles to be anything that's in the genotypes, or has an allele depth of > 0, or is referenced in the PL field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct. I suppose that ideally, the code would infer which alleles are observed based on all the fields that can be localized. If we wanted to start localizing another genotype field in the future, we would need to update compute_laa_field
to make sure that the LAA value contains all the alleles used in the genotype field.
In that case, I would like to add LPL in a follow-up PR. I think this one is large enough as is. |
Great, let's merge. Can you rebase please? |
Only add LAA field if not present Infer observed alleles from PL field
a8ca97d
to
394b2ec
Compare
Should be rebased now |
Thank you for reviewing!
… Le 10 juil. 2024 à 20:09, Jerome Kelleher ***@***.***> a écrit :
Merged #266 <#266> into main.
—
Reply to this email directly, view it on GitHub <#266 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AFTQUPJ4UGZBQ7MRGKUVBD3ZLWIHLAVCNFSM6AAAAABKKA6WJ2VHI2DSMVQWIX3LMV45UABCJFZXG5LFIV3GK3TUJZXXI2LGNFRWC5DJN5XDWMJTGQ3DCNJRG43TSNY>.
You are receiving this because you were mentioned.
|
Thanks for contributing! |
Description
In this pull request, vcf2zarr computes a genotype-level, local alleles field,
LAA
, during the explode step unless specifically told not to. TheLAA
field is a list of one-based indices into the variant-levelALT
field that indicates which alleles are relevant (local) for the current sample. The source of truth for theLAA
field is the GT, AD, and PL genotype fields.vcf2zarr does not try to compute the LAA field if the LAA field is already present in a genotype entry.
With these changes, vcf2zarr computes the
LAA
field by default. To prevent vcf2zarr from introducing the LAA field, the user can specify--local-alleles false
when running the vcf2zarr explode CLI.This pull request makes progress on #185.
Testing
I updated and added several unit tests. I am happy to add more tests.
I was not able to use the bcftools tag2tag plugin as discussed in #185 because generating local-allele fields is not actually implemented in the plugin yet. (See here.)
Discussion
Format of the
call_LAA
fieldTo simplify the format of the
call_LAA
field, vcf2zarr always assigns three-dimensional values to theLAA
field, filling space with the fill value,-2
, where necessary. If none of the entries use alternate alleles, then each entry'sLAA
is[-2]
. When an entry does use alternate alleles, the alternate allele indices come first in sorted order followed by any necessary fill values. For example,[2, 5, 6, -2, -2]
is a validLAA
value.Deciding when to localize
The issue (#185) seems to focus on the PL field. vcf2zarr only introduces the local-allele fields if the
PL
field is present in the VCF file, and the user has not specified to not introduce local-allele fields.Inferring which alleles are used in the PL field
When the ploidy is one, the index of the PL corresponds to the index of the allele.
When the ploidy is two, then the index of the PL value corresponds to a diploid genotype. Given a positive PL value, we want to determine which alleles are in the genotype associated with the PL value. Given a genotype$a/b$ , the index of the associated PL value is $$\frac{b(b+1)}{2} + a.$$
Suppose the index of the positive PL value is$n$ . To determine $b$ , we can momentarily set $a = b$ and solve $$n = \frac{b(b+1)}{2} + b.$$
We have$$2n = b^2 + 3b$$ $$\implies 2n + \frac{9}{4} = b^2 + 3b + \frac{9}{4}$$ $$\implies 2n + \frac{9}{4} = \left( b + \frac{3}{2} \right)^2$$ $$\implies \sqrt{2n + \frac{9}{4}} = b + \frac{3}{2}$$ $$\implies b = \sqrt{2n + \frac{9}{4}} - \frac{3}{2}.$$
Since$a$ can be less than $b$ , we take the ceiling: $$b = \left\lceil \sqrt{2n + \frac{9}{4}} - \frac{3}{2} \right\rceil .$$
Determining$a$ is then simple: $$a = n - \frac{b(b+1)}{2}.$$
References
LAA
field as well as the other local-allele genotype fields. The latest version, v4.5, was apparently finalized and published last week.