-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SVCR Local alleles #185
Comments
I am interested in working on this task! I looked into this some—the local alleles approach makes sense to me. For each variant site, we add a list of global alleles, which are all the possible alleles seen in the samples at that site. For each variant and sample, we add a field called Some of the datasets in the unit tests in I assume the VCF Zarr spec should be updated if this approach is implemented. |
That would be AMAZING @Will-Tyler! I think the best first step would be to work up an example based on Figure 3b and 3c in the preprint. We would just want the three sites in the original VCF, and use the local alleles values (and LPL) from that example. Maybe a good start here would be to write out the values for the last variant (POS=3350) explicitly here, so we can understand what's going on, and then use these as the basis for a test case afterwards? |
Thanks for the suggestion on how to get started. The PVCF representation in the example (3b) is
The last variant site in the SVCR representation (3c) is
Using just the LA and LPL values (and not using the reference blocks), the representation becomes
My understanding is that the LA field is the indices of the alleles with positive, nonzero AD values, which in most cases will be two alleles for diploid chromosomes but could technically be more. In other words, the source of truth for the LA field is the AD field. If the source of truth for the LA field was the GT field, which contains at most two alleles, then it seems like the LA field would be a redundant form of the GT field? |
The local alleles approach was incorporated into the draft VCF 4.5 specification in this pull request. The draft specification mandates that if any local-allele fields are present, then the LAA field must be present. The LAA field is a list of the one-based indices of the alternate alleles observed in the sample. It seems to me like a good next step would be to add functionality to bio2zarr so that bio2zarr infers which alternate alleles are observed in a sample and creates an LAA field during the explode step. bio2zarr can infer which alternate alleles are observed in a sample based on the GT and AD fields for that sample. |
Note that bcftools contains a tag2tag plugin that lets the user interconvert between global fields and local-allele fields in VCF files. With tag2tag being available, is it still desired to implement a local-alleles approach in bio2zarr? Or would it suffice to convert the input VCF file to local-allele fields using bcftools before using bio2zarr? If this functionality is needed in bio2zarr, should it be its own command (for example, |
This is super helpful, thanks @Will-Tyler! Here's how I think it should work. By default, if we see a PL field in the VCF (maybe later I guess any other Number=G fields) we create The reason for doing it this way is that the For testing, we results of running our code on the original VCF should be identical to the results of Does this make sense? |
Overall, I think this makes sense and gives me good direction. When I was looking into this though, I thought it would be ideal to implement the global-to-local conversion during the The testing approach is a good idea—thanks for this suggestion! |
Actually you're right, it should be done during explode. Are you fairly clear on next steps or would you like some pointers? You've picked up the code base very quickly, I'm impressed! |
Thanks! 😄 I think I should be able to make progress from here. My plan is to start by creating the |
We need to implement the "local alleles" approach to deal with overblown PL fields.
See discussion: sgkit-dev/vcf-zarr-publication#5
The text was updated successfully, but these errors were encountered: