vg call strips path/contig info from vcf #4445

Open
CormacKinsella opened this issue Nov 14, 2024 · 4 comments

@CormacKinsella

1. What were you trying to do?

  • Produce a VCF from a mapped GAM
vg call graph.gbz --pack graph.pack --snarls graph.snarls --genotype-snarls --all-snarls --gbz-translation --gbz  > example.vcf

2. What did you want to happen?

  • I expected the VCF #CHROM column to retain the full contig/path name (which is in PanSN format), like the behaviour of vg deconstruct. e.g.:
#CHROM
simChimp#0#simChimp.chr6

3. What actually happened?

  • It stripped out the other info in the contig name, leaving the below
  • This means I can't pipe the vcf further into bcftools for normalisation vs the reference.fasta
#CHROM 
simChimp.chr6

5. What data and command can the vg dev team use to make the problem happen?

I did this using the simChimp example from Minigraph-Cactus, but I assume any GBZ with PanSN contig naming would reproduce it.

6. What does running vg version say?

v1.61.0 "Plodio"
@CormacKinsella
Author

I think this may be the same issue as #4442. I assumed I could run vg call without specifying a reference sample with -S (for a graph with only one ref sample), since according to the -p readme it should default to all reference paths:

-p, --ref-path NAME Reference path to call on (multipile allowed. defaults to all paths)

Cheers for any advice!

@glennhickey
Contributor

Yeah, it looks like vg call will only add the PanSN prefix if it thinks there could be ambiguity between different samples in the VCF. It's probably a good idea to add an option (like deconstruct has) to let the user force the issue, but in the meantime you'll have to use sed or something like that to add it yourself...
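
For anyone landing here, the workaround above could look roughly like the sketch below. It is not part of vg: `add_pansn_prefix` is a hypothetical helper name, the prefix `simChimp#0#` is taken from the example in this issue, and it assumes the only places the contig name needs fixing are the `##contig=<ID=...>` header lines and the first column of the records. It uses awk rather than sed so that only those two places are touched.

```shell
# Hypothetical helper (not a vg command): prepend a PanSN "sample#haplotype#"
# prefix to contig names in a VCF read from stdin.
# Assumes the prefix contains no "&" or "\" (awk treats those specially in sub()).
add_pansn_prefix() {
  awk -v p="$1" 'BEGIN { FS = OFS = "\t" }
    # Header contig declarations: ##contig=<ID=name,...>
    /^##contig=<ID=/ { sub(/^##contig=<ID=/, "##contig=<ID=" p); print; next }
    # All other header lines pass through unchanged
    /^#/ { print; next }
    # Records: prefix the CHROM column only
    { $1 = p $1; print }'
}

# Example use (file names are placeholders):
#   add_pansn_prefix 'simChimp#0#' < example.vcf > example.pansn.vcf
```

Doing this on the VCF keeps the reference FASTA untouched, so the names still match what the rest of a PanSN-based pipeline expects.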

@CormacKinsella
Author

Thanks Glenn! That sounds like it would be a pretty useful feature, as I'm not sure how to force sample ambiguity when there will only be one mapped sample handled per task.

In the meantime I will simplify the contig naming in the FASTA to only the contig name.

Cheers for your help

@CormacKinsella
Author

P.S. I realise my suggestion wouldn't work, as of course the PanSN notation is recreated in the pipeline. I'll do sed on the reference FASTA as you suggested.
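
A minimal sketch of the FASTA-side approach, in case it helps someone else: it writes a copy of the reference whose headers drop the leading `sample#haplotype#` part, so they match the bare names vg call emits. `strip_pansn_prefix` is a hypothetical helper name, and the sketch assumes every header follows the three-field PanSN pattern seen in this issue (e.g. `simChimp#0#simChimp.chr6`).

```shell
# Hypothetical helper: strip a leading PanSN "sample#haplotype#" prefix
# from FASTA header lines read on stdin; sequence lines are left as-is.
strip_pansn_prefix() {
  sed -E '/^>/ s/^>[^#[:space:]]+#[^#[:space:]]+#/>/'
}

# Example use (file names are placeholders):
#   strip_pansn_prefix < reference.fasta > reference.plain.fasta
```

Headers without a PanSN prefix pass through unchanged, since the substitution only fires when both `#`-delimited fields are present.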
