Can we also extract the location of the unitigs? #35
Comments
Right now, this is not available out of the box in the binary but I could implement that if there is a need for it. |
I think it would be an awesome addition as this would allow us to quickly identify similar regions between genomes - like local MSAs - which for large sequence collections is infeasible without a graph-based approach. With the current implementation, we do not know the origins of the unitigs. BlastFrost Pyfrost |
That's a disadvantage of De Bruijn graphs in general: you lose the kind of navigational data needed to reconstruct the original sequences that went in. Our group is planning to implement links 1 on top of Bifrost, which will help with that. Expect an early release at the end of this year, or maybe early next year. |
@lrvdijk aah interesting! Aren't the origins of the unitigs encoded in the color binary, though? Or is only the presence of the individual k-mers within the unitigs recorded? |
Nice! This will be really useful. I was wondering why the method hadn't caught on. |
Hey everyone, I'm adding the feature to my todo list :) To answer your question @rickbeeloo, only the presence/absence of the individual k-mers is recorded in the color file. |
Hi @GuillaumeHolley @lrvdijk, would it (for now) be possible to get the k-mer/unitig paths along the input genomes by querying the first part of the genome - let's say 5kb - and then traversing all edges (with the input genome color) until there are no edges left (i.e. the end of the genome)? |
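The traversal idea in the comment above can be sketched on a toy example. This is not Bifrost code and does not use its API: `build_colored_kmers` and `walk_color` are hypothetical helpers, the graph is a plain dictionary of k-mers, and only the forward strand is considered. The sketch suggests where such a walk may get stuck: wherever a genome repeats a (k-1)-mer, more than one successor carries the same color and the color alone cannot pick a branch.

```python
from collections import defaultdict


def build_colored_kmers(genomes, k):
    """Map each k-mer to the set of genome names (colors) that contain it."""
    colors = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(name)
    return colors


def walk_color(colors, start, color):
    """Follow same-colored, unvisited successors from `start`.

    Returns (path, ambiguous). The walk stops at a dead end (ambiguous is
    False) or at a branch where several successors share the color
    (ambiguous is True).
    """
    path, seen = [start], {start}
    while True:
        candidates = [path[-1][1:] + base for base in "ACGT"]
        nxt = [km for km in candidates
               if color in colors.get(km, ()) and km not in seen]
        if len(nxt) != 1:
            return path, len(nxt) > 1
        path.append(nxt[0])
        seen.add(nxt[0])


# Toy input (assumed for illustration): two genomes differing by one base.
genomes = {"g1": "ACGTTGCA", "g2": "ACGTAGCA"}
colors = build_colored_kmers(genomes, k=4)
path, ambiguous = walk_color(colors, "ACGT", "g1")
```

With this toy input the walk for `g1` reaches the end of the genome without branching. A genome such as `ACGTACGG` with k=3 stops immediately, because `ACG` is followed by both `CGT` and `CGG` in the same color.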
You could also map the unitigs back to your reference graph. This should
tell you their locations.
|
@ekg Sorry, I'm not sure what you mean by "reference graph"? We can get the unitigs from the Bifrost graph (e.g. via |
Sorry, I meant reference genome. (I'm used to working with reference graphs.) |
@ekg This indeed would work for large unitigs that can be unambiguously mapped - thus mostly highly conserved or accessory genes between a set of genomes. However, for genes in the middle (i.e. shared regions but variable parts) the unitigs will be short and linked via k-mers with different colors that cannot be unambiguously mapped to the reference genomes. |
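The mapping approach discussed above, and its limitation for short unitigs, can be illustrated with a minimal exact-match sketch. The helper names (`revcomp`, `find_all`, `locate_unitig`) and the toy sequences are assumptions made for illustration; a real workflow would more likely use a read mapper or an index than Python string search.

```python
def revcomp(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_all(text, pattern):
    """Return every start offset of `pattern` in `text`, overlaps included."""
    hits, i = [], text.find(pattern)
    while i != -1:
        hits.append(i)
        i = text.find(pattern, i + 1)
    return hits


def locate_unitig(genome, unitig):
    """Return a sorted list of (start, strand) for each exact occurrence."""
    hits = [(pos, "+") for pos in find_all(genome, unitig)]
    rc = revcomp(unitig)
    if rc != unitig:  # do not report a palindromic unitig twice
        hits += [(pos, "-") for pos in find_all(genome, rc)]
    return sorted(hits)


genome = "GATTACAGGGATTAC"               # toy genome, assumed for illustration
long_hits = locate_unitig(genome, "TTACAGG")   # a longer unitig: one location
short_hits = locate_unitig(genome, "ATTAC")    # a short unitig: two locations
```

A unitig with a single hit can be placed unambiguously. A unitig with several hits needs extra context, such as its neighbours in the graph, before it can be assigned to one position.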
Hello everyone, is this feature already implemented? This would be very cool. |
As of now, this is not implemented in Bifrost, but I have been thinking more and more about doing it soon for reference sequences. I will push this to the top of my todo list. |
Hi, thanks for the quick reply, I'm really glad to hear it! For a project I'm doing, I have to do a lot of gene alignments to create MSA gene reference plots using vg construct. My idea is to skip the slow process of doing gene alignments and create GFA files directly from the unaligned gene multifasta with bifrost build. But to do that, the final graph must have path information so I can do cool things like extract node depth, calculate the distance between paths and create tables of presence and absence of nodes. |
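The downstream uses mentioned above (node depth and presence/absence tables) could look roughly like this once a GFA1 file carries P lines. This is a sketch under assumptions: the function names and the toy GFA text are hypothetical, and it reads only the path name and segment list of each P line.

```python
from collections import Counter


def parse_paths(gfa_text):
    """Return {path_name: [segment_id, ...]} from the P lines of a GFA1 string."""
    paths = {}
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "P":
            # Each step is a segment id followed by its orientation (+ or -).
            paths[fields[1]] = [step[:-1] for step in fields[2].split(",")]
    return paths


def node_depth(paths):
    """Count how many times each segment is visited across all paths."""
    return Counter(seg for steps in paths.values() for seg in steps)


def presence_table(paths):
    """Return {segment_id: {path_name: bool}} for every segment on any path."""
    nodes = sorted({seg for steps in paths.values() for seg in steps})
    return {n: {p: n in steps for p, steps in paths.items()} for n in nodes}


# Toy GFA1 text, assumed for illustration.
gfa = "H\tVN:Z:1.0\nS\t1\tACGT\nP\tgA\t1+,2+,3+\t*\nP\tgB\t1+,3-\t*\n"
paths = parse_paths(gfa)
depth = node_depth(paths)
table = presence_table(paths)
```

Here segment `2` is visited by `gA` only, so its row in the table reads present for `gA` and absent for `gB`.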
Nice, glad to hear you have a cool project in mind. I started an implementation prototype in Bifrost and one question came up. According to the GFA1 spec, the Path line contains as 4th field an |
Thanks!!
No, I don't need the CIGAR string; the * would be fine.
Yeah, maybe path info is not what most people want. Maybe it could be added with an optional --add-paths arg? |
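For reference, a GFA1 P line with `*` in place of the CIGAR overlaps could be written as below. This is a sketch, not Bifrost output: `gfa_path_line` is a hypothetical helper and the path and segment names are made up. It relies only on the GFA1 layout of a P line (record type, path name, comma-separated oriented segments, overlaps).

```python
def gfa_path_line(name, steps, overlaps=None):
    """Format one GFA1 P line.

    `steps` is a list of (segment_id, orientation) pairs, with orientation
    "+" or "-". When `overlaps` is None the overlaps field is written as "*".
    """
    segments = ",".join(f"{seg}{orient}" for seg, orient in steps)
    overlap_field = "*" if overlaps is None else ",".join(overlaps)
    return "\t".join(["P", name, segments, overlap_field])


line = gfa_path_line("genomeA", [("1", "+"), ("2", "-"), ("3", "+")])
```

Passing a list of CIGAR strings as `overlaps` fills the last field instead of `*`, which would keep the same helper usable if overlaps were needed later.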
I just took a look at the GFA output and noticed that, unlike tools such as seqwish, Bifrost does not output the paths corresponding to each node. Is it possible to obtain the locations of each of the unitigs in the original input sequences?