Can we also extract the location of the unitigs? #35

Open
rickbeeloo opened this issue Sep 17, 2020 · 16 comments

Comments

@rickbeeloo

I just took a look at the GFA output and noticed that, unlike tools such as seqwish, Bifrost does not output the paths corresponding to each node. Is it possible to obtain the locations of each of the unitigs in the original input sequences?

@GuillaumeHolley
Collaborator

Right now, this is not available out of the box in the binary but I could implement that if there is a need for it.

@rickbeeloo
Author

rickbeeloo commented Sep 17, 2020

I think it would be an awesome addition as this would allow us to quickly identify similar regions between genomes - like local MSAs - which for large sequence collections is infeasible without a graph-based approach. With the current implementation, we do not know the origins of the unitigs.

BlastFrost
Couldn't this be implemented on top of BlastFrost, since it does output coordinates? I see this was also suggested in #3.

Pyfrost
I also took a look at pyfrost, but I'm not sure whether this would be possible with it. pyfrost only seems to record the positions of the individual k-mers within the unitig, rather than extending this to the boundaries of the unitig, i.e. a record like: unitig, seq_id, start, end
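To make the requested record format concrete, here is a minimal sketch that produces `unitig, seq_id, start, end` rows by naive exact string matching on both strands. It does not use Bifrost, BlastFrost or pyfrost; the function names, the extra strand column and the toy sequences are assumptions made for illustration, and it only finds unitigs that occur in full in an input sequence.

```python
def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def locate_unitigs(unitigs, sequences):
    """Return (unitig_id, seq_id, start, end, strand) rows.

    Coordinates are 0-based and half-open. Matching is exact, so this is
    only practical for small inputs.
    """
    rows = []
    for uid, unitig in unitigs.items():
        rc = reverse_complement(unitig)
        # A unitig equal to its own reverse complement is searched once.
        if rc == unitig:
            orientations = [("+", unitig)]
        else:
            orientations = [("+", unitig), ("-", rc)]
        for seq_id, seq in sequences.items():
            for strand, pattern in orientations:
                pos = seq.find(pattern)
                while pos != -1:
                    rows.append((uid, seq_id, pos, pos + len(pattern), strand))
                    pos = seq.find(pattern, pos + 1)
    return rows


# Hypothetical toy data, not taken from a real Bifrost graph.
unitigs = {"u1": "ACGTT", "u2": "GGA"}
sequences = {"genomeA": "TTACGTTGGA", "genomeB": "TCCAACGT"}
rows = locate_unitigs(unitigs, sequences)
for row in rows:
    print(*row, sep="\t")
```

For large collections an index-based approach would be needed; the point here is only the shape of the output.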

@lrvdijk

lrvdijk commented Sep 18, 2020

That's a disadvantage of De Bruijn graphs in general: you lose the navigational data needed to reconstruct the original sequences that went in.

Our group is planning to implement links [1] on top of Bifrost, which will help with that. Expect an early release at the end of this year, or maybe early next year.

@rickbeeloo
Author

@lrvdijk Aah, interesting! Aren't the origins of the unitigs encoded in the color binary, though? Or is only the presence of the individual k-mers within the unitigs recorded?

@ekg

ekg commented Sep 19, 2020

> That's a disadvantage of De Bruijn graphs in general, you lose that kind of navigational data to reconstruct the original sequences that went in.
>
> Our group is planning to implement links 1 on top of Bifrost, which will help with that, expect early release end of this year, or maybe early next year.

Nice! This will be really useful. I was wondering why the method hadn't caught on.

@GuillaumeHolley
Collaborator

Hey everyone,

I'm adding the feature to my todo list :) To answer your question @rickbeeloo, only the presence/absence of the individual k-mers is recorded in the color file.

@rickbeeloo
Author

Hi @GuillaumeHolley @lrvdijk, would it (for now) be possible to get the k-mer/unitig paths along the input genomes by querying the first part of the genome, let's say 5 kb, and then traversing all edges (with the input genome's color) until there are no edges left (i.e. the end of the genome)?
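A related idea can be sketched in a few lines. Note that this is not the color-guided edge traversal described above: instead of following edges, it looks up every k-mer of the genome in a k-mer index of the unitigs and collapses consecutive hits into oriented unitig steps. It does not use the Bifrost API; the function names, the toy unitigs and `k = 4` are all assumptions for illustration.

```python
def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def index_kmers(unitigs, k):
    """Map each k-mer, on both strands, to (unitig_id, strand, offset)."""
    index = {}
    for uid, seq in unitigs.items():
        n = len(seq) - k + 1  # number of k-mers in this unitig
        for i in range(n):
            kmer = seq[i:i + k]
            index[kmer] = (uid, "+", i)
            # Read on the reverse strand, the same k-mer sits at the mirrored offset.
            index.setdefault(reverse_complement(kmer), (uid, "-", n - 1 - i))
    return index


def genome_path(genome, index, k):
    """Collapse the genome's k-mer walk into a list of oriented unitig steps."""
    path, prev = [], None
    for i in range(len(genome) - k + 1):
        hit = index.get(genome[i:i + k])
        if hit is None:
            raise KeyError(f"k-mer at position {i} is not in the graph")
        uid, strand, offset = hit
        # Same unitig, same strand, next offset: still inside the current step.
        continues = (
            prev is not None
            and prev[:2] == (uid, strand)
            and offset == prev[2] + 1
        )
        if not continues:
            path.append(uid + strand)
        prev = hit
    return path


# Hypothetical toy unitigs; adjacent unitigs overlap by k - 1 = 3 bases.
k = 4
unitigs = {"u1": "ACCGT", "u2": "CGTAA", "u3": "CGTCA"}
index = index_kmers(unitigs, k)
print(genome_path("ACCGTAA", index, k))  # forward walk through u1 then u2
print(genome_path("TTACGGT", index, k))  # the same genome, reverse-complemented
```

Because every k-mer of an input genome is in the graph by construction, the lookup never has to guess which branch to take, which sidesteps the ambiguity of choosing among outgoing edges that share a color.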

@ekg

ekg commented Oct 2, 2020 via email

@rickbeeloo
Author

@ekg Sorry, I'm not sure what you mean by "reference graph". We can get the unitigs from the Bifrost graph (e.g. via unitig-caller) and map them to the original input genome sequence(s), but you are talking about a graph?

@ekg

ekg commented Oct 2, 2020

Sorry, I meant reference genome. (I'm used to working with reference graphs.)

@rickbeeloo
Author

@ekg This indeed would work for large unitigs that can be unambiguously mapped - thus mostly highly conserved or accessory genes between a set of genomes. However, for genes in the middle (i.e. shared regions but variable parts) the unitigs will be short and linked via k-mers with different colors that cannot be unambiguously mapped to the reference genomes.

@bioinformagica

Hello everyone, is this feature already implemented? This would be very cool.

@GuillaumeHolley
Collaborator

As of now, this is not implemented in Bifrost, but I have been thinking more and more about doing it soon for reference sequences. I will push this to the top of my todo list.

@bioinformagica

Hi, thanks for the quick reply, I'm really glad to hear it!

For a project I'm doing, I have to do a lot of gene alignments to create MSA gene reference plots using vg construct. My idea is to skip the slow gene alignment step and create GFA files directly from the unaligned gene multi-FASTA with bifrost build. But to do that, the final graph must have path information, so I can do cool things like extract node depth, calculate the distance between paths and create presence/absence tables of nodes.
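As a rough sketch of that downstream use: assuming the GFA contained GFA1 `P` (path) lines, node depth and a presence/absence table could be derived as below. The function names and the toy GFA text are hypothetical, and Bifrost did not emit path lines at the time of this thread.

```python
from collections import Counter


def parse_paths(gfa_text):
    """Return {path_name: [segment_name, ...]} from GFA1 P lines."""
    paths = {}
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "P" and len(fields) >= 3:
            # Each step is a segment name followed by '+' or '-'.
            paths[fields[1]] = [step[:-1] for step in fields[2].split(",")]
    return paths


def node_depth(paths):
    """Count how many path steps visit each node."""
    depth = Counter()
    for steps in paths.values():
        depth.update(steps)
    return depth


def presence_absence(paths):
    """Return {path_name: {node: 0 or 1}} over all nodes seen in any path."""
    nodes = sorted({node for steps in paths.values() for node in steps})
    return {
        name: {node: int(node in steps) for node in nodes}
        for name, steps in paths.items()
    }


# Hypothetical toy GFA with two paths over three segments.
gfa_text = (
    "H\tVN:Z:1.0\n"
    "S\t1\tACCGT\n"
    "S\t2\tCGTAA\n"
    "S\t3\tCGTCA\n"
    "P\tgenomeA\t1+,2+\t*\n"
    "P\tgenomeB\t1+,3+\t*\n"
)
paths = parse_paths(gfa_text)
depth = node_depth(paths)
table = presence_absence(paths)
print(dict(depth))
print(table)
```

Here depth counts path steps, so a node visited twice by one path counts twice; counting distinct paths per node would be an equally reasonable definition.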

@GuillaumeHolley
Collaborator

Nice, glad to hear you have a cool project in mind. I started a prototype implementation in Bifrost and one question came up. According to the GFA1 spec, the 4th field of the Path line is an optional comma-separated list of CIGAR strings, which can just be a * (basically no CIGAR provided). Do you need the CIGAR strings? I could do it, but it would make everything a lot more complicated. I am just wondering whether this is needed for the common use case and would justify the extra computation time.
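For reference, a GFA1 Path line has four tab-separated fields: the record type `P`, the path name, a comma-separated list of oriented segment names, and the overlaps field. A minimal sketch of a formatter that defaults the overlaps to `*` follows; the helper name `gfa1_path_line` is hypothetical and is not part of Bifrost.

```python
def gfa1_path_line(name, steps, overlaps=None):
    """Format a GFA1 P line.

    steps    -- list of (segment_name, orientation) pairs, orientation '+' or '-'
    overlaps -- optional list of CIGAR strings; '*' is written when omitted
    """
    for segment, orientation in steps:
        if orientation not in ("+", "-"):
            raise ValueError(f"bad orientation for segment {segment}: {orientation!r}")
    segment_field = ",".join(f"{segment}{orientation}" for segment, orientation in steps)
    overlap_field = ",".join(overlaps) if overlaps else "*"
    return "\t".join(["P", name, segment_field, overlap_field])


line = gfa1_path_line("genomeA", [("1", "+"), ("2", "-"), ("3", "+")])
print(line)
```

Writing `*` keeps the line valid GFA1 while skipping any per-step overlap computation.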

@bioinformagica

Thanks!!

> Nice, glad to hear you have a cool project in mind. I started an implementation prototype in Bifrost and one question came up. According to the GFA1 spec, the Path line contains as 4th field an Optional comma-separated list of CIGAR strings which can just be a * (basically no CIGAR provided). Do you need the CIGAR strings? I could do it but it would make everything a lot more complicated.

No, I don't need the CIGAR strings; the * would be fine.

> I am just wondering if this would be needed for the common use case and would justify the extra computation time.

Yeah, maybe path info is not what most people want. Maybe it could be added with an optional --add-paths argument?


5 participants