
bit_kmers function will panic at 'attempt to multiply with overflow', if k-mer length is longer than 32 bp. #58

Open
yiolino opened this issue Mar 29, 2022 · 7 comments


yiolino commented Mar 29, 2022

Thank you for the great software.

I want to use the bit_kmers function to count k-mers in a FASTQ file.
However, it panics if the k-mer length is longer than 32 bases.

I think this is because the bit k-mer sequence is represented as a u64 (type BitKmerSeq = u64). Is there a way to count k-mers longer than 32 bp?

I am using a HashMap as the database for k-mer counts, but with FASTQ files as input the HashMap becomes too large.
So I would like to use bit k-mers to shrink it as much as possible. Is there an alternative worth considering, such as using the u128 type?

Regards,
tetsuro90

@yiolino yiolino changed the title bit_kmers function will panick at 'attempt to multiply with overflow', if k-mer length is longer than 32 bp. bit_kmers function will panic at 'attempt to multiply with overflow', if k-mer length is longer than 32 bp. Mar 29, 2022
Contributor

Keats commented Mar 29, 2022

Hi,

We are probably going to revamp the bit k-mers we have, as we just built a pretty huge project using needletail and ended up re-implementing the bit encoding to be more flexible.
You're right that it would need u128 instead of u64 for storing kmers > 31bp. For now I would recommend writing your own 2-bit encoding function that returns a u128 rather than using the built-in bit kmers iterator.
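
A minimal sketch of what such a hand-written 2-bit encoder could look like, assuming the mapping A=0, C=1, G=2, T=3 and a rolling window over the raw sequence bytes. The names `encode_base` and `count_kmers` are hypothetical and not part of needletail. A u128 has room for 64 bases at 2 bits each, so this handles k up to 64. Windows containing any other character (for example N) are skipped, and canonical (reverse-complement) k-mers are not handled.

```rust
use std::collections::HashMap;

/// Map one nucleotide to its 2-bit code (assumed mapping: A=0, C=1, G=2, T=3).
/// Returns None for any other byte, such as N.
fn encode_base(b: u8) -> Option<u128> {
    match b {
        b'A' | b'a' => Some(0),
        b'C' | b'c' => Some(1),
        b'G' | b'g' => Some(2),
        b'T' | b't' => Some(3),
        _ => None,
    }
}

/// Count the 2-bit-encoded k-mers of `seq` into `counts`.
/// Hypothetical helper: k must be in 1..=64 so that 2*k bits fit in a u128.
fn count_kmers(seq: &[u8], k: usize, counts: &mut HashMap<u128, u32>) {
    assert!((1..=64).contains(&k), "k must be in 1..=64 to fit in a u128");

    // Mask that keeps only the low 2*k bits; k == 64 uses the whole u128.
    let mask: u128 = if k == 64 {
        u128::MAX
    } else {
        (1u128 << (2 * k)) - 1
    };

    let mut kmer: u128 = 0;
    let mut valid = 0usize; // consecutive valid bases seen since the last reset

    for &b in seq {
        match encode_base(b) {
            Some(code) => {
                // Shift the window left by one base and append the new code.
                kmer = ((kmer << 2) | code) & mask;
                valid += 1;
                if valid >= k {
                    *counts.entry(kmer).or_insert(0) += 1;
                }
            }
            None => {
                // An ambiguous base invalidates every window that contains it.
                kmer = 0;
                valid = 0;
            }
        }
    }
}
```

Calling `count_kmers(record_bytes, 40, &mut counts)` once per read with a shared `HashMap<u128, u32>` keeps each key at 16 bytes regardless of k, which is where the memory saving over string keys would come from.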

Author

yiolino commented Mar 29, 2022

@Keats
Thank you for your reply.

I understand, and I'm looking forward to the re-implemented bit encoding!

Regards,


natir commented Mar 29, 2022

You can check out the bit encoding in kmers: https://github.com/COMBINE-lab/kmers

Contributor

Keats commented Mar 29, 2022

I will have to take a look at that, @natir!
In our program we have some stuff that wouldn't make sense in a public library (e.g., we also do some 3-bit encoding to encode ATCG$). I'm leaning toward adding some basic encoding utilities to needletail and letting people do whatever they need on the raw sequence, rather than having a built-in opinionated iterator.

Author

yiolino commented Mar 29, 2022

@natir
I'll use that. Thank you!


natir commented Mar 29, 2022

Sure @Keats, my message was meant more for @tetsuro90 than for you; maybe kmers matches @tetsuro90's requirements.

Contributor

Keats commented Mar 29, 2022

Yeah, I understood. I just meant that I need to look at kmers before making changes to needletail, to see if we can consolidate somehow.
