Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a new iterator to skip "bad" bases without canonicalization #40

Open
apcamargo opened this issue Feb 8, 2020 · 3 comments
Open

Comments

@apcamargo
Copy link

Hi Roderick!

I'm developing a k-mer counting Python package for internal usage and I'm using needletail as a backend. While developing it, I noticed that Kmers and CanonicalKmers are inconsistent regarding non-ATCG characters. While Kmers count them, they are skipped by CanonicalKmers (understandably so).

Because of that, my function only uses CanonicalKmers even when counting non-canonical k-mers (I just reverse complement the sequence if canonical boolean is true), which causes additional computational burden.

I don't know if this decision was made by design, but maybe Kmers should include an argument that allows the user to choose whether non-ATCG characters should be ignored.

Thank you for all your work in needletail!

@bovee
Copy link
Contributor

bovee commented Feb 12, 2020

Hi Antônio! You're welcome and thanks for opening this issue.

I think you're right that there's a gap between CanonicalKmers (which fits some of our specific use-cases) and Kmers (which is a much more general method that can feed into many different bioinformatics use-cases). I would hesitate to change the API on Kmers and I think CanonicalKmers is also a bit tricky to change because of how the reverse complement is passed as an argument, but I think adding another iterator would be completely appropriate?

Perhaps a SkipKmers struct with a new(buffer: &'a [u8], strict: bool, nucleic: bool, k: u8) method and Iterator::Item is (usize, &'a [u8]) would be the right API? strict could act as a switch for whether to allow non-ACGT IUPAC bases (maybe it should be called iupac instead?) and nucleic would determine which alphabet to use (nucleic vs amino acid).

@apcamargo
Copy link
Author

Hi @bovee

Yes, I think a SkipKmers with the options you described would pair nicely with CanonicalKmers without changing the current API.

I'm unsure whether nucleic is a good name for an option that may trigger amino acid k-mer counting. Maybe alphabet would be a better option, but then it wouldn't make sense for it to be a bool.

@bovee
Copy link
Contributor

bovee commented Feb 12, 2020

I like the idea of an alphabet parameter replacing the two above, but I agree that it would make sense to introduce some kind of Enum for that. I've avoided that previously because that kind of construct can devolve into kind of a mess (e.g. BioPython's: https://biopython.org/DIST/docs/api/Bio.Alphabet-module.html ), but if the only values were Nucleic, IupacNucleic, and AminoAcid (with maybe something around gapped or DNA/RNA specific alphabets if necessary) I think that'd be reasonable.

@bovee bovee changed the title Inconsistency between Kmers and CanonicalKmers Add a new iterator to skip "bad" bases without canonicalization Feb 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants