Automate determining coverage minimum and maximum coverage #15

Open
Takadonet opened this issue Oct 23, 2017 · 4 comments
@Takadonet
Contributor

Instead of asking the end user for these values, we should determine genome coverage from the reads themselves.

@mgopez
Member

mgopez commented Nov 8, 2017

Current Solution (working):

  1. Estimate the expected k-mer coverage depth by finding the peak (maximum) of the k-mer coverage depth distribution.

  2. Calculate the error rate using

num_kmers_appear_once / num_unique_kmers

We do it this way because we assume that almost every k-mer that appears exactly once is an error.

  3. Multiply the error rate by the k-mer coverage depth to get the (slightly underestimated) expected k-mer coverage depth of errors.

  4. Use a Poisson distribution, with the expected coverage value and the k-mer coverage depth of errors, to find the minimum depth at which we can be confident that an observation is not caused entirely by errors.

  • This value is min_kmer_coverage.
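
A minimal Python sketch of the steps above, under stated assumptions. It is not the bio_hansel code. It assumes the input is a k-mer depth histogram given as a dict mapping depth to the number of distinct k-mers seen at that depth, and that num_unique_kmers means the total number of distinct k-mers. Two further choices are guesses at details the steps leave open: the main peak is found by first skipping the falling low-depth error peak, and the Poisson step is read as an upper-tail test with a hypothetical significance level `alpha`. The function names are made up for illustration.

```python
import math


def peak_depth(hist):
    """Depth of the main histogram peak (step 1).

    Assumption: low-depth error k-mers form a falling slope at the left of
    the histogram, so we walk right until counts stop falling and take the
    maximum from there.
    """
    depths = sorted(hist)
    i = 0
    while i + 1 < len(depths) and hist[depths[i + 1]] < hist[depths[i]]:
        i += 1
    return max(depths[i:], key=lambda d: hist[d])


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - cdf


def min_kmer_coverage(hist, alpha=0.01):
    """Smallest depth unlikely to be reached by error k-mers alone."""
    peak = peak_depth(hist)
    # Step 2: num_kmers_appear_once / num_unique_kmers
    # (assumption: denominator is the count of distinct k-mers).
    error_rate = hist.get(1, 0) / sum(hist.values())
    # Step 3: expected coverage depth of error k-mers.
    lam_err = error_rate * peak
    # Step 4 (one possible reading): first depth whose Poisson upper-tail
    # probability under the error model drops below alpha.
    depth = 1
    while poisson_sf(depth, lam_err) >= alpha:
        depth += 1
    return depth


# Toy histogram: an error slope at depths 1-3 and a genomic peak at depth 30.
hist = {1: 100, 2: 20, 3: 5, 28: 200, 29: 400, 30: 600, 31: 400, 32: 200}
print(min_kmer_coverage(hist))
```

Only the standard library is used here. A lower `alpha` gives a more conservative (higher) threshold.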

@mgopez
Member

mgopez commented Nov 8, 2017

We decided not to implement automatic determination of the maximum k-mer coverage, as we don't have a good reason to.

@mgopez
Member

mgopez commented Dec 19, 2017

Since there is a plan to remove the Jellyfish dependency, we will no longer be able to calculate a minimum k-mer coverage value. Leaving this issue open in the meantime, but no further progress is being made on it.

@peterk87
Contributor

We can leave the Jellyfish dependency in for now and merge the automatic min k-mer threshold system into bio_hansel, with the caveat that to use the automatic threshold, the user will need to run Jellyfish and accept that the analysis will take longer and use more computational resources.

Automatically determining the min coverage depth could be useful for other applications, such as setting the min coverage for some de novo assemblers or the min k-mer frequency when running Mash with reads, as well as for setting the min k-mer threshold for bio_hansel.

So the code you've written could be extracted into a separate package and implemented as a generic tool with a wrapper for Galaxy if there's a good use case for it, which I think there is.
