Estimate genome size automatically #14
Comments
This is a good suggestion. One question I have is how many extra parameters you think this would add, i.e., would I need a read-technology flag, plus the ability for people to vary the k-mer size? |
I would just try with
With |
Neat. Thanks for this, @tseemann. I will add it to the features planned for version 0.2.0. |
So I've done a lot of reading about different methods for estimating genome size, and I now feel more confused than I was beforehand. I played around with a few things but couldn't get anything working. The main issue stems from ploidy. Estimating genome size for haploids is much simpler than for non-haploids, and I don't want to support one form and not the other. I'm happy for someone to work on this if they like (I had the thought of using |
One potential approach here could be to enable providing a faidx index of the reference genome (instead of |
You mean just summing the lengths of all sequences in the index? |
Yes, that is the approach. |
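For anyone following along, here is a minimal sketch of that idea in Python, not how rasusa implements it. It assumes the standard samtools faidx layout, where each line is tab-separated and the second column is the sequence length. The function name `genome_size_from_fai` is made up for illustration.

```python
import io


def genome_size_from_fai(handle):
    """Sum the LENGTH column (second tab-separated field) of a .fai index.

    `handle` is any iterable of lines, e.g. an open file object.
    This helper is hypothetical and only illustrates the approach.
    """
    total = 0
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        fields = line.split("\t")
        total += int(fields[1])  # column 2 holds the sequence length in bases
    return total


# Usage with a small in-memory index (two made-up sequences)
example = io.StringIO("chr1\t1000\t6\t60\t61\nplasmid\t250\t1030\t60\t61\n")
print(genome_size_from_fai(example))
```

With a real index you would pass an open file handle for the `.fai` file instead of the `StringIO` object.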
@tomazberisa that should be easy enough to incorporate. See #31 |
In my assembler shovill I estimate the genome size from kmer frequencies and use that to subsample the reads to a fixed coverage (100x), much like rasusa does: https://github.com/tseemann/shovill/blob/master/bin/shovill#L145-L175
Would you consider adding --genome-size auto to rasusa?
For nanopore data one would want k<=15 and for illumina higher is better, say 24-32.
For illumina, the number of kmers with freq > 2 is a good estimate normally.
Not sure about nanopore.
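To make the "kmers with freq > 2" heuristic concrete, here is a rough Python sketch. It is an illustration only, not the shovill code linked above and not anything rasusa provides. It assumes the reads fit in memory as plain strings, counts canonical k-mers (the smaller of a k-mer and its reverse complement, which is an added assumption), and skips k-mers containing non-ACGT characters. The names `canonical` and `estimate_genome_size` are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_BASES = set("ACGT")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def estimate_genome_size(reads, k=25, min_freq=3):
    """Count distinct canonical k-mers seen at least `min_freq` times.

    min_freq=3 corresponds to "freq > 2": rare k-mers are assumed to be
    mostly sequencing errors and are ignored.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= _BASES:  # skip k-mers containing N or other symbols
                counts[canonical(kmer)] += 1
    return sum(1 for c in counts.values() if c >= min_freq)


# Toy usage: three identical short "reads" and a tiny k
print(estimate_genome_size(["ACGTACCA"] * 3, k=4))
```

An estimate like this could then be multiplied by the desired coverage (100x in the text above) to get a target number of bases for subsampling. A real implementation would more likely use a dedicated k-mer counter than an in-memory `Counter`.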