Estimate genome size automatically #14

tseemann · 2019-11-24T22:15:01Z

In my assembler shovill i estimate the genome size from kmer frequencies and use that to subsample the reads to a fixed coverage (100x) much like rasusa does:

https://github.com/tseemann/shovill/blob/master/bin/shovill#L145-L175

Would you consider adding --genome-size auto to rasusa ?

For nanopore data one would want k<=15 and for illumina higher is better, say 24-32.
For illumina, the number of kmers with freq > 2 is a good estimate normally.
Not sure about nanopore.

The text was updated successfully, but these errors were encountered:

mbhall88 · 2019-11-25T13:42:16Z

This is a good suggestion.

One question I would have about this is how many extra parameters do you think this would be likely to add? i.e would I need a read technology-related flag, plus the ability for people to vary the kmer size?

tseemann · 2019-11-26T02:37:54Z

I would just try with k=15 and see how it goes. For illumina data i would just use R1 and ignore R2 as it is noisier and won't add any information.

# actual genome size = 6092523
# R1 of illumina 150 PE data

% mash sketch -m 3 -k {13 15 17 19 21 25 29 32}

Estimated genome size: 5.26142e+06
Estimated genome size: 5.87121e+06  k=15
Estimated genome size: 5.91182e+06
Estimated genome size: 6.31256e+06 k = 19
Estimated genome size: 5.99984e+06
Estimated genome size: 6.54007e+06 k = 25
Estimated genome size: 6.19869e+06
Estimated genome size: 6.04152e+06 k=32

With -m 2 you get larger estimates. But usually all within 10%

mbhall88 · 2019-11-26T11:28:53Z

Neat. Thanks for this @tseemann . I will add it to the features planned for version 0.2.0

mbhall88 · 2020-10-14T04:46:42Z

So I've done a lot of reading about different methods for estimating genome size and I now feel more confused than I was beforehand. I played around with a few things but couldn't get anything working. The main issue stems from ploidy. Estimating genome size for haploids is much simpler than for non-hapolid and I don't want to support one form and not the other.

I'm happy for someone to work on this if they like (I had the thought of using finch-rs for the k-mer work) so I will leave this open - but I am unlikely to come back to this unless it becomes crucial for my own research. Requiring the genome size from the user doesn't seem too big an ask for this particular task.

tomazberisa · 2021-08-20T10:08:51Z

One potential approach here could be to enable providing a faidx index of the reference genome (instead of --genome-size) and use it to calculate genome size.

mbhall88 · 2021-08-22T23:55:22Z

You mean just summing the lengths of all sequences in the index?

tomazberisa · 2021-08-23T10:03:12Z

Yes, that is the approach.

mbhall88 · 2021-08-24T01:37:45Z

@tomazberisa that should be easy enough to incorporate. See #31

mbhall88 added the enhancement New feature or request label Nov 25, 2019

mbhall88 added this to the Release version 0.2.0 milestone Nov 26, 2019

mbhall88 modified the milestones: Release version 0.2.0, Release 0.3.0 Mar 27, 2020

mbhall88 removed this from the Release 0.3.0 milestone Oct 14, 2020

mbhall88 added help wanted Extra attention is needed wontfix This will not be worked on labels Oct 14, 2020

mbhall88 mentioned this issue Aug 24, 2021

Calculate genome size from an index file #31

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Estimate genome size automatically #14

Estimate genome size automatically #14

tseemann commented Nov 24, 2019

mbhall88 commented Nov 25, 2019

tseemann commented Nov 26, 2019 •

edited

Loading

mbhall88 commented Nov 26, 2019

mbhall88 commented Oct 14, 2020

tomazberisa commented Aug 20, 2021

mbhall88 commented Aug 22, 2021

tomazberisa commented Aug 23, 2021

mbhall88 commented Aug 24, 2021

Estimate genome size automatically #14

Estimate genome size automatically #14

Comments

tseemann commented Nov 24, 2019

mbhall88 commented Nov 25, 2019

tseemann commented Nov 26, 2019 • edited Loading

mbhall88 commented Nov 26, 2019

mbhall88 commented Oct 14, 2020

tomazberisa commented Aug 20, 2021

mbhall88 commented Aug 22, 2021

tomazberisa commented Aug 23, 2021

mbhall88 commented Aug 24, 2021

tseemann commented Nov 26, 2019 •

edited

Loading