Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

created a super minimal mash function for sketching sequences. #344

Merged
merged 15 commits into from
Oct 5, 2023

Conversation

TimothyStiles
Copy link
Collaborator

This package is meant to help create "sketches" via the mash sketching algorithm. There's plenty of room for optimization here but I've implemented the core idea.

Mash: fast genome and metagenome distance estimation using MinHash.

Ondov, B.D., Treangen, T.J., Melsted, P. et al.
Genome Biol 17, 132 (2016).
https://doi.org/10.1186/s13059-016-0997-x

Mash Screen: high-throughput sequence containment estimation for genome discovery.

Ondov, B., Starrett, G., Sappington, A. et al.
Genome Biol 20, 232 (2019).
https://doi.org/10.1186/s13059-019-1841-x

mash/mash.go Outdated
Comment on lines 16 to 17
The idea is that we can sketch a sequence of data, and then compare the sketch to other sketches to see how similar they are.
This saves a bunch of computation time and memory, because we don't have to compare the entire sequence to another sequence.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to have some context on how this differs from algorithms like minimap2. Computation upfront?

mash/mash.go Outdated

import "github.com/spaolacci/murmur3"

// sketch algorithm
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to have more context for this. Where is the sketch algorithm?

Comments should go up to 80char

For this long of multiline, especially if just context, should have a multiline comment style

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to have multiline comments that start each line with //, and IMO we should standardize this across the repo to be one way or the other - not a huge deal though.

mash/mash.go Outdated Show resolved Hide resolved
mash/mash.go Show resolved Hide resolved
mash/mash.go Outdated
}
}

func (m *Mash) Sketch(sequence string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the NewMash function? If so, a function description string would be nice

mash/mash.go Outdated Show resolved Hide resolved
mash/mash.go Outdated Show resolved Hide resolved
mash/mash.go Outdated
Comment on lines 90 to 92
// for each kmer in the window, hash it to a 32 bit number
// keep the minimum hash value of all the kmers in the window up to a given sketch size
// the sketch is a vector of the minimum hash values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this?

Comment on lines 9 to 20
func TestMash(t *testing.T) {
fingerprint1 := mash.NewMash(17, 10)
fingerprint1.Sketch("ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA")

fingerprint2 := mash.NewMash(17, 10)
fingerprint2.Sketch("ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA")

distance := fingerprint1.Distance(fingerprint2)
if distance != 0 {
t.Errorf("Expected distance to be 0, got %f", distance)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No tests for affirming the algorithm actually works on more complicated sequences or in the null case

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Any ideas for a good test case @Koeng101?

mash/mash.go Outdated Show resolved Hide resolved
@carreter carreter added the draft label Sep 16, 2023
@Koeng101
Copy link
Contributor

7zdkzz

@TimothyStiles
Copy link
Collaborator Author

adding the help wanted tag because @Koeng101 wants this merged but the following still needs to be done and I won't be able to get around to it for at least a week.

  1. Keep hashes sorted for faster search
  2. Figure out where this belongs on the package level e.g seqhash or maybe a new search package?
  3. Better comments and docs in general
  4. 100% test coverage and more robust test case.

@TimothyStiles TimothyStiles added the help wanted Extra attention is needed label Sep 21, 2023
@carreter carreter removed the draft label Sep 23, 2023
@carreter
Copy link
Collaborator

Linking this to #356

Copy link
Contributor

@soypat soypat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've suggested a radically different API for performance and future proof reasons. Let me know what y'all think.

mash/mash.go Outdated Show resolved Hide resolved
mash/mash.go Outdated
Sketches []uint32
}

func NewMash(kmerSize uint, sketchSize uint) *Mash {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to me that this API could be simplified quite a lot and be made more direct and performant:

// WriteSketches writes `n` sketches into dstSketches by splitting sequence
// into chunks of `kmerSize`.
func WriteSketches(dstSketches []uint32, sequence string, kmerSize int) (n int, _ error) {
 
} 

This would also give us leeway by not exposing leaky abstractions in case in the future we realize we want to be able to configure how we write our Sketches with a Masher/Mash type which takes several configuration parameters which may not be apparent to us today.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I'd avoid uint in size definitions to catch programmer errors early. Having sizes as uint can do more harm than good i.e:

sketchSize := uint(len(sketchPool) - totalProcessed)

In the code above the user calculates the sketch size from a pair of ints. If the user is mistaken and the result of the calculation is negative then uint converts the negative number into a massively huge number which would cause a out of memory panic.

Compared to negative size panics the OOM panic can affect other goroutines running in the same program.

mash/mash.go Outdated
// count the number of hashes that are the same
// divide the number of hashes that are the same by the total number of hashes
// the result is the distance between the two sequences
func (m *Mash) Distance(other *Mash) float64 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the above change Distance could be rewritten as

func Distance(a, b []uint32) float64

mash/mash.go Outdated
func (m *Mash) Distance(other *Mash) float64 {
var sameHashes int
for i := 0; i < len(m.Sketches); i++ {
for j := 0; j < len(other.Sketches); j++ {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This nested loop could be changed for two separate loops, one where the shorter sketch slice is stored into a map[uint32]struct{} type and the other where they are checked to see if other long uint32s slice contains entries which are in the map. Would speed up for large sketches.

@Koeng101
Copy link
Contributor

I really like @soypat changes in this case. They make a lot of sense to me

@TimothyStiles TimothyStiles mentioned this pull request Sep 27, 2023
@TimothyStiles TimothyStiles marked this pull request as ready for review October 5, 2023 22:37
@TimothyStiles TimothyStiles merged commit 2271b5d into main Oct 5, 2023
@delete-merged-branch delete-merged-branch bot deleted the minhash-forest branch October 5, 2023 22:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants