Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize repeatedly serializing array bitmaps #364

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

richardartoul
Copy link
Contributor

@richardartoul richardartoul commented Aug 20, 2022

Benchmark before

goos: darwin
goarch: amd64
pkg: github.com/RoaringBitmap/roaring
cpu: Intel(R) Core(TM) i7-8569U CPU @ 2.80GHz
BenchmarkRepeatedSparseSerialization-8   	 3732640	       316.4 ns/op	      96 B/op	       5 allocs/op
PASS
ok  	github.com/RoaringBitmap/roaring	1.849s

Benchmark after

goos: darwin
goarch: amd64
pkg: github.com/RoaringBitmap/roaring
cpu: Intel(R) Core(TM) i7-8569U CPU @ 2.80GHz
BenchmarkRepeatedSparseSerialization-8   	 6733712	       178.6 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	github.com/RoaringBitmap/roaring	1.760s

So a little less than 2x as fast, but zero allocation now as well.

@richardartoul richardartoul changed the title [Proposal] Optimize repeatedly serializing array bitmaps [Proposal] [Draft] Optimize repeatedly serializing array bitmaps Aug 20, 2022
roaringarray.go Outdated Show resolved Hide resolved
roaring.go Show resolved Hide resolved
@richardartoul richardartoul changed the title [Proposal] [Draft] Optimize repeatedly serializing array bitmaps Optimize repeatedly serializing array bitmaps Aug 24, 2022
@lemire
Copy link
Member

lemire commented Sep 7, 2022

Effectively, this is an attempt at doing your own memory allocation. Such approaches can help, in specific scenarios, but they can also cause harm and increase complexity.

There is a no free lunch property: trying to beat the Go memory allocator from Go itself involves tradeoffs which, on average, are going to leave you in a bad shape.

I don't think we want to go there generally. However, if you are interested in extending the roaring bitmaps, in such a way that current users are totally unaffected by your changes, then that would be acceptable.

So, for example, I don't think we want to fatten and complexify the type Bitmap. If we ever change our types, we will make them simpler and more lightweight, not the other way around.

If you want to create a separate Bitmap factory that could recycle old bitmaps, and produce new ones by trying to recycle existing ones, that is fine.

So instead of doing...

 	for i := 0; i < b.N; i++ {
 		l.ClearRetainStructures()
 		for j := 0; j < 8; j++ {
 			l.Add(uint32(j))
 		}
 		buf.Reset()
 		_, err := l.WriteTo(buf)
 		if err != nil {
 			panic(err)
 		}
 	}

You could try something like this... (this is not meant to be actual Go code)

     bf := NewBitmapFactory()
     for i := 0; i < b.N; i++ {
               content := [...]uint32{0,1,2,3,4,5,6,7}
 		l := bf.NewBitmapFrom(content)
 		buf.Reset()
 		_, err := l.WriteTo(buf)
 		if err != nil {
 			panic(err)
 		}
                bf.Recycle(l)
 	}
 }

So if you have to frequently create new small bitmaps from some content, then you could just store some containers in your factory and then produce a new bitmap from the available poll.

If there are bugs or inefficiencies, or complexity in a BitmapFactory instance, they would not affect the other users.

@richardartoul
Copy link
Contributor Author

richardartoul commented Sep 8, 2022

There is a no free lunch property: trying to beat the Go memory allocator from Go itself involves tradeoffs which, on average, are going to leave you in a bad shape

I don't really want to get into a big philosophical discussion about high performance Go software, but every major database that uses (or has used) this library (M3, Husky, InfluxDB, Pilosa, etc) takes extreme care to minimize unnecessary allocations in the critical path. Its trivial to beat the Go allocator, in Go, simply by virtue of understanding your application and leveraging tasteful amount of object pooling and reuse.

Leaning on the Go G.C for 95% of allocations is a great decision and a major reason why people are able to build large scale datasystems so quickly in Go, but most of those systems would be dead in the water performance wise if they didn't bypass the default allocator (at least to some degree) in their critical paths.

That said I am open to an alternatives implementations. The proposal you provided works for the narrow use-case in my original issue, but it doesn't remove the allocation caused by the call to WriteTo which would require exposing a new method that allows a reusable scratch buffer to be provided. Would you be open to that as well?

Alternatively, I think a solution that would work for a much wider variety of performance-critical use-cases would be to allow the bitmap types to be constructed with an optional "allocator". So something like this:

type Allocator interface {
    AllocateBytes(size, capacity int) []byte
    AllocateInt16s(size, capacity int) []int16
    ...
}

The existing Bitmaps would have a nil or defaultAllocator that just allocates like normal: return make([]byte, size, capacity) but it would allow performance-sensitive users to inject their own allocator as needed. So for my concrete example originally I would do something like this:

allocator := NewMyAllocator()
bf := roaring.NewBitmapWithAllocator(allocator)
for i := 0; i < b.N; i++ {
        for j := 0; j < 8; j++ {
                // Allocating the sparse container leverages the allocator.
                l.Add(uint32(j))
        }
       buf.Reset()
       // Allocating the temporary buffer in this method leverages the allocator also.
        _, err := l.WriteTo(buf)
        if err != nil {
                panic(err)
        }
}

The Bitmap code will need to call allocator.AllocateBytes() instead of allocating directly as a fairly trivial 1-line code change, but there will be no need to "return" buffers to the pool. It can be the callers responsibility to track which buffers have been assigned to which bitmaps and only allow them to be reused once those bitmaps are reset or no longer relevant.

I think that would strike a good balance between allowing callers to control allocations if they're willing to absorb the complexity in their application, while keeping additional complexity in the Bitmap implementation to a minimum (call a function to allocate instead of using make(), but thats it).

I went and looked through our codebase for the other areas where the allocations caused by this library are major sources of CPU and GC waste in our applications and almost all of them would be optimizeable to almost nothing with a custom allocator.

For example, another use-case we have is "compacting" together files that contain many of these bitmaps in an M:N fashion in a single-thread (sometimes compacting millions of bitmaps in a very short period of time). We suffer a lot of allocations from reading the M bitmaps in tight loops just to immediately throw them away after merging them into N output bitmaps. The allocator pattern would allow us to spawn an allocator for each input M that reuses across bitmaps and amortize the allocations significantly (probably 95%+ less allocations). Same for the outputs.

@richardartoul
Copy link
Contributor Author

@lemire to validate my suggestion, I made a quick draft: #366

Its not bad. Its still significantly better than not having a custom allocator, but it still suffers from allocating the &arrayContainer{}. I dont know if there is a good way to resolve that other than allowing the allocator to provide an entire container (but the implementations are private?) or allowing some of the *arrayContainers to be reused...

*arrayContainer could also be changed to arrayContainer so the methods are implemented using non-pointer receivers, but I think that might introduce the possibility of subtle bugs so maybe not wise.

Maybe its good enough. Let me know what you think about this approach!

@lemire
Copy link
Member

lemire commented Sep 8, 2022

Yes, the ability to provide a custom memory allocator is a reasonable path if it can be implemented in a reasonable manner. For that to be effective, you need to have both allocation and deallocation functions. One should be mindful that a container can belong to several bitmaps at once (due to 'copy on write'). Evidently, it would need to be safe in a multithreaded context (the Go garbage collector is thread safe). If you can address these concerns and get a performance boost in some cases, it would be great.

Note that they are related issues, e.g., #228

@richardartoul
Copy link
Contributor Author

@lemire I replied in the P.R draft, but I think my proposal works for copyOnWrite by default (any containers which must be copied will already be copied by the existing implementation). Thread-safety is an implementation detail to be handled by users who choose to write their own allocators (for my use-case for example, thread-safety is not a concern since I intend to spawn an allocator per-thread basically).

I don't think deallocation functions are required for many use-cases. I could add them to the interface if you like, but it will result in more changes and the P.R will end up looking more like my original P.R I think. I suspect that most use-cases can be addressed without deallocator functions. Some things would require a deallocator, but I wonder if it would be easier to start with a simpler pattern and expand the interface as we go?

@richardartoul
Copy link
Contributor Author

#228 is interesting yeah, although a large change. ~ the same thing could be accomplished by allowing the allocator to provide container wrappers FWIW.

@lemire
Copy link
Member

lemire commented Sep 8, 2022

@richardartoul I do not understand how you can safely reuse an array with a new bitmap without knowing that the array is now available.

@richardartoul
Copy link
Contributor Author

@lemire Let me give a few examples in pseudocode that I think are probably relevant for many different systems:

Ingestion

allocator := NewAllocator()
roaring := roaring.NewWithAllocator(allocator)

for _, buffer := range buffers {
    // This will make Bitmap release any references internally, and so
    // when the allocator provides them again, its fine.
    roaring.Reset()
    for _, v := range buffer {
        // Allocations caused by this method will grab from the allocator.
        roaring.Add(v)
    }
    // Once this method is complete, any datastructures taken from the
    // allocator are safe to be reused.
    roaring.WriteTo(stream)
}

Compaction

inputStreams := openInputStreams()
bitmaps := make([]*roaring.Bitmap,0,len(allocators)
for _, stream := range inputStreams {
    bitmaps = append(bitmaps, roaring.NewWithAllocator(custom.NewMyCustomAllocator())
}

output := roaring.NewWithAllocator(roaring.NewWithAllocator(custom.NewMyCustomAllocator())
for streamsAreNotEmpty() {
    for i, stream := range inputStreams {
        bitmaps[i].Reset()
        // Any containers that need to be allocated by this method can pull from the allocator
        // which basically never allocates, just returns objects from some "pool" it creates one time.
        bitmaps[i].ReadFrom(stream)
    }

    output.Reset()
    for _, bitmap := range bitmaps {
         // New containers allocated by this pull from allocator.
         output.UnionInPlace(bitmap)
    }

    // After this method completes we "know" its safe to reuse output and
    // its allocator (which always returns from the same pool of objects) which
    // allows us to eliminate allocations even though we never explicitly return
    // anything to the pool.
    //
    // After this method all the input bitmaps (and their corresponding allocators)
    // are safe to reuse also.
    output.WriteTo(outputStream)
}

There is a corresponding equivalent for query as well, but this is already a long comment.

@richardartoul
Copy link
Contributor Author

Basically I as the user/programmer know that at some point that the datastructures that were provided to the Bitmap by the allocator will never be touched again, even if they are never explicitly returned with a free() method I know that my allocator can confidently reuse them.

@richardartoul
Copy link
Contributor Author

I guess one thing I left out in my pseudocode is that the allocator would probably need some kind of Reset() method to indicate that the object it had given out were now free for reuse. But that does not need to be handled by the roaring library at all, I can add a Reset() method on my allocator and call it myself. Alternatively, the Clear() method in the Roaring library could call Rest() on the allocator if its defined.

@lemire
Copy link
Member

lemire commented Jan 23, 2023

Could other people chime in on this PR?

@damnever
Copy link
Contributor

I believe that internal pools may be necessary to maximize performance, and the bitmap should include a 'Reset' method to perform recycling tasks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants