
Optimize fenwick tree manipulations #23

Merged: 3 commits into caio:master, Nov 4, 2018

Conversation

@vmihailenco (Contributor) commented Nov 1, 2018

There are two separate changes:

  1. Create a bigger initial fenwick tree so we don't need to rebuild it on every add:
benchmark                             old ns/op     new ns/op     delta
BenchmarkAddMultipleTimes/10-4        6903          11206         +62.34%
BenchmarkAddMultipleTimes/100-4       100042        87661         -12.38%
BenchmarkAddMultipleTimes/1000-4      1581666       1184668       -25.10%
BenchmarkAddMultipleTimes/10000-4     6562377       5292517       -19.35%

benchmark                             old allocs     new allocs     delta
BenchmarkAddMultipleTimes/10-4        35             7              -80.00%
BenchmarkAddMultipleTimes/100-4       305            7              -97.70%
BenchmarkAddMultipleTimes/1000-4      1400           7              -99.50%
BenchmarkAddMultipleTimes/10000-4     2285           7              -99.69%

benchmark                             old bytes     new bytes     delta
BenchmarkAddMultipleTimes/10-4        13696         28816         +110.40%
BenchmarkAddMultipleTimes/100-4       100096        28816         -71.21%
BenchmarkAddMultipleTimes/1000-4      1878176       28816         -98.47%
BenchmarkAddMultipleTimes/10000-4     4976000       28816         -99.42%

When the number of adds is small, the benchmark becomes slower because the tree is bigger and few rebuilds are saved. But as the number of adds grows, the benchmark improves a lot.
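The idea behind this first commit can be sketched with a classic binary indexed (fenwick) tree that is sized once up front, so repeated adds never trigger a reallocation or rebuild. This is an illustrative sketch, not the library's actual code; all names here are hypothetical:

```go
package main

import "fmt"

// Bit is a binary indexed (fenwick) tree over uint32 counts, sized
// once for the largest expected number of entries so that adds never
// force a rebuild.
type Bit struct {
	tree []uint32
}

func NewBit(size int) *Bit {
	return &Bit{tree: make([]uint32, size+1)} // 1-based internally
}

// Add increments position i (0-based) by delta in O(log n).
func (b *Bit) Add(i int, delta uint32) {
	for j := i + 1; j < len(b.tree); j += j & (-j) {
		b.tree[j] += delta
	}
}

// PrefixSum returns the sum of positions [0, i] in O(log n).
func (b *Bit) PrefixSum(i int) uint32 {
	var s uint32
	for j := i + 1; j > 0; j -= j & (-j) {
		s += b.tree[j]
	}
	return s
}

func main() {
	b := NewBit(8)
	b.Add(0, 5)
	b.Add(3, 2)
	b.Add(3, 1)
	fmt.Println(b.PrefixSum(3)) // 8
	fmt.Println(b.PrefixSum(2)) // 5
}
```

The trade-off shown in the tables follows directly: a larger upfront allocation (worse for tiny workloads) in exchange for eliminating per-add rebuild allocations.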

  2. Fork the fenwick tree implementation to use uint32 instead of int64, so we no longer construct an extra array just to pass it to the fenwick lib:
benchmark                             old ns/op     new ns/op     delta
BenchmarkAddMultipleTimes/10-4        11206         8562          -23.59%
BenchmarkAddMultipleTimes/100-4       87661         74857         -14.61%
BenchmarkAddMultipleTimes/1000-4      1184668       1104090       -6.80%
BenchmarkAddMultipleTimes/10000-4     5292517       5050204       -4.58%

benchmark                             old allocs     new allocs     delta
BenchmarkAddMultipleTimes/10-4        7              6              -14.29%
BenchmarkAddMultipleTimes/100-4       7              6              -14.29%
BenchmarkAddMultipleTimes/1000-4      7              6              -14.29%
BenchmarkAddMultipleTimes/10000-4     7              6              -14.29%

benchmark                             old bytes     new bytes     delta
BenchmarkAddMultipleTimes/10-4        28816         16528         -42.64%
BenchmarkAddMultipleTimes/100-4       28816         16528         -42.64%
BenchmarkAddMultipleTimes/1000-4      28816         16528         -42.64%
BenchmarkAddMultipleTimes/10000-4     28816         16528         -42.64%
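The allocation that this second commit removes is the conversion copy: counts kept as []uint32 had to be turned into a fresh []int64 slice on every rebuild just to match the upstream lib's int64 API. A hypothetical illustration of that per-rebuild copy (not the PR's code; the function name is made up):

```go
package main

import "fmt"

// toInt64 copies uint32 counts into a fresh []int64 slice. Before the
// fork, a copy like this had to be made on every fenwick tree rebuild
// just to satisfy the upstream lib's int64 API; forking the tree to
// uint32 removes this allocation entirely.
func toInt64(counts []uint32) []int64 {
	out := make([]int64, len(counts)) // the allocation the fork avoids
	for i, c := range counts {
		out[i] = int64(c)
	}
	return out
}

func main() {
	fmt.Println(toInt64([]uint32{1, 2, 3})) // [1 2 3]
}
```

This matches the tables above: one fewer allocation per run, and a smaller working set since uint32 halves the tree's memory.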

The cumulative result of the two commits is as follows:

benchmark                             old ns/op     new ns/op     delta
BenchmarkAddMultipleTimes/10-4        6903          8562          +24.03%
BenchmarkAddMultipleTimes/100-4       100042        74857         -25.17%
BenchmarkAddMultipleTimes/1000-4      1581666       1104090       -30.19%
BenchmarkAddMultipleTimes/10000-4     6562377       5050204       -23.04%

benchmark                             old allocs     new allocs     delta
BenchmarkAddMultipleTimes/10-4        35             6              -82.86%
BenchmarkAddMultipleTimes/100-4       305            6              -98.03%
BenchmarkAddMultipleTimes/1000-4      1400           6              -99.57%
BenchmarkAddMultipleTimes/10000-4     2285           6              -99.74%

benchmark                             old bytes     new bytes     delta
BenchmarkAddMultipleTimes/10-4        13696         16528         +20.68%
BenchmarkAddMultipleTimes/100-4       100096        16528         -83.49%
BenchmarkAddMultipleTimes/1000-4      1878176       16528         -99.12%
BenchmarkAddMultipleTimes/10000-4     4976000       16528         -99.67%

So it is only slightly slower for a small number of inserts, and if you choose a smaller compression that should not matter at all.

In my projects the speedup is measurable too: from 16 seconds to 6 seconds.

@vmihailenco (Contributor, Author) commented:

For the record, this is the benchmark for compression=20, which simply means a smaller initial fenwick tree:

benchmark                             old ns/op     new ns/op     delta
BenchmarkAddCompression1-4            162           164           +1.23%
BenchmarkAddCompression10-4           204           205           +0.49%
BenchmarkAddCompression100-4          268           267           -0.37%
BenchmarkAddMultipleTimes/10-4        4486          3925          -12.51%
BenchmarkAddMultipleTimes/100-4       59962         35770         -40.35%
BenchmarkAddMultipleTimes/1000-4      431884        303813        -29.65%
BenchmarkAddMultipleTimes/10000-4     2800701       2416271       -13.73%

benchmark                             old allocs     new allocs     delta
BenchmarkAddCompression1-4            0              0              +0.00%
BenchmarkAddCompression10-4           0              0              +0.00%
BenchmarkAddCompression100-4          0              0              +0.00%
BenchmarkAddMultipleTimes/10-4        36             7              -80.56%
BenchmarkAddMultipleTimes/100-4       213            7              -96.71%
BenchmarkAddMultipleTimes/1000-4      438            7              -98.40%
BenchmarkAddMultipleTimes/10000-4     641            7              -98.91%

benchmark                             old bytes     new bytes     delta
BenchmarkAddCompression1-4            0             0             +0.00%
BenchmarkAddCompression10-4           0             0             +0.00%
BenchmarkAddCompression100-4          4             0             -100.00%
BenchmarkAddMultipleTimes/10-4        4112          3744          -8.95%
BenchmarkAddMultipleTimes/100-4       45104         3744          -91.70%
BenchmarkAddMultipleTimes/1000-4      183056        3744          -97.95%
BenchmarkAddMultipleTimes/10000-4     393456        3744          -99.05%

@caio (Owner) commented Nov 1, 2018

Whoa, this is awesome. Thanks a lot for the patches and the well-detailed PR!

I'll make time to look at it ASAP (this Saturday, most likely), but from a quick glance it looks great already.

@vmihailenco (Contributor, Author) commented:
👍

Overall my understanding is that disabling the fenwick tree until the t-digest is saturated (the number of centroids is close to the maximum allowed by the compression) is going to be beneficial in most workloads. In other cases we still need to rebuild the fenwick tree too often, and the net result is that code with the fenwick tree is slower than code that uses a plain for loop.
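That suggestion can be sketched as a hybrid prefix-sum structure: scan linearly while the digest is unsaturated (centroid positions still shift, so a fenwick tree would need constant rebuilds), then build the tree once at saturation. This is a hypothetical sketch of the idea, not the library's implementation; all names are made up:

```go
package main

import "fmt"

// hybridSum answers prefix-sum queries over centroid counts. Below
// saturation it scans linearly; once saturated it builds a fenwick
// tree one time and queries that instead.
type hybridSum struct {
	counts []uint32
	tree   []uint32 // fenwick layout, built lazily at saturation
	max    int      // centroid cap implied by the compression setting
}

func (h *hybridSum) saturated() bool { return len(h.counts) >= h.max }

// build constructs the fenwick tree from counts in one pass.
func (h *hybridSum) build() {
	h.tree = make([]uint32, len(h.counts)+1)
	for i, c := range h.counts {
		for j := i + 1; j < len(h.tree); j += j & (-j) {
			h.tree[j] += c
		}
	}
}

// prefixSum returns the sum of counts[0..i].
func (h *hybridSum) prefixSum(i int) uint32 {
	var s uint32
	if !h.saturated() {
		for _, c := range h.counts[:i+1] { // plain loop beats tree upkeep here
			s += c
		}
		return s
	}
	if h.tree == nil {
		h.build() // one-time build; no further rebuilds once saturated
	}
	for j := i + 1; j > 0; j -= j & (-j) {
		s += h.tree[j]
	}
	return s
}

func main() {
	h := &hybridSum{counts: []uint32{5, 0, 2, 3}, max: 4}
	fmt.Println(h.prefixSum(2)) // 7
}
```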

@caio caio merged commit cbc82d5 into caio:master Nov 4, 2018
@caio (Owner) commented Nov 4, 2018

I see no disadvantages with this patch; it should even help a bit with Merge-heavy workloads (there's some discussion about one at #20) too. Merged, thanks a lot again!

@caio (Owner) commented Nov 4, 2018

For the record, I think (I haven't verified) that the fenwick tree change to use uint32 only works because our input is always sorted; if it could be unordered, this would likely break:

func (l *List) Append(n uint32) {
    i := len(l.tree)
    l.tree = append(l.tree, 0)
    l.tree[i] = n - l.Get(i)
}
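The failure mode can be made concrete. Assuming l.Get(i) returns the running total up to index i, sorted input guarantees n >= l.Get(i), so the subtraction never underflows; with unordered input the uint32 arithmetic would silently wrap instead of going negative. A small standalone demonstration of the wraparound (not the library's code):

```go
package main

import "fmt"

func main() {
	// If Append received an n smaller than the running total l.Get(i),
	// which can only happen with unordered input, the uint32
	// subtraction wraps around instead of producing a negative delta:
	var n, prefix uint32 = 3, 10
	fmt.Println(n - prefix)           // 4294967289 (wrapped)
	fmt.Println(int64(3) - int64(10)) // -7 (what the old int64 code stored)
}
```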

@vmihailenco (Contributor, Author) commented:
I see. I guess that is fine as long as it improves performance :)

I've tried the honeycomb fork, and performance is 40% better for my workload (lots of merges) even with this patch. But either its binary serialization is incompatible or I am using it wrong, because I get negative quantiles when I replace caio/go-tdigest with honeycomb/go-tdigest. So I guess I will continue improving the performance of this lib using ideas from the honeycomb fork...
