
Optimize fenwick tree manipulations #23

Merged: 3 commits into caio:master, Nov 4, 2018

Conversation

@vmihailenco (Contributor) commented Nov 1, 2018

There are two separate changes:

  1. Create a bigger initial fenwick tree so we don't need to rebuild it on every add:
benchmark                             old ns/op     new ns/op     delta
BenchmarkAddMultipleTimes/10-4        6903          11206         +62.34%
BenchmarkAddMultipleTimes/100-4       100042        87661         -12.38%
BenchmarkAddMultipleTimes/1000-4      1581666       1184668       -25.10%
BenchmarkAddMultipleTimes/10000-4     6562377       5292517       -19.35%

benchmark                             old allocs     new allocs     delta
BenchmarkAddMultipleTimes/10-4        35             7              -80.00%
BenchmarkAddMultipleTimes/100-4       305            7              -97.70%
BenchmarkAddMultipleTimes/1000-4      1400           7              -99.50%
BenchmarkAddMultipleTimes/10000-4     2285           7              -99.69%

benchmark                             old bytes     new bytes     delta
BenchmarkAddMultipleTimes/10-4        13696         28816         +110.40%
BenchmarkAddMultipleTimes/100-4       100096        28816         -71.21%
BenchmarkAddMultipleTimes/1000-4      1878176       28816         -98.47%
BenchmarkAddMultipleTimes/10000-4     4976000       28816         -99.42%

When the number of adds is small, the benchmark becomes slower because the tree is bigger and few rebuilds are saved. But as the number of adds grows, the benchmark improves a lot.
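The idea behind this first commit can be sketched with a classic binary indexed (fenwick) tree that is sized once up front, so repeated adds never trigger a reallocation or rebuild. This is an illustrative sketch, not the library's actual code; all names here are hypothetical:

```go
package main

import "fmt"

// Bit is a binary indexed (fenwick) tree over uint32 counts, sized
// once for the largest expected number of entries so that adds never
// force a rebuild.
type Bit struct {
	tree []uint32
}

func NewBit(size int) *Bit {
	return &Bit{tree: make([]uint32, size+1)} // 1-based internally
}

// Add increments position i (0-based) by delta in O(log n).
func (b *Bit) Add(i int, delta uint32) {
	for j := i + 1; j < len(b.tree); j += j & (-j) {
		b.tree[j] += delta
	}
}

// PrefixSum returns the sum of positions [0, i] in O(log n).
func (b *Bit) PrefixSum(i int) uint32 {
	var s uint32
	for j := i + 1; j > 0; j -= j & (-j) {
		s += b.tree[j]
	}
	return s
}

func main() {
	b := NewBit(8)
	b.Add(0, 5)
	b.Add(3, 2)
	b.Add(3, 1)
	fmt.Println(b.PrefixSum(3)) // 8
	fmt.Println(b.PrefixSum(2)) // 5
}
```

The trade-off shown in the tables follows directly: a larger upfront allocation (worse for tiny workloads) in exchange for eliminating per-add rebuild allocations.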

  2. Fork the fenwick tree implementation to use uint32 instead of int64, so we no longer construct an extra array just to pass it to the fenwick lib:
benchmark                             old ns/op     new ns/op     delta
BenchmarkAddMultipleTimes/10-4        11206         8562          -23.59%
BenchmarkAddMultipleTimes/100-4       87661         74857         -14.61%
BenchmarkAddMultipleTimes/1000-4      1184668       1104090       -6.80%
BenchmarkAddMultipleTimes/10000-4     5292517       5050204       -4.58%

benchmark                             old allocs     new allocs     delta
BenchmarkAddMultipleTimes/10-4        7              6              -14.29%
BenchmarkAddMultipleTimes/100-4       7              6              -14.29%
BenchmarkAddMultipleTimes/1000-4      7              6              -14.29%
BenchmarkAddMultipleTimes/10000-4     7              6              -14.29%

benchmark                             old bytes     new bytes     delta
BenchmarkAddMultipleTimes/10-4        28816         16528         -42.64%
BenchmarkAddMultipleTimes/100-4       28816         16528         -42.64%
BenchmarkAddMultipleTimes/1000-4      28816         16528         -42.64%
BenchmarkAddMultipleTimes/10000-4     28816         16528         -42.64%
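The allocation that this second commit removes is the conversion copy: counts kept as []uint32 had to be turned into a fresh []int64 slice on every rebuild just to match the upstream lib's int64 API. A hypothetical illustration of that per-rebuild copy (not the PR's code; the function name is made up):

```go
package main

import "fmt"

// toInt64 copies uint32 counts into a fresh []int64 slice. Before the
// fork, a copy like this had to be made on every fenwick tree rebuild
// just to satisfy the upstream lib's int64 API; forking the tree to
// uint32 removes this allocation entirely.
func toInt64(counts []uint32) []int64 {
	out := make([]int64, len(counts)) // the allocation the fork avoids
	for i, c := range counts {
		out[i] = int64(c)
	}
	return out
}

func main() {
	fmt.Println(toInt64([]uint32{1, 2, 3})) // [1 2 3]
}
```

This matches the tables above: one fewer allocation per run, and a smaller working set since uint32 halves the tree's memory.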

The cumulative result of the two commits is as follows:

benchmark                             old ns/op     new ns/op     delta
BenchmarkAddMultipleTimes/10-4        6903          8562          +24.03%
BenchmarkAddMultipleTimes/100-4       100042        74857         -25.17%
BenchmarkAddMultipleTimes/1000-4      1581666       1104090       -30.19%
BenchmarkAddMultipleTimes/10000-4     6562377       5050204       -23.04%

benchmark                             old allocs     new allocs     delta
BenchmarkAddMultipleTimes/10-4        35             6              -82.86%
BenchmarkAddMultipleTimes/100-4       305            6              -98.03%
BenchmarkAddMultipleTimes/1000-4      1400           6              -99.57%
BenchmarkAddMultipleTimes/10000-4     2285           6              -99.74%

benchmark                             old bytes     new bytes     delta
BenchmarkAddMultipleTimes/10-4        13696         16528         +20.68%
BenchmarkAddMultipleTimes/100-4       100096        16528         -83.49%
BenchmarkAddMultipleTimes/1000-4      1878176       16528         -99.12%
BenchmarkAddMultipleTimes/10000-4     4976000       16528         -99.67%

So it is only slightly slower for a small number of inserts, and if you choose a smaller compression that should not matter at all.

In my projects the speedup is measurable too: from 16 seconds to 6 seconds.

@vmihailenco (Contributor, Author) commented:

For the record, this is the benchmark for compression=20, which simply means a smaller initial fenwick tree:

benchmark                             old ns/op     new ns/op     delta
BenchmarkAddCompression1-4            162           164           +1.23%
BenchmarkAddCompression10-4           204           205           +0.49%
BenchmarkAddCompression100-4          268           267           -0.37%
BenchmarkAddMultipleTimes/10-4        4486          3925          -12.51%
BenchmarkAddMultipleTimes/100-4       59962         35770         -40.35%
BenchmarkAddMultipleTimes/1000-4      431884        303813        -29.65%
BenchmarkAddMultipleTimes/10000-4     2800701       2416271       -13.73%

benchmark                             old allocs     new allocs     delta
BenchmarkAddCompression1-4            0              0              +0.00%
BenchmarkAddCompression10-4           0              0              +0.00%
BenchmarkAddCompression100-4          0              0              +0.00%
BenchmarkAddMultipleTimes/10-4        36             7              -80.56%
BenchmarkAddMultipleTimes/100-4       213            7              -96.71%
BenchmarkAddMultipleTimes/1000-4      438            7              -98.40%
BenchmarkAddMultipleTimes/10000-4     641            7              -98.91%

benchmark                             old bytes     new bytes     delta
BenchmarkAddCompression1-4            0             0             +0.00%
BenchmarkAddCompression10-4           0             0             +0.00%
BenchmarkAddCompression100-4          4             0             -100.00%
BenchmarkAddMultipleTimes/10-4        4112          3744          -8.95%
BenchmarkAddMultipleTimes/100-4       45104         3744          -91.70%
BenchmarkAddMultipleTimes/1000-4      183056        3744          -97.95%
BenchmarkAddMultipleTimes/10000-4     393456        3744          -99.05%

@caio (Owner) commented Nov 1, 2018

Whoa, this is awesome. Thanks a lot for the patches and the well-detailed PR!

I'll make time to look at it ASAP (this Saturday, most likely), but from a quick glance it looks great already.

@vmihailenco (Contributor, Author) commented:
👍

Overall my understanding is that disabling the fenwick tree until the t-digest is saturated (the number of centroids is close to the maximum allowed by the compression) is going to be beneficial in most workloads. In other cases we still need to rebuild the fenwick tree too often, and the net result is that code with the fenwick tree is slower than code that uses a plain for loop.
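That suggestion can be sketched as a hybrid prefix-sum structure: scan linearly while the digest is unsaturated (centroid positions still shift, so a fenwick tree would need constant rebuilds), then build the tree once at saturation. This is a hypothetical sketch of the idea, not the library's implementation; all names are made up:

```go
package main

import "fmt"

// hybridSum answers prefix-sum queries over centroid counts. Below
// saturation it scans linearly; once saturated it builds a fenwick
// tree one time and queries that instead.
type hybridSum struct {
	counts []uint32
	tree   []uint32 // fenwick layout, built lazily at saturation
	max    int      // centroid cap implied by the compression setting
}

func (h *hybridSum) saturated() bool { return len(h.counts) >= h.max }

// build constructs the fenwick tree from counts in one pass.
func (h *hybridSum) build() {
	h.tree = make([]uint32, len(h.counts)+1)
	for i, c := range h.counts {
		for j := i + 1; j < len(h.tree); j += j & (-j) {
			h.tree[j] += c
		}
	}
}

// prefixSum returns the sum of counts[0..i].
func (h *hybridSum) prefixSum(i int) uint32 {
	var s uint32
	if !h.saturated() {
		for _, c := range h.counts[:i+1] { // plain loop beats tree upkeep here
			s += c
		}
		return s
	}
	if h.tree == nil {
		h.build() // one-time build; no further rebuilds once saturated
	}
	for j := i + 1; j > 0; j -= j & (-j) {
		s += h.tree[j]
	}
	return s
}

func main() {
	h := &hybridSum{counts: []uint32{5, 0, 2, 3}, max: 4}
	fmt.Println(h.prefixSum(2)) // 7
}
```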

@caio caio merged commit cbc82d5 into caio:master Nov 4, 2018
@caio (Owner) commented Nov 4, 2018

I see no disadvantages with this patch; it should even help a bit with Merge-heavy workloads (there's some discussion about one at #20) too. Merged, thanks a lot again!

@caio (Owner) commented Nov 4, 2018

For the record, I think (I haven't verified) that the fenwick tree change to use uint32 only works because our input is always sorted; if it could be unordered, this would likely break:

func (l *List) Append(n uint32) {
    i := len(l.tree)
    l.tree = append(l.tree, 0)
    l.tree[i] = n - l.Get(i)
}
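The failure mode can be made concrete. Assuming l.Get(i) returns the running total up to index i, sorted input guarantees n >= l.Get(i), so the subtraction never underflows; with unordered input the uint32 arithmetic would silently wrap instead of going negative. A small standalone demonstration of the wraparound (not the library's code):

```go
package main

import "fmt"

func main() {
	// If Append received an n smaller than the running total l.Get(i),
	// which can only happen with unordered input, the uint32
	// subtraction wraps around instead of producing a negative delta:
	var n, prefix uint32 = 3, 10
	fmt.Println(n - prefix)           // 4294967289 (wrapped)
	fmt.Println(int64(3) - int64(10)) // -7 (what the old int64 code stored)
}
```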

@vmihailenco (Contributor, Author) commented:
I see. I guess that is fine as long as it improves performance :)

I've tried the honeycomb fork, and performance is 40% better for my workload (lots of merges) even with this patch. But either its binary serialization is incompatible or I am using it wrong, because I get negative quantiles when I replace caio/go-tdigest with honeycomb/go-tdigest. So I guess I will continue improving the performance of this lib using ideas from the honeycomb fork...
