Adding AVLTreeDigest function to set a specific random seed #183

Merged: 3 commits into tdunning:main on Jan 4, 2022

cedric-hansen
Contributor

The random element in TDigest can cause some unpredictability in certain use cases.
This commit adds a second constructor to AVLTreeDigest, which allows a specific random seed to be used.

Tests have been added to verify that this option does not change the behaviour of the standard AVLTreeDigest constructor

@cedric-hansen cedric-hansen changed the title from "Adding AVLTreeDigest option to use a specific random seed" to "WIP: Adding AVLTreeDigest option to use a specific random seed" Dec 22, 2021
@cedric-hansen cedric-hansen marked this pull request as draft December 22, 2021 18:59
Owner

@tdunning tdunning left a comment

I will need more review than this one comment.

Overall, I can see the value of being able to nail down randomness, but I don't like the idea of arbitrary types being deserialized.

Have you created a corresponding issue for this pull request?

ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
try {
    ObjectInputStream ois = new ObjectInputStream(bais);
    Random r = (Random) ois.readObject();
Owner

How is the object here constrained? I don't like creating objects of unknown type based on user data.

Contributor Author

@cedric-hansen cedric-hansen Dec 23, 2021

bytes here isn't constrained - it was meant to clean up the code in fromBytes() a bit, but in retrospect, you're right - it doesn't really make sense to expose this function to the rest of the class.

I'll make a note to move this into fromBytes() if serializing the Random object ends up being the path forward after some additional discussion.
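
For reference, the concern raised above is that `readObject()` instantiates whatever class the byte stream names. One way such a read could be constrained is the JDK's `ObjectInputFilter` (Java 9 and later). The sketch below is only an illustration of that technique, not code from this PR (which ultimately dropped the serialization approach); the class `RestrictedRandomReader` and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidClassException;
import java.io.ObjectInputFilter;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Random;

// Hypothetical helper, not part of t-digest.
final class RestrictedRandomReader {
    // Allow exactly java.util.Random; reject every other class in the stream.
    private static final ObjectInputFilter RANDOM_ONLY =
            ObjectInputFilter.Config.createFilter("java.util.Random;!*");

    static Random readRandom(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            // The filter is consulted before any class from the stream is instantiated.
            ois.setObjectInputFilter(RANDOM_ONLY);
            return (Random) ois.readObject();
        }
    }

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
            oos.writeObject(o);
        }
        return baos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Random original = new Random(42L);
        original.nextDouble(); // advance the state before persisting it
        Random restored = readRandom(serialize(original));
        System.out.println("same next value: " + (original.nextLong() == restored.nextLong()));

        try {
            readRandom(serialize(new ArrayList<String>()));
        } catch (InvalidClassException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With this pattern, subclasses of `Random` are rejected as well, since the filter matches on the exact class name.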

@tdunning
Owner

tdunning commented Dec 23, 2021 via email

@cedric-hansen
Contributor Author

Thanks for taking a first look at this PR so quickly, @tdunning!

> I will need more review than this one comment.
>
> Overall, I can see the value of being able to nail down randomness, but I don't like the idea of arbitrary types being deserialized.

Fair enough. I'll try to think of some other solutions myself. For a bit of extra context, what my team and I are aiming for is a way to consistently produce the exact same data structure, given the same points and, potentially, new data points. At the moment, when we recreate an AVLTreeDigest, the random element in the implementation results in a slightly different data structure, which is undesirable for us. Using a specific Random object with a specific seed allows us to recreate the exact same data structure when we create a new tree with old data, at which point the Random object is restored to the same state.

> Have you created a corresponding issue for this pull request?

I have not. If you'd like me to outline in greater detail the rationale as to why my team and I are looking to add this functionality, then I'd be happy to create an issue.

> Is there a reason you are using AVLTreeDigest instead of the MergingDigest?

My team and I ran some benchmarks with the various TDigest implementations, and chose AVLTreeDigest with a small compression factor (5), since it runs significantly faster than other implementations, estimates the median value within our acceptable margin of error, and has a fairly small memory footprint. Speed is arguably the most important performance metric for us, and AVLTreeDigest was the fastest according to our testing - is this consistent with your findings/knowledge? If there is a way to speed up other implementations without sacrificing the accuracy of tDigest.quantile(0.5), then we would certainly be open to testing that out to see if it better suits our particular use case.

@tdunning
Owner

tdunning commented Dec 23, 2021 via email

@cedric-hansen
Contributor Author

> Yeah... an issue would be a good thing.

Sounds good, I'll start working on one - likely won't submit until after the holidays.

Thanks for sharing that issue. I ran into it a few times during my testing; good to know that it's a transient failure.

I just re-ran some experiments comparing AVLTreeDigest and MergingDigest and, surprisingly, they were roughly the same speed; however, AVLTreeDigest was significantly more accurate (~10x closer to the median than MergingDigest). My original experiments were with TDigest 3.1, where MergingDigest was noticeably slower than AVLTreeDigest. In 3.1 the speed seems to be about the same in both implementations.

I'll talk to my team after the holidays to see if the accuracy of using MergingDigest(5) is within our margin of error. If so, then we might try that approach instead of AVLTreeDigest with some of the work/ideas outlined in this PR.

@tdunning
Owner

tdunning commented Dec 23, 2021 via email

The random element in TDigest can cause some unpredictability in certain use cases.
This commit adds a second constructor to `AVLTreeDigest`, which allows a specific random obj to be used.
If this constructor is used, then the Random object will be persisted, such that the random number generation
is consistent.

Tests have been added to verify that this option does not change the behaviour of the standard `AVLTreeDigest` constructor
@cedric-hansen
Contributor Author

This work is related to #185

@cedric-hansen
Contributor Author

I've had a chance to rethink the overall approach to tackling issue #185, and have made the appropriate code changes.
It's really as simple as adding a function to allow a specific seed to be used in the random object.

If someone wants to change the seed once, they can just call setRandomSeed(<some_seed>) right after calling the constructor. This results in the same sequence of random numbers being generated, and ensures that an identical tree can be built if the same set of numbers is added in the same sequence. Alternatively, if someone wants to use some other seed before each add() call, then setRandomSeed() can be used there as well.

This approach is far simpler, makes trees deterministic (if that's the desired outcome), and doesn't change any serialization/deserialization (which maintains some backwards compatibility).
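
The "seed once, then add in the same order" property described above can be sketched with a small stand-in class. This is a toy, not the t-digest implementation: `ToySeededDigest`, its clustering rule, and the `long` parameter type of `setRandomSeed` are all assumptions made for illustration. The only thing it is meant to show is that reseeding a `java.util.Random` makes every later random decision reproducible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy stand-in for a randomized digest; NOT the t-digest implementation.
final class ToySeededDigest {
    private final Random gen = new Random();
    private final List<Double> means = new ArrayList<>();
    private final List<Integer> counts = new ArrayList<>();

    // Mirrors the setter discussed above; the long parameter is an assumption.
    void setRandomSeed(long seed) {
        gen.setSeed(seed);
    }

    void add(double x) {
        // A random draw decides whether x starts a new cluster or is merged.
        if (means.isEmpty() || gen.nextDouble() < 0.3) {
            means.add(x);
            counts.add(1);
            return;
        }
        // Otherwise merge x into the nearest existing cluster.
        int nearest = 0;
        for (int i = 1; i < means.size(); i++) {
            if (Math.abs(means.get(i) - x) < Math.abs(means.get(nearest) - x)) {
                nearest = i;
            }
        }
        int n = counts.get(nearest) + 1;
        means.set(nearest, means.get(nearest) + (x - means.get(nearest)) / n);
        counts.set(nearest, n);
    }

    String state() {
        return means + " " + counts;
    }

    // Seeds once, right after construction, then adds the data in order.
    static ToySeededDigest build(long seed, double[] data) {
        ToySeededDigest d = new ToySeededDigest();
        d.setRandomSeed(seed);
        for (double x : data) {
            d.add(x);
        }
        return d;
    }

    public static void main(String[] args) {
        double[] data = {5, 1, 9, 3, 7, 2, 8, 4, 6};
        String first = build(42L, data).state();
        String second = build(42L, data).state();
        System.out.println("identical: " + first.equals(second));
    }
}
```

Without the `setRandomSeed` call, two such instances would usually (though not necessarily) end up with different internal state for the same input.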

A few open questions:

  • Should another add() method be added, which supports a random seed as a parameter? My gut feeling is no
  • Thoughts on a constructor taking in a seed, and then calling setRandomSeed() from there?

Happy to hear your thoughts on this simpler approach @tdunning

@tdunning
Owner

tdunning commented Jan 4, 2022

I like the simpler approach.

To your questions:

> Should another add() method be added, which supports a random seed as a parameter? My gut feeling is no

I concur. A setter is enough.

> Thoughts on a constructor taking in a seed, and then calling setRandomSeed() from there?

No strong opinion. I prefer simpler, but it is a two-liner so it isn't a big difference.

@pedrolamarao
Contributor

> • Thoughts on a constructor taking in a seed, and then calling setRandomSeed() from there?

Have you considered the option of having just the constructor parameter and not having the setter? That would assure some hypothetical machine verifier that the initialized object will necessarily give the expected result, since changing the seed after construction will be impossible. Not sure if this is an important property, but, if the original RNG is a final attribute, maybe it is.

@cedric-hansen
Contributor Author

> I like the simpler approach.
>
> To your questions:
>
> > Should another add() method be added, which supports a random seed as a parameter? My gut feeling is no
>
> I concur. A setter is enough.
>
> > Thoughts on a constructor taking in a seed, and then calling setRandomSeed() from there?
>
> No strong opinion. I prefer simpler, but it is a two-liner so it isn't a big difference.

Sounds good, works for me. With that said, I'll mark this PR as ready for final review.

@cedric-hansen cedric-hansen changed the title from "WIP: Adding AVLTreeDigest option to use a specific random seed" to "Adding AVLTreeDigest function to set a specific random seed" Jan 4, 2022
@cedric-hansen cedric-hansen marked this pull request as ready for review January 4, 2022 19:50
@cedric-hansen cedric-hansen requested a review from tdunning January 4, 2022 19:51
@tdunning
Owner

tdunning commented Jan 4, 2022 via email

Owner

@tdunning tdunning left a comment

Looks good to me!

@tdunning tdunning merged commit c670bba into tdunning:main Jan 4, 2022
@cedric-hansen
Contributor Author

> Have you considered the option of having just the constructor parameter and not having the setter? That would assure some hypothetical machine verifier that the initialized object will necessarily give the expected result, since changing the seed after construction will be impossible. Not sure if this is an important property, but, if the original RNG is a final attribute, maybe it is.

@pedrolamarao Good question, I'll defer that to @tdunning.

As for a case where setting the seed each time might be useful, consider the following:
We have a set of events {(x1, y1), (x2, y2), ...} that we want to add to the tree (specifically, the x component).
If we add these events in order, we can do something like setSeed(y_i) and then add(x_i), which will ensure we get a consistent tree, given the exact same set. In short, setting the seed at different points in time might prove to be useful. In other cases, setting it only once (or never) might be more suitable.
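
The per-event pattern above (`setSeed(y_i)`, then `add(x_i)`) can be sketched as follows. This is a toy under stated assumptions, not t-digest code: `PerEventSeedDemo`, its coin-flip `add`, and the `warmUpDraws` parameter are hypothetical. It illustrates one consequence of reseeding before every insertion: the random choice made for event i depends only on y_i, not on how much the generator was used beforehand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy illustration only; this class and its methods are hypothetical.
final class PerEventSeedDemo {
    private final Random gen = new Random();
    private final List<Double> kept = new ArrayList<>();
    private double mergedSum = 0.0;

    void setSeed(long seed) {
        gen.setSeed(seed);
    }

    // A coin flip decides whether x is kept as its own point or folded into a sum.
    void add(double x) {
        if (gen.nextBoolean()) {
            kept.add(x);
        } else {
            mergedSum += x;
        }
    }

    String state() {
        return kept + " merged=" + mergedSum;
    }

    // Replays events (x_i, y_i), reseeding with y_i before each add(x_i).
    // warmUpDraws simulates a generator that was already used for something else.
    static String replay(double[] xs, long[] ys, int warmUpDraws) {
        PerEventSeedDemo d = new PerEventSeedDemo();
        for (int i = 0; i < warmUpDraws; i++) {
            d.gen.nextDouble();
        }
        for (int i = 0; i < xs.length; i++) {
            d.setSeed(ys[i]);
            d.add(xs[i]);
        }
        return d.state();
    }

    public static void main(String[] args) {
        double[] xs = {3.0, 1.5, 4.0, 1.0, 5.5};
        long[] ys = {101L, 102L, 103L, 104L, 105L};
        System.out.println("identical: " + replay(xs, ys, 0).equals(replay(xs, ys, 17)));
    }
}
```

One caveat of this scheme is that two events with the same y value receive the same first random draw.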

@cedric-hansen
Contributor Author

@tdunning Is there a timeline for this PR to be included in a new release, perhaps v3.4?
I see on the readme there are a handful of items listed for the v4.0 release, so I'm not sure if that is the next anticipated release.

@tdunning
Owner

tdunning commented Jan 4, 2022 via email

@cedric-hansen
Contributor Author

cedric-hansen commented Jan 5, 2022

> Do you think that there is a viable way for me to delegate release management for 3.4 to you?

I'll check with my company's open source guidelines to see if we have a standard procedure for releasing a third party library, and I'll get back to you.

As a side note: I am just noticing I never squashed my commits in this PR. caa728d335c57b51deb979d32ac4eb8c3bca593b and addee2bfc122cdf32a17d9acfeea53027f25a76d commits can be removed entirely (one commit simply reverts the other - this is essentially the content of the initial PR where I was messing with serialization). I don't have write access to main, so I can't remove them myself. Just thought I'd give you the heads up (and apologize) that the commit history is a bit messy in main right now. The commit for c670bbac34261aeb0a2749d6c501ffa720847164 should read

Adding setter method for AVLTreeDigest gen object

    The random element in TDigest can cause some unpredictability in certain use cases, where a reproducible tree may be desirable.
    This commit adds a setter function which allows the seed of the `gen` object in `AVLTreeDigest` to be changed,
    allowing trees to be reproducible (if the same set of values in the same order are added to the tree)

@tdunning
Owner

tdunning commented Jan 5, 2022 via email

@tdunning
Owner

tdunning commented Jan 5, 2022 via email

@cedric-hansen
Contributor Author

> IF you manage the release, it would need to be via github processes, not a company process.

Sounds good. I have not heard anything back yet about my company's policy surrounding managing the release of an open source project that we don't "own". That being said, to err on the side of caution, my team can wait for whatever the next release will be. We've opted to use some (sigh) reflection as a temporary workaround.
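
The reflection workaround itself is not shown in this thread. One plausible shape, assuming the generator is held in a private `Random` field named `gen` (the proposed commit message above refers to "the `gen` object in `AVLTreeDigest`"), is sketched below against a stand-in class. `LegacyDigest` and `ReflectiveSeeder` are hypothetical names, and the stand-in is not the real `AVLTreeDigest`.

```java
import java.lang.reflect.Field;
import java.util.Random;

// Stand-in for a library class that hides its generator; not the real AVLTreeDigest.
final class LegacyDigest {
    private final Random gen = new Random();

    double nextDraw() {
        return gen.nextDouble();
    }
}

final class ReflectiveSeeder {
    // Looks up a private field by name and reseeds the Random it holds.
    static void setSeed(Object target, String fieldName, long seed)
            throws ReflectiveOperationException {
        Field f = target.getClass().getDeclaredField(fieldName);
        f.setAccessible(true); // bypass the private modifier
        ((Random) f.get(target)).setSeed(seed);
    }

    public static void main(String[] args) throws Exception {
        LegacyDigest a = new LegacyDigest();
        LegacyDigest b = new LegacyDigest();
        setSeed(a, "gen", 42L);
        setSeed(b, "gen", 42L);
        System.out.println("same draw: " + (a.nextDraw() == b.nextDraw()));
    }
}
```

A workaround of this kind depends on a private field name, so it can break silently when the library changes, which is presumably why it is described as temporary.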

If I hear anything back and get the green light to manage the release, I'll reach back out in this issue!

@tdunning
Owner

tdunning commented Jan 10, 2022 via email

@mccartney mccartney mentioned this pull request Oct 12, 2023