TFDV uses weird float value for sample_count of generated histograms #182

Open
liwii opened this issue Jul 16, 2021 · 3 comments

Comments

@liwii

liwii commented Jul 16, 2021

When I generate statistics from a .tfrecord file with generate_statistics_from_tfrecord, the resulting histograms contain weird float values as the sample_count of their buckets.
For example, a bucket that is supposed to contain 10 samples reports sample_count: 9.94000000834465 instead. How can I get the exact integer sample_count for each bucket?

Here's a Colab to reproduce.

@kennysong

Is there any update (or explanation) for this behavior?

@paulgc
Member

paulgc commented Sep 24, 2021

TFDV currently uses an approximate method to determine the bucket boundaries in a single pass, which is why the sample counts come out as floats. One option would be to post-process the generated statistics and round the values.
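A minimal sketch of what such rounding post-processing might look like. It assumes the statistics object follows the layout datasets → features → num_stats → histograms → buckets, with a sample_count field on each bucket. The helper name round_histogram_counts is hypothetical and not part of TFDV, and the demo uses a plain stand-in object rather than real TFDV output so it runs without TFDV installed.

```python
from types import SimpleNamespace


def round_histogram_counts(stats):
    """Round each numeric histogram bucket's sample_count in place.

    Hypothetical helper, not a TFDV API. Assumes `stats` exposes
    datasets -> features -> num_stats -> histograms -> buckets.
    """
    for dataset in stats.datasets:
        for feature in dataset.features:
            for histogram in feature.num_stats.histograms:
                for bucket in histogram.buckets:
                    # Keep the value a float, since the field is stored as one.
                    bucket.sample_count = float(round(bucket.sample_count))
    return stats


# Stand-in object mimicking the assumed layout (not real TFDV output).
bucket = SimpleNamespace(low_value=0.0, high_value=1.0,
                         sample_count=9.94000000834465)
histogram = SimpleNamespace(buckets=[bucket])
feature = SimpleNamespace(num_stats=SimpleNamespace(histograms=[histogram]))
demo_stats = SimpleNamespace(datasets=[SimpleNamespace(features=[feature])])

round_histogram_counts(demo_stats)
print(bucket.sample_count)  # 10.0
```

Note that rounding each bucket independently does not recover the exact counts: it only hides the approximation, and the rounded values may no longer sum to the true number of examples.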

@kennysong

Got it, thanks for the explanation. Are there any error bounds on the approximate counts (e.g., is each count within ±1 of the true count)?

5 participants