TFDV uses weird float value for sample_count of generated histograms #182

liwii · 2021-07-16T08:23:35Z

When I generate statistics from a .tfrecord file with generate_statistics_from_tfrecord, the resulting histograms report odd float values as the sample_count of each bucket.
For example, a bucket that should contain 10 samples reports sample_count: 9.94000000834465 instead. How can I get an exact integer sample_count for each bucket?

Here's a Colab to reproduce.


kennysong · 2021-09-24T03:26:03Z

Is there any update (or explanation) for this behavior?

paulgc · 2021-09-24T19:08:17Z

TFDV currently uses an approximate method to determine the bucket boundaries in a single pass. The float values are due to this. One option would be to do some post-processing to round the values.
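The post-processing suggested above could look something like the sketch below. It is not part of TFDV: the helper name round_counts_preserving_total is hypothetical, and it works on a plain list of floats that you would first read out of the buckets' sample_count fields. It assumes you want the rounded counts to add up to the rounded total, so it uses largest-remainder rounding rather than rounding each bucket independently.

```python
import math
from typing import List, Sequence


def round_counts_preserving_total(counts: Sequence[float]) -> List[int]:
    """Round approximate bucket counts to integers (hypothetical helper).

    The integer counts sum to round(sum(counts)), so the histogram total
    does not drift when each bucket is rounded.
    """
    target = round(sum(counts))
    floors = [math.floor(c) for c in counts]
    # Number of units still missing after flooring every bucket.
    shortfall = target - sum(floors)
    # Hand the missing units to the buckets with the largest fractional parts.
    order = sorted(
        range(len(counts)),
        key=lambda i: counts[i] - floors[i],
        reverse=True,
    )
    result = list(floors)
    for i in order[:shortfall]:
        result[i] += 1
    return result


# Example with the value from the report plus two made-up neighbours.
approx = [9.94000000834465, 10.06, 10.0]
print(round_counts_preserving_total(approx))
```

If preserving the total does not matter for your use case, calling round() on each sample_count is simpler. Either way the result is still derived from approximate counts, so it is not guaranteed to equal the true per-bucket count.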

kennysong · 2021-09-25T05:14:16Z

Got it, thanks for the explanation. Are there any error bounds on the approximate counts? (e.g., are they guaranteed to be within ±1 of the true count?)

arghyaganguly self-assigned this Jul 19, 2021

arghyaganguly added the type:support label Jul 19, 2021

arghyaganguly assigned caveness Jul 19, 2021

arghyaganguly added type:feature and removed type:support labels Jul 19, 2021

arghyaganguly removed their assignment Jul 19, 2021

arghyaganguly added the stat:awaiting tensorflower label Jul 20, 2021
