TFDV - Slicing data based on date range #132

Open
srinivasaraov opened this issue Jul 9, 2020 · 9 comments

@srinivasaraov

In TensorFlow Data Validation, the method slicing_util.get_feature_value_slicer() slices data based on a feature value.

Is it possible to slice the data by date range using this method and then compare the sliced datasets?

Let's say I have 'n' records within the date range t1-t10. If I want to split the data into sets that fall in date ranges such as t1-t3, t4-t6 and t8-t10, is that possible with the above slicing method?

@rmothukuru

@srinivasaraov,
Can you please check the source code of get_feature_value_slicer along with the description of that function, and let us know if it helps. Thanks!

@srinivasaraov
Author

@rmothukuru: I see the following documentation in the source code:

Raises:
  TypeError: If feature values are not specified in an iterable.
  NotImplementedError: If a value of a type other than string or integer is specified in the values iterable in features.

So, I'm assuming specifying a date range is supported. Is that correct?

@brills
Contributor

brills commented Jul 10, 2020

What is the type of your date/timestamp feature?

I don't think the default slicer can slice by ranges, but you can implement your own slicer.
A slicer is just a function that takes a pa.RecordBatch and returns a List[Tuple[Text, pa.RecordBatch]], where the first element of each tuple is the slice key and the second element is a RecordBatch that contains only the rows corresponding to that slice key.

@brills
Contributor

brills commented Jul 10, 2020

By the way, we are looking into allowing SQL statements for slicing, which may be able to support your use case. However, there is no timeline yet.

@srinivasaraov
Author

Thanks @brills

The type of the date/timestamp feature is DATETIME.

Could you please point me to an example of a custom slicer implementation, if possible?

@brills
Contributor

brills commented Jul 14, 2020

Sorry, which DATETIME type did you mean? I don't think TFDV supports such types (only integral, floating-point and string/bytes features).
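
Given that only integral, floating-point and string/bytes features are supported, one possible workaround is to encode timestamps as integers before they reach TFDV. The sketch below is only an illustration under that assumption: it converts ISO 8601 strings to epoch seconds, treating values without an offset as UTC. The function name `to_epoch_seconds` is hypothetical.

```python
from datetime import datetime, timezone


def to_epoch_seconds(value: str) -> int:
    """Parse an ISO 8601 date or timestamp into integer epoch seconds.

    Assumption: values without a UTC offset are interpreted as UTC.
    """
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


# Range bounds expressed as integers can then be compared against the feature.
start = to_epoch_seconds("2020-07-09T00:00:00")
end = to_epoch_seconds("2020-07-10")
```

With the feature stored this way, a date range becomes a pair of integer bounds, which a custom slicer can compare against directly.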

Our feature value slicer is no different from any other custom slicer:

def feature_value_slicer(record_batch: pa.RecordBatch) -> Iterable[

@axeltidemann

Did you eventually implement this slicer yourself, @srinivasaraov?

@srinivasaraov
Author

@axeltidemann: Not yet. This has been deprioritised for us for now. I will post an update once I implement it.

@axeltidemann

Cool, I think it would be a very useful feature.
