Data ranges and presence #4

Open
theofpa opened this issue May 13, 2021 · 0 comments

theofpa commented May 13, 2021

Data ranges

For numerical feature types such as REAL, the schema could carry descriptive statistics (min, max, mean, standard deviation) to make it more expressive. This would let us:

  1. Validate data at inference time. For example, a transformer can validate the features of each received data point. When a feature falls outside the range defined by its min/max values, the transformer can log the error accordingly, for example by incrementing an outlier counter/metric.
  2. Compare the training data distribution against distributions calculated from batches of inference requests. For example, a distance measure based on KL divergence could drive a skew/drift detection counter/metric.
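
The range check in point 1 could look roughly like the sketch below. It is only an illustration: the `RealFeatureStats` class, its field names, and the sample values are all hypothetical and not part of any existing schema.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RealFeatureStats:
    """Hypothetical schema entry for a REAL feature (field names are assumptions)."""
    name: str
    min: float
    max: float
    mean: float
    std: float


def count_out_of_range(stats: RealFeatureStats, values: Iterable[float]) -> int:
    """Return how many values fall outside the closed range [stats.min, stats.max]."""
    return sum(1 for v in values if v < stats.min or v > stats.max)


# Made-up statistics and data points, for illustration only.
age = RealFeatureStats(name="age", min=18.0, max=90.0, mean=41.2, std=12.5)
outliers = count_out_of_range(age, [25.0, 17.0, 90.0, 120.0])
print(f"{age.name}: {outliers} value(s) out of range")
```

A transformer would add the returned count to whatever outlier counter/metric it exposes instead of printing it.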

Similarly, for categorical features, store the distribution of the values in the category_map.
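
As a rough sketch of how a stored category distribution could be used for drift detection, the snippet below computes the KL divergence of an inference batch from the training distribution. The function names, the smoothing constant `eps`, and the sample categories are assumptions made for illustration; a real check would also need a threshold chosen for the data at hand.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def batch_distribution(values: Iterable[str]) -> Dict[str, float]:
    """Turn a batch of observed categories into relative frequencies."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def kl_divergence(p: Dict[str, float], q: Dict[str, float], eps: float = 1e-9) -> float:
    """KL(p || q) over the union of categories.

    eps is added to every probability (then renormalized) so that categories
    missing from one side do not cause a division by zero or log(0).
    """
    keys = sorted(set(p) | set(q))
    p_s = [p.get(k, 0.0) + eps for k in keys]
    q_s = [q.get(k, 0.0) + eps for k in keys]
    p_total, q_total = sum(p_s), sum(q_s)
    return sum(
        (pi / p_total) * math.log((pi / p_total) / (qi / q_total))
        for pi, qi in zip(p_s, q_s)
    )


# Made-up training distribution, as it might be stored next to a category_map.
train = {"red": 0.5, "green": 0.3, "blue": 0.2}
batch = batch_distribution(["red"] * 9 + ["blue"])
drift = kl_divergence(batch, train)
print(f"KL(batch || train) = {drift:.4f}")
```

The direction KL(batch || train) is one possible choice; a symmetric measure such as Jensen-Shannon divergence would also fit the idea described above.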

Data presence

For all feature types, define an attribute that specifies whether a feature is mandatory for inference. For example, if a particular feature has no missing values at training time, we would most probably want to require it in the inference request. A transformer performing the data validation task can handle a missing mandatory feature as an error and increment an anomaly detection counter/metric.
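
A minimal sketch of such a presence check follows. The `required` attribute name and the dictionary layout of the schema are hypothetical; the issue does not prescribe how the attribute should be named or stored.

```python
from typing import Any, Dict, List


def missing_required(schema: Dict[str, Dict[str, Any]], request: Dict[str, Any]) -> List[str]:
    """Return the names of mandatory features that are absent or None in the request."""
    return [
        name
        for name, spec in schema.items()
        if spec.get("required", False) and request.get(name) is None
    ]


# Hypothetical schema: "required" marks features that had no missing values in training.
schema = {
    "age": {"type": "REAL", "required": True},
    "income": {"type": "REAL", "required": False},
}

missing = missing_required(schema, {"income": 52000.0})
print(f"missing mandatory features: {missing}")
```

The length of the returned list is what a transformer could add to its anomaly detection counter/metric.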
