Improve validate processor #171

Open
pwalsh opened this issue Sep 19, 2021 · 3 comments

Comments

pwalsh (Contributor) commented Sep 19, 2021

DF.validate() does some basic checks, but it does not validate everything that Table Schema allows. In particular, it does not validate primary keys, and we have noted that this creates other currently untraced bugs (e.g. load a package with invalid primary keys and dump it again, and the dumped package will be incomplete).

We need to explore one of:

The problem with adopting Frictionless is that, as far as I know, it can't be adopted incrementally: the validation is built into the Resource class, and I can't tell just from reading the code where that leads (whether and how it complicates our code when we use different libraries for managing Frictionless Data specs). It also keeps state in memory (the data seen so far for primary keys and foreign keys); based on other patterns in Dataflows, I assume we would want to store that data outside the running Python process (e.g. using https://github.com/akariv/kvfile).
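To make the primary-key gap concrete, here is a minimal sketch of a row-level primary-key check of the kind a validate processor could run. The function name and shape are illustrative, not Dataflows API; it keeps seen keys in an in-memory set, which is exactly the state that a disk-backed store such as kvfile could replace for large datasets.

```python
def check_primary_key(rows, primary_key):
    """Yield rows unchanged, raising ValueError on a missing or duplicate primary key.

    `primary_key` is a list of field names, as in a Table Schema descriptor.
    """
    seen = set()  # in-memory; could be swapped for a disk-backed store like kvfile
    for i, row in enumerate(rows):
        key = tuple(row.get(name) for name in primary_key)
        if any(value is None for value in key):
            raise ValueError(f'row {i}: primary key field is null: {key}')
        if key in seen:
            raise ValueError(f'row {i}: duplicate primary key: {key}')
        seen.add(key)
        yield row
```

Because it is a generator that passes rows through, a check like this could sit in a pipeline without buffering the stream, failing fast at the first bad row.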

pwalsh (Contributor, Author) commented Sep 19, 2021

Currently known issues:

  1. Does not validate primary keys
  2. Does not validate foreign keys
  3. If field format is None (which is an invalid value according to the spec), it validates, but fails in dump_to_sql
  4. Does not validate field.constraints (e.g.: unique)
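For point (3), a schema-level sanity check could catch the bad descriptor before any dump step runs. This is a hedged sketch over the plain-dict form of a Table Schema descriptor; the helper name is illustrative.

```python
def check_field_formats(schema):
    """Return the names of fields whose 'format' is present but set to None.

    An explicit null format is invalid per the Table Schema spec, and per the
    issue above it currently slips through validation and fails in dump_to_sql.
    """
    return [
        field['name']
        for field in schema.get('fields', [])
        if 'format' in field and field['format'] is None
    ]
```

A validate processor could raise if this returns a non-empty list, turning the late dump_to_sql failure into an early, traceable error.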

akariv (Member) commented Sep 19, 2021 via email

pwalsh (Contributor, Author) commented Sep 20, 2021

Looks like the only data validation is done via tableschema.Field.cast_value:

```python
row[field.name] = field.cast_value(row.get(field.name))
```

As that only checks field values, points (1) and (2) in #171 (comment) are not covered. For point (3) I'm not sure what is going on; I will need to create a failing test. For point (4), cast_value has an unusual signature: when constraints is True (the default), it does not check constraints, so that is also an issue.

https://github.com/frictionlessdata/tableschema-py/blob/main/tableschema/field.py#L138
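Given the uncertainty around cast_value's constraint handling, point (4) could also be checked independently of tableschema. A minimal sketch of a unique-constraint check over a row stream; the constraints layout follows the Table Schema descriptor, but the helper itself is illustrative and, like the primary-key case, keeps seen values in memory.

```python
def check_unique(rows, schema):
    """Yield rows unchanged, raising ValueError when a field with a
    `unique` constraint in the schema descriptor repeats a value."""
    unique_fields = [
        field['name']
        for field in schema.get('fields', [])
        if field.get('constraints', {}).get('unique')
    ]
    seen = {name: set() for name in unique_fields}
    for i, row in enumerate(rows):
        for name in unique_fields:
            value = row.get(name)
            if value in seen[name]:
                raise ValueError(
                    f'row {i}: duplicate value for unique field {name!r}: {value!r}'
                )
            seen[name].add(value)
        yield row
```

The same pattern generalizes to the other value-level constraints (minimum, maximum, enum), each of which is a per-row predicate rather than cross-row state.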

These are all easily addressed, but I agree it may be a good motivator to explore moving this area to frictionless.
