Consider datapackage.json output option with csv data #215

Open
kindly opened this issue Apr 28, 2022 · 3 comments

Comments

@kindly

kindly commented Apr 28, 2022

The Tabular Data Package specification seems like a good fit for the CSV output spoonbill produces.

Giving a datapackage.json alongside the CSV files means that you could use the Frictionless Data tools to convert the output to the different formats users may want. There are converters for all types of databases and for ODS/BigQuery too.

Recently I have been working on a thing I am calling datapackage convert, which aims to be a small set of very fast versions (in Rust) of such conversions. Currently it only supports Parquet and SQLite.

Included in that is a way to merge datapackages together. This could be beneficial if spoonbill is doing conversions over multiple files and you want a way to combine them at the end. It can also be used for parallelism, i.e. splitting up a file so that multiple processes can work on the parts. This has the caveat that the outputs need to have a similar structure (i.e. one output cannot do a combine while the other one does not), but even so it may give a way of combining datasets from a variety of sources.
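
To make the "similar structure" caveat concrete, here is a minimal sketch of what merging two packages could look like. It is not how datapackage convert works internally. It assumes that resources with the same name are concatenated only when their field lists match exactly, and it keeps rows inline under the `data` property rather than in CSV files. The function name `merge_packages` and the sample table are hypothetical.

```python
from copy import deepcopy


def merge_packages(packages):
    """Concatenate same-named resources, refusing mismatched schemas."""
    merged = {"profile": "tabular-data-package", "resources": []}
    by_name = {}
    for package in packages:
        for resource in package["resources"]:
            existing = by_name.get(resource["name"])
            if existing is None:
                # First time this table is seen: keep an independent copy.
                first = deepcopy(resource)
                by_name[resource["name"]] = first
                merged["resources"].append(first)
                continue
            # Assumed rule: field lists must be identical to merge.
            if existing["schema"]["fields"] != resource["schema"]["fields"]:
                raise ValueError(f"schema mismatch for {resource['name']!r}")
            existing["data"].extend(deepcopy(resource["data"]))
    return merged


# Two hypothetical outputs, e.g. from two halves of one input file.
fields = [{"name": "id", "type": "string"}]
part_a = {"resources": [{"name": "tenders", "schema": {"fields": fields},
                         "data": [{"id": "t-1"}]}]}
part_b = {"resources": [{"name": "tenders", "schema": {"fields": fields},
                         "data": [{"id": "t-2"}]}]}

merged = merge_packages([part_a, part_b])
print(len(merged["resources"][0]["data"]))
```

A real merge would also have to decide what to do with fields that appear in only one output; the sketch simply rejects that case.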

Standardising on an intermediate format seems like a good way to collaborate on these tools, and I think datapackage is the best fit.

kindly changed the title from "Consider datapackage.json output options with csv data" to "Consider datapackage.json output option with csv data" on Apr 28, 2022

@jpmckinney
Member

Do many people really use Frictionless though?

Combining datasets that way requires datapackages to be ubiquitous, which doesn't seem to be the case...

In any case, does Frictionless not include tools to auto-generate a datapackage from a CSV (or set of CSVs)?

@kindly
Author

kindly commented Apr 29, 2022

I think the users just want to get from A to B and do not care about datapackages. It is more about tool interoperability.

With flatterer I am basically moving to a model where all it does is produce a CSV datapackage, and I will be moving all conversions to other formats to external libraries. The only one left to move is the xlsx converter. In profiling, CSV writing is so fast that there is no major cost to always going through it.

If spoonbill produced datapackages too, then it could benefit from the same converters. Or, if it did not use them as libraries itself, it would be trivial to write a wrapper around spoonbill that did, or to write documentation explaining how.

So if the user wanted the data from spoonbill to be in a SQLite database, uploaded to BigQuery, or in any other format, there would be an easy path.
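
As a rough illustration of that path, the sketch below reads a datapackage.json and loads each CSV resource into SQLite using only the Python standard library. It is not the converter from flatterer or datapackage convert. The `load_datapackage` helper, the type mapping, and the sample `tenders` table are assumptions made for the example.

```python
import csv
import json
import sqlite3
import tempfile
from pathlib import Path

# Assumed mapping from Table Schema field types to SQLite column types;
# anything not listed here falls back to TEXT.
SQLITE_TYPES = {"integer": "INTEGER", "number": "REAL", "boolean": "INTEGER"}


def load_datapackage(package_dir: Path, conn: sqlite3.Connection) -> None:
    """Create one SQLite table per resource and copy the CSV rows into it."""
    descriptor = json.loads((package_dir / "datapackage.json").read_text())
    for resource in descriptor["resources"]:
        fields = resource["schema"]["fields"]
        columns = ", ".join(
            f'"{f["name"]}" {SQLITE_TYPES.get(f.get("type", "string"), "TEXT")}'
            for f in fields
        )
        conn.execute(f'CREATE TABLE "{resource["name"]}" ({columns})')
        placeholders = ", ".join("?" for _ in fields)
        with open(package_dir / resource["path"], newline="") as handle:
            reader = csv.reader(handle)
            next(reader)  # skip the header row
            conn.executemany(
                f'INSERT INTO "{resource["name"]}" VALUES ({placeholders})',
                reader,
            )
    conn.commit()


# Build a tiny hypothetical package in a temporary directory and load it.
with tempfile.TemporaryDirectory() as tmp:
    package_dir = Path(tmp)
    (package_dir / "tenders.csv").write_text("id,value_amount\nt-1,10.5\nt-2,20\n")
    (package_dir / "datapackage.json").write_text(json.dumps({
        "resources": [{
            "name": "tenders",
            "path": "tenders.csv",
            "schema": {"fields": [
                {"name": "id", "type": "string"},
                {"name": "value_amount", "type": "number"},
            ]},
        }]
    }))
    conn = sqlite3.connect(":memory:")
    load_datapackage(package_dir, conn)
    rows = conn.execute("SELECT id, value_amount FROM tenders ORDER BY id").fetchall()
    print(rows)
```

Because the column types come from the descriptor, no type guessing over the CSV contents is needed here.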

Making a datapackage from just the raw CSVs is very limiting and would require redoing things that spoonbill has already done, whilst losing information. Mainly:

  • type guessing would need to be done again, without a schema to help.
  • primary key and foreign key columns would have to be guessed.
  • descriptions of columns, which can be used as comments on fields in databases or embedded in Avro files for BigQuery, would be lost.

So conversions to other formats would lose a lot of utility if just inspecting the CSVs. Also, those kinds of auto-detection routines are slow.
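
For illustration, a descriptor that carries all three kinds of information might look roughly like the one built below. The property names (`fields`, `type`, `description`, `primaryKey`, `foreignKeys`) come from the Table Schema specification; the table names, column names, and descriptions are hypothetical and are not taken from spoonbill's actual output.

```python
import json

# Hypothetical two-table package: a parent table and a child table.
descriptor = {
    "profile": "tabular-data-package",
    "resources": [
        {
            "name": "tenders",
            "path": "tenders.csv",
            "profile": "tabular-data-resource",
            "schema": {
                "fields": [
                    {"name": "id", "type": "string",
                     "description": "Tender identifier"},
                    {"name": "value_amount", "type": "number",
                     "description": "Estimated value of the tender"},
                ],
                "primaryKey": "id",
            },
        },
        {
            "name": "tenders_items",
            "path": "tenders_items.csv",
            "profile": "tabular-data-resource",
            "schema": {
                "fields": [
                    {"name": "id", "type": "string",
                     "description": "Item identifier"},
                    {"name": "tender_id", "type": "string",
                     "description": "Tender this item belongs to"},
                    {"name": "quantity", "type": "integer",
                     "description": "Number of units"},
                ],
                "primaryKey": "id",
                # Links each item row back to its parent tender row.
                "foreignKeys": [
                    {
                        "fields": "tender_id",
                        "reference": {"resource": "tenders", "fields": "id"},
                    }
                ],
            },
        },
    ],
}

# This text is what would be written to datapackage.json next to the CSVs.
print(json.dumps(descriptor, indent=2))
```

A converter reading this file gets types, keys, and column comments directly, with nothing to infer from the CSV contents.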

@jpmckinney
Member

That makes sense in the context of the changes you're making to flatterer. I'll keep the issue open.
