Consider datapackage.json output option with csv data #215
Comments
Do many people really use Frictionless though? For combining datasets, that requires datapackages to be ubiquitous, which seems to not be the case... In any case, does Frictionless not include tools to auto-generate a datapackage from a CSV (or set of CSVs)?
I think the users just want to get from A to B and they do not care about datapackages. It is more about tool interoperability. With flatterer I am basically moving to a model where all it does is produce a CSV datapackage, and I will be moving all conversions to other formats to external libraries. The only one left to move is the xlsx converter. In profiling, CSV writing is so fast that there is no major cost in always going through it. If spoonbill produced datapackages too, then it could benefit from the same converters. Or, if it did not use them as libraries itself, it would be trivial to write a wrapper around spoonbill that did, or to write documentation explaining how. So if the user wanted the data from spoonbill to be in a SQLite database, uploaded to BigQuery, or in any other format, there would be an easy path. Making a datapackage from just the raw CSVs is very limiting and would require redoing things that spoonbill has already done, whilst losing information. Mainly:
So conversions to other formats would lose a lot of utility if they relied on just inspecting the CSVs. Also, those kinds of auto-detection routines are slow.
That makes sense in the context of the changes you're making to flatterer. I'll keep the issue open.
The Tabular data package specification seems like a good fit for the CSV output spoonbill produces.
Giving a datapackage.json alongside the CSV files means that you could use the Frictionless Data tools to convert the output to the different formats users may want. There are converters for all types of databases, and for ODS and BigQuery too.
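To make the idea concrete, here is a minimal sketch of writing a datapackage.json next to a directory of CSV files, using only the Python standard library. It is not spoonbill's or flatterer's code: the function names and the `field_types` mapping are hypothetical, and the sketch assumes the producing tool already knows its column types and passes them in rather than inferring them from the CSVs. Columns with no known type fall back to `string`.

```python
import csv
import json
from pathlib import Path


def build_descriptor(csv_dir, field_types=None):
    """Build a minimal Tabular Data Package descriptor for the CSVs in csv_dir.

    field_types is a hypothetical mapping {csv file stem: {column: type}} that
    the producing tool is assumed to know already; nothing is inferred here.
    """
    field_types = field_types or {}
    resources = []
    for path in sorted(Path(csv_dir).glob("*.csv")):
        # Only the header row is read; the data rows are never scanned.
        with path.open(newline="", encoding="utf-8") as f:
            header = next(csv.reader(f), [])
        types = field_types.get(path.stem, {})
        resources.append({
            # Resource names are lowercased to fit the data package naming rules.
            "name": path.stem.lower(),
            "path": path.name,
            "profile": "tabular-data-resource",
            "schema": {
                "fields": [
                    {"name": column, "type": types.get(column, "string")}
                    for column in header
                ]
            },
        })
    return {"profile": "tabular-data-package", "resources": resources}


def write_descriptor(csv_dir, field_types=None):
    """Write datapackage.json into csv_dir and return its path."""
    descriptor = build_descriptor(csv_dir, field_types)
    out = Path(csv_dir) / "datapackage.json"
    out.write_text(json.dumps(descriptor, indent=2), encoding="utf-8")
    return out
```

Calling `write_descriptor("output/", {"tenders": {"value": "number"}})` (a made-up directory and table) would leave the CSVs untouched and add the descriptor beside them, which is all a downstream converter needs to find the tables and their types.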
Recently I have been working on a thing I am calling datapackage convert, which aims to be a small set of very fast versions (in Rust) of such conversions. Currently it only supports Parquet and SQLite.
Included in that is a way to merge datapackages together. This could be beneficial if spoonbill is doing conversions over multiple files and you want a way to combine them at the end. It can also be used for parallelism, i.e. splitting up a file so that multiple processes can work on the parts. This has the caveat that the outputs need to have a similar structure (i.e. one output cannot do a combine while the other one does not), but even so it may give a way of combining datasets from a variety of sources.
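As a rough illustration of what merging means here, the sketch below concatenates same-named CSV resources from several data package directories into one output package. It is a simplified stand-in written for this comment, not how datapackage convert works internally: `merge_packages` is a hypothetical name, and it assumes each input directory holds a datapackage.json with flat resource paths and CSVs that start with a header row. The structure caveat shows up as the check that a resource has the same field names in every package.

```python
import csv
import json
from pathlib import Path


def merge_packages(package_dirs, out_dir):
    """Concatenate same-named CSV resources from several data packages."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    merged = {}  # resource name -> descriptor taken from the first package seen

    for package_dir in map(Path, package_dirs):
        text = (package_dir / "datapackage.json").read_text(encoding="utf-8")
        for resource in json.loads(text)["resources"]:
            name = resource["name"]
            fields = [f["name"] for f in resource["schema"]["fields"]]
            first_time = name not in merged
            if first_time:
                merged[name] = resource
            elif fields != [f["name"] for f in merged[name]["schema"]["fields"]]:
                # Outputs with different structures cannot simply be appended.
                raise ValueError(f"resource {name!r} differs in {package_dir}")

            src_path = package_dir / resource["path"]
            dst_path = out_dir / merged[name]["path"]
            mode = "w" if first_time else "a"
            with src_path.open(newline="", encoding="utf-8") as src:
                with dst_path.open(mode, newline="", encoding="utf-8") as dst:
                    reader = csv.reader(src)
                    writer = csv.writer(dst)
                    header = next(reader, None)
                    # Keep the header only once, from the first package.
                    if first_time and header is not None:
                        writer.writerow(header)
                    writer.writerows(reader)

    result = {"profile": "tabular-data-package", "resources": list(merged.values())}
    (out_dir / "datapackage.json").write_text(json.dumps(result, indent=2), encoding="utf-8")
    return result
```

With this shape, each worker process could write its own package directory for one chunk of the input, and a final `merge_packages([...], "combined/")` call would stitch the chunks back together.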
I think standardising on an intermediate format seems like a good way to collaborate on these tools, and datapackage is the best fit.