Consider datapackage.json output option with csv data #215

kindly · 2022-04-28T20:05:12Z

The Tabular data package specification seems like a good fit for the CSV output spoonbill produces.

Providing a datapackage.json alongside the CSV files means that you could use the Frictionless Data tools to convert the output to whatever formats users may want. There are converters for all types of databases and ODS/BigQuery too.

Recently I have been working on a thing I am calling datapackage convert, which aims to be a small set of very fast versions (in Rust) of such conversions. Currently it only supports Parquet and SQLite.

Included in that is a way to merge datapackages together. This could be beneficial if spoonbill is doing conversions over multiple files and you want a way to combine them at the end. It can also be used for parallelism, i.e. splitting up a file so that multiple processes can work on the parts. This has the caveat that the outputs need to have a similar structure (i.e. one output cannot do a combine while the other one does not), but even so it may give a way of combining datasets from a variety of sources.
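To illustrate the merge idea and its caveat, here is a minimal Python sketch. It is not how datapackage convert works internally; the function name `merge_csv_resources` and the sample rows are made up for illustration. It assumes each worker has produced the same resource with an identical list of field names, and it refuses to merge otherwise.

```python
import csv
import io


def merge_csv_resources(parts):
    """Concatenate CSV outputs for one resource produced by several workers.

    `parts` is a list of (field_names, csv_text) tuples. This helper is
    hypothetical: it only shows why the outputs need the same structure.
    """
    expected_fields = parts[0][0]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(expected_fields)
    for fields, text in parts:
        if fields != expected_fields:
            # The caveat from the comment above: structures must match.
            raise ValueError(f"cannot merge: {fields} != {expected_fields}")
        reader = csv.reader(io.StringIO(text))
        next(reader)  # skip this part's header row
        writer.writerows(reader)
    return out.getvalue()


# Two parts of the same (made-up) table, as if written by two processes.
part_a = (["id", "title"], "id,title\n1,Road works\n")
part_b = (["id", "title"], "id,title\n2,School meals\n")
merged = merge_csv_resources([part_a, part_b])
print(merged)
```

In a real merge the field lists would come from each package's datapackage.json, which is what makes the structural check cheap.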

I think standardising on an intermediate format seems like a good way to collaborate on these tools, and datapackage is the best fit.

jpmckinney · 2022-04-28T22:48:58Z

Do many people really use Frictionless though?

For combining datasets, that requires datapackages to be ubiquitous, which seems to not be the case...

In any case, does Frictionless not include tools to auto-generate a datapackage from a CSV (or set of CSVs)?

kindly · 2022-04-29T00:36:34Z

I think the users just want to get from a to b and they do not care about datapackages. It is more about tool interoperability.

With flatterer I am basically moving to a model where all it does is produce a CSV datapackage, and I will be moving all conversions to other formats to external libraries. The only one left to move is the xlsx converter. In profiling, CSV writing is so fast that there is no major cost in always going through it.

If spoonbill produced datapackages too then it could benefit from the same converters. Or if it did not use them as libraries itself it would be trivial to write a wrapper around spoonbill that did or write documentation to explain how.

So if the user wanted the data from spoonbill to be in a SQLite database, uploaded to BigQuery, or in any other format, there would be an easy path.

Making a datapackage from just the raw CSVs is very limiting and would require redoing things that spoonbill has already done, whilst losing information. Mainly:

type guessing would need to be done again and would not have a schema to help.
primary key and foreign key columns would have to be guessed.
descriptions of columns, which can be used as comments on fields in databases or embedded in Avro files for BigQuery, would be lost.

So conversions to other formats would lose a lot of utility if they just inspected the CSVs. Also, those kinds of auto-detection routines are slow.
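As a rough sketch of what this could look like, the Python below builds a datapackage.json descriptor from metadata a flattening tool would already hold. The input structure (`EXAMPLE_TABLES`, `primary_key`, `foreign_keys`, `ref_fields`) and the table and column names are assumptions for illustration, not spoonbill's actual internals. The output keys (`fields`, `primaryKey`, `foreignKeys`, `reference`) follow the Table Schema specification.

```python
import json


def build_datapackage(tables):
    """Build a Tabular Data Package descriptor from known table metadata.

    `tables` uses a hypothetical in-memory layout; a real tool would map
    its own schema information onto the same descriptor keys.
    """
    resources = []
    for table in tables:
        schema = {
            "fields": [
                {
                    "name": field["name"],
                    "type": field["type"],  # known already, no guessing needed
                    "description": field.get("description", ""),
                }
                for field in table["fields"]
            ]
        }
        if table.get("primary_key"):
            schema["primaryKey"] = table["primary_key"]
        if table.get("foreign_keys"):
            schema["foreignKeys"] = [
                {
                    "fields": fk["fields"],
                    "reference": {
                        "resource": fk["resource"],
                        "fields": fk["ref_fields"],
                    },
                }
                for fk in table["foreign_keys"]
            ]
        resources.append(
            {
                "name": table["name"],
                "path": f"{table['name']}.csv",
                "profile": "tabular-data-resource",
                "schema": schema,
            }
        )
    return {"profile": "tabular-data-package", "resources": resources}


# Made-up example tables; names and columns are illustrative only.
EXAMPLE_TABLES = [
    {
        "name": "tenders",
        "fields": [
            {"name": "id", "type": "string", "description": "Tender identifier"},
            {"name": "value_amount", "type": "number", "description": "Tender value"},
        ],
        "primary_key": ["id"],
    },
    {
        "name": "tenders_items",
        "fields": [
            {"name": "id", "type": "string"},
            {"name": "tender_id", "type": "string"},
            {"name": "quantity", "type": "integer"},
        ],
        "primary_key": ["id"],
        "foreign_keys": [
            {"fields": ["tender_id"], "resource": "tenders", "ref_fields": ["id"]}
        ],
    },
]

package = build_datapackage(EXAMPLE_TABLES)
print(json.dumps(package, indent=2))
```

A converter reading this descriptor can create typed columns, keys, and column comments directly, without inspecting the CSV contents.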

jpmckinney · 2022-04-29T02:52:41Z

That makes sense in the context of the changes you're making to flatterer. I'll keep the issue open.

kindly changed the title ~~Consider datapackage.json output options with csv data~~ Consider datapackage.json output option with csv data Apr 28, 2022
