diff --git a/README.md b/README.md
index 258f99a..2b53811 100644
--- a/README.md
+++ b/README.md
@@ -4,8 +4,20 @@ LDWorkbench is a Linked Data Transformation tool designed to use only SPARQL as
 
 This project is currently in a Proof of Concept phase, feel free to watch our progress, but please do not use this project in a production setup.
 
+## How an LD Workbench pipeline works
+
+A *pipeline* is the set of instructions that are run to transform Linked Data. It consists of *stages* with *iterators* and *generators*.
+
+The idea of this project is to use SPARQL `select` to create an iterator of IRIs (defined by the *binding* `$this`) from an endpoint or local RDF file. This makes it possible to go over huge datasets by paginating results using the SPARQL `offset` and `limit` parameters. Each yield of `$this` is then used as input for a SPARQL `construct` query that will be [pre-bound](https://www.w3.org/TR/shacl/#pre-binding) with `$this`. The generator creates RDF statements that will be part of the end result of the workbench pipeline.
+
+Each pipeline consists of 1 or more *stages*, where a *stage* is the combination of 1 iterator and 1 generator (more than 1 generator will be implemented later).
+
+A workbench pipeline is defined by a configuration file, stored in [YAML](https://yaml.org). The configuration is validated using a [JSON Schema](https://json-schema.org). The schema [is part of this repository](https://github.com/netwerk-digitaal-erfgoed/ld-workbench/blob/main/static/ld-workbench.schema.json). The easiest way to work with YAML files and JSON Schemas is to use Microsoft's [Visual Studio Code](https://code.visualstudio.com). If you follow the installation instructions and use the `--init` script, your workbench project will contain the correct settings to work with YAML files and JSON Schemas without any extra settings.
+
+A pipeline must have a `name`, 1 or more `stages` and optionally a `description`. If you have multiple pipelines, each pipeline must have a unique name. See the [example configuration file](https://github.com/netwerk-digitaal-erfgoed/ld-workbench/blob/main/static/example/config.yml) for a boilerplate configuration file. A visualisation of the schema giving more insight into required and optional properties can be [found here](https://json-schema.app/view/%23?url=https%3A%2F%2Fraw.githubusercontent.com%2Fnetwerk-digitaal-erfgoed%2Fld-workbench%2Fmain%2Fstatic%2Fld-workbench.schema.json).
+
 ## Install & Usage
 
-The quickest way to get started with LDWorkbench is follow these instruction:
+The quickest way to get started with LDWorkbench is to follow these instructions:
 ```bash
 mkdir ldworkbench
@@ -20,8 +32,20 @@ Your workbench is now ready for use. An example workbench is provided, run it wi
 npx ldworkbench
 ```
 
-### Configuring a workbench project
+### Configuring a workbench pipeline
 
+To keep your workbench workspace clean, we recommend creating a folder for each pipeline that contains the configuration and the SPARQL select and construct queries. The application uses the folder `pipelines/configurations` by default to look for YAML configurations of pipelines, so it is best to save your configurations there.
+
+An example folder and file structure for a pipeline might look like this:
+```
+your-working-dir
+|-- pipelines
+|   |-- configurations
+|   |   |-- my-pipeline
+|   |   |   |-- configuration.yaml
+|   |   |   |-- select.rq
+|   |   |   |-- construct.rq
+```
 ## Development
 
 For local development, these script should get you going: