docs: Improve (#81)
ddeboer authored Jun 12, 2024
1 parent d6c47b8 commit 86e6602
Showing 2 changed files with 87 additions and 19 deletions.
95 changes: 82 additions & 13 deletions README.md
@@ -1,24 +1,36 @@
# LD Workbench

LD Workbench is a command-line tool for transforming large RDF datasets using pure SPARQL.
LD Workbench is a command-line tool for transforming large RDF datasets using pure [SPARQL](https://www.w3.org/TR/sparql11-query/).

This project is currently in a Proof-of-Concept phase.
> [!NOTE]
> Although LD Workbench is stable, we consider it a proof of concept.
> Please use the software and report any [issues](https://github.com/netwerk-digitaal-erfgoed/ld-workbench/issues) you encounter.
## Approach

The main design principles are scalability and extensibility.
### Components

Users define LD Workbench **pipelines**. An LD Workbench pipeline reads data from SPARQL endpoints,
transforms it using SPARQL queries, and writes the result to a file or triple store.

A pipeline consists of one or more **stages**. Each stage has:

### Scalability
- an **iterator**, which selects URIs from a dataset using a paginated SPARQL SELECT query,
binding each URI to a `$this` variable
- one or more **generators**, which generate triples about each URI using SPARQL CONSTRUCT queries.

LD Workbench is **scalable** due to its iterator/generator approach:
Stages can be chained together, with the output of one stage becoming the input of the next.

* the **iterator** component fetches URIs using a SPARQL SELECT query, paginating results using SPARQL `OFFSET` and `LIMIT` (binding each URI to a `$this` variable)
* the **generator** component then runs a SPARQL CONSTRUCT query for each URI ([pre-binding](https://www.w3.org/TR/shacl/#pre-binding) `$this` to the URI), which returns the transformed result.
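
The iterator/generator split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not LD Workbench's actual implementation: `iterate_uris`, `run_stage`, `run_select` and `fake_select` are hypothetical names, and an in-memory list stands in for a real SPARQL endpoint.

```python
from typing import Callable, Iterator, List, Tuple

Triple = Tuple[str, str, str]


def iterate_uris(
    select_query: str,
    run_select: Callable[[str], List[str]],
    batch_size: int = 10,
) -> Iterator[str]:
    """Yield URIs page by page by appending LIMIT/OFFSET to a SELECT query."""
    offset = 0
    while True:
        page = run_select(f"{select_query} LIMIT {batch_size} OFFSET {offset}")
        yield from page
        if len(page) < batch_size:  # a short page means the results are exhausted
            return
        offset += batch_size


def run_stage(
    select_query: str,
    generators: List[Callable[[str], List[Triple]]],
    run_select: Callable[[str], List[str]],
    batch_size: int = 10,
) -> List[Triple]:
    """Run every generator for every URI from the iterator and collect the triples."""
    triples: List[Triple] = []
    for uri in iterate_uris(select_query, run_select, batch_size):
        for generate in generators:
            triples.extend(generate(uri))
    return triples


# Hypothetical stand-in for an endpoint holding 25 URIs.
URIS = [f"https://example.org/thing/{i}" for i in range(25)]


def fake_select(query: str) -> List[str]:
    # Reads the LIMIT and OFFSET values off the tail of the query string.
    parts = query.split()
    limit, offset = int(parts[-3]), int(parts[-1])
    return URIS[offset:offset + limit]


def as_creative_work(uri: str) -> List[Triple]:
    # Plays the role of a CONSTRUCT query with `$this` bound to `uri`.
    return [(uri, "a", "https://schema.org/CreativeWork")]


result = run_stage(
    "SELECT $this WHERE { $this a <https://schema.org/Thing> }",
    [as_creative_work],
    fake_select,
)
```

Because selection and generation are separate callables, the same page loop can serve any number of generators without holding the full result set of the SELECT query in memory.
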
### Design principles

The main design principles are scalability and extensibility.

### Extensible
LD Workbench is **scalable** due to its iterator/generator approach,
which separates the selection of URIs from the generation of triples.

LD Workbench is **extensible** because it uses pure SPARQL queries (instead of code) for configuring transformation pipelines.
Each pipeline is a sequence of stages; each stage consists of an iterator and generator.
LD Workbench is **extensible** because it uses pure SPARQL queries (instead of code or a DSL) for configuring transformation pipelines.
The [SPARQL query language](https://www.w3.org/TR/sparql11-query/) is a widely supported W3C standard,
so users will not be locked into a proprietary tool or technology.

## Usage

@@ -43,18 +55,75 @@ Your workbench is now ready for use. You can continue by creating your own pipel

An LD Workbench pipeline is defined with a YAML configuration file, validated by a [JSON Schema](https://json-schema.app/view/%23?url=https%3A%2F%2Fraw.githubusercontent.com%2Fnetwerk-digitaal-erfgoed%2Fld-workbench%2Fmain%2Fstatic%2Fld-workbench.schema.json).

A pipeline must have a name, one or more stages, and optionally a description. Multiple pipelines can be configured as long as they have unique names. See the [example configuration file](https://github.com/netwerk-digitaal-erfgoed/ld-workbench/blob/main/static/example/config.yml) for a boilerplate configuration file.
A pipeline must have a name, one or more stages, and optionally a description. Multiple pipelines can be configured as long as they have unique names.
See the [example configuration file](https://github.com/netwerk-digitaal-erfgoed/ld-workbench/blob/main/static/example/config.yml) for a boilerplate configuration file.
You can find more examples in the [ld-workbench-configuration](https://github.com/netwerk-digitaal-erfgoed/ld-workbench-configuration) repository.
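
The rules stated above (a name, at least one stage, unique pipeline names) can be sketched as a small check. This is an illustration only and does not replace the JSON Schema validation: `check_pipelines` is a hypothetical helper, and plain dictionaries stand in for parsed YAML files.

```python
from typing import Dict, List


def check_pipelines(pipelines: List[Dict]) -> List[str]:
    """Return human-readable problems; an empty list means the basic rules hold."""
    problems: List[str] = []
    seen = set()
    for index, pipeline in enumerate(pipelines):
        name = pipeline.get("name")
        label = name or f"#{index}"
        if not name:
            problems.append(f"pipeline {label} has no name")
        elif name in seen:
            problems.append(f"duplicate pipeline name: {name}")
        else:
            seen.add(name)
        if not pipeline.get("stages"):  # missing key or empty list
            problems.append(f"pipeline {label} has no stages")
    return problems
```
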

#### Iterator

Each stage has a single iterator. The iterator SPARQL SELECT query must return a `$this` binding for each URI that will be passed to the generator(s).

The query can be specified either inline:

```yaml
# config.yml
stages:
  - name: Stage1
    iterator:
      query: "SELECT $this WHERE { $this a <https://schema.org/Thing> }"
```

or by referencing a file:

```yaml
# config.yml
stages:
  - name: Stage1
    iterator:
      query: file://iterator.rq
```

```sparql
# iterator.rq
prefix schema: <https://schema.org/>
select $this where {
$this a schema:Thing .
}
```
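
How a `query` value might be resolved, whether given inline or as a `file://` reference, can be sketched as follows. This is a sketch under assumptions, not the tool's own code: `resolve_query` is a hypothetical helper, and it assumes that `file://` paths are resolved relative to a base directory, as the `baseDir` option in the schema suggests.

```python
from pathlib import Path


def resolve_query(value: str, base_dir: Path) -> str:
    """Return the SPARQL text for an inline query or a `file://` reference."""
    prefix = "file://"
    if value.startswith(prefix):
        # Assumption: the path after `file://` is relative to `base_dir`.
        return (base_dir / value[len(prefix):]).read_text(encoding="utf-8")
    return value  # anything else is treated as an inline query
```

For example, `resolve_query("file://iterator.rq", Path("."))` would read `iterator.rq` from the current directory, while an inline `SELECT` string is returned unchanged.
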
> [!TIP]
> LD Workbench paginates iterator queries (using SPARQL `LIMIT/OFFSET`) to support large datasets.
> However, a large `OFFSET` can be slow on SPARQL endpoints.
> Therefore, prefer creating multiple stages to process subsets (for example each RDF type separately) over processing the entire dataset in a single stage.
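
The advice in the tip, one stage per RDF type, can be sketched by generating the stage definitions programmatically. This is an illustrative sketch: `stages_per_type` is a hypothetical helper, and the keys `name`, `iterator`, `generator` and `query` mirror the YAML examples in this README.

```python
from typing import Dict, List


def stages_per_type(type_uris: List[str], generator_query: str) -> List[Dict]:
    """Build one stage per RDF type so that each iterator pages over a smaller result set."""
    stages = []
    for index, type_uri in enumerate(type_uris, start=1):
        stages.append({
            "name": f"Stage{index}",
            "iterator": {
                # Doubled braces produce literal `{` and `}` in the f-string.
                "query": f"SELECT $this WHERE {{ $this a <{type_uri}> }}",
            },
            "generator": [{"query": generator_query}],
        })
    return stages


stages = stages_per_type(
    ["https://schema.org/Book", "https://schema.org/Person"],
    "CONSTRUCT { $this ?p ?o } WHERE { $this ?p ?o }",
)
```

Each resulting stage restarts its own `OFFSET` at zero, so no single query has to skip past the whole dataset.
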

#### Generator

A stage has one or more generators, which are run for each individual URI from the iterator.
A SPARQL CONSTRUCT query takes a `$this` binding from the iterator and generates triples about it.
Just as with the iterator query, the query can be specified either inline or by referencing a file:

```yaml
# config.yml
stages:
  - name: Stage1
    generator:
      - query: "CONSTRUCT { $this a <https://schema.org/CreativeWork> } WHERE { $this a <https://schema.org/Book> }"
```
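
To make the `$this` binding concrete, here is a deliberately naive sketch that substitutes a URI for the variable in a CONSTRUCT query. It is an assumption-laden illustration, not how LD Workbench binds values: `bind_this` is a hypothetical helper, and plain text substitution would also rewrite `$this` inside string literals, which a real implementation would need to avoid.

```python
import re


def bind_this(construct_query: str, uri: str) -> str:
    """Replace each `$this` (or the equivalent `?this`) variable with the URI as an IRI."""
    # A function is used as the replacement so the URI is inserted literally.
    return re.sub(r"[$?]this\b", lambda _match: f"<{uri}>", construct_query)


bound = bind_this(
    "CONSTRUCT { $this a <https://schema.org/CreativeWork> } "
    "WHERE { $this a <https://schema.org/Book> }",
    "https://example.org/book/1",
)
```
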
#### Example YAML File For Configuration Options
#### Example configuration
```yaml
# config.yml
name: MyPipeline
description: Example pipeline configuration
destination: output/result.ttl
stages:
  - name: Stage1
    iterator:
      query: "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 100"
      query: "SELECT $this WHERE { $this a <https://schema.org/Thing> }"
      endpoint: "http://example.com/sparql-endpoint"
    generator:
      - query: "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }"
```
11 changes: 5 additions & 6 deletions static/ld-workbench.schema.json
@@ -16,8 +16,7 @@
},
"baseDir": {
"type": "string",
"description": "An optional base directory for files referenced by `file://...` paths.",
"default": "The directory that contains the YAML config file."
"description": "An optional base directory for files referenced by `file://...` paths. This defaults to the directory that contains the YAML configuration file."
},
"destination": {
"type": "string",
@@ -51,8 +50,7 @@
"batchSize": {
"type": "number",
"minimum": 1,
"description": "Number of `$this` bindings retrieved per query.",
"default": "The LIMIT value of your iterator query or 10 if no LIMIT is present."
"description": "Number of `$this` bindings retrieved per query. Defaults to the LIMIT value of your iterator query or 10 if no LIMIT is present."
},
"delay": {
"type": "string",
@@ -74,12 +72,13 @@
},
"endpoint": {
"type": "string",
"description": "The SPARQL endpoint for the generator. \nIf it starts with \"file://\", a local RDF file is queried.\nIf ommmitted the endpoint of the Iterator is used."
"description": "The SPARQL endpoint for the generator. If it starts with `file://`, a local RDF file is queried. If omitted, the endpoint of the iterator is used."
},
"batchSize": {
"type": "number",
"minimum": 1,
"description": "Overrule the generator's behaviour of fetching results for 10 bindings of $this per request."
"description": "Overrule the generator's behaviour of fetching results for 10 bindings of `$this` per request.",
"default": 10
}
}
}
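
The iterator `batchSize` default described in the schema above (the `LIMIT` value of the iterator query, or 10 if no `LIMIT` is present) can be sketched as follows. This is an illustration only: `effective_batch_size` is a hypothetical helper, and the regular expression simply takes the first `LIMIT` it finds, which a real SPARQL parser would handle more carefully (for example with subqueries).

```python
import re
from typing import Optional


def effective_batch_size(iterator_query: str, batch_size: Optional[int] = None) -> int:
    """Pick the page size: explicit setting first, then the query's LIMIT, then 10."""
    if batch_size is not None:
        return batch_size
    match = re.search(r"\bLIMIT\s+(\d+)", iterator_query, flags=re.IGNORECASE)
    return int(match.group(1)) if match else 10
```
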