metadata augmentation update functionality and foreseen functionality
DajanaSnopkova committed Sep 27, 2024
1 parent 412398a commit 4c347c4
Showing 1 changed file with 27 additions and 25 deletions.
52 changes: 27 additions & 25 deletions tech/docs/technical_components/metadata_augmentation.md
@@ -19,49 +19,32 @@ In this component scripting / NLP / LLM are used on a metadata record to augment
For the first SoilWise prototype, the functionality of the Metadata Augmentation component comprises:

- [Automatic metadata generation](#automatic-metadata-generation)
- [Spatial scope analyser](#spatial-scope-analyser)
- [Translation module](#translation-module)


### Automatic metadata generation

To generate metadata (data set and service metadata), activate the corresponding button(s) when setting up the theme for the transformation process. The steps are described [here](https://main.soilwise-documentation.pages.dev/technical_components/metadata_validation/#setting-up-a-transformation-process-in-haleconnect).

### Spatial scope analyser

A script that analyses the spatial scope of a resource.
### Translation module

The bounding box is matched to country bounding boxes.
Many records arrive in a local language. SWR translates the main properties of the record, title and abstract, into English to offer a single-language user experience. The translations are used in filtering and display of records.

This determines whether the dataset has a global, continental, national or regional scope.
The translation module builds on the EU translation service (API documentation at <https://language-tools.ec.europa.eu/>). Translations are stored in a database for reuse by the SWR.
The EU translation service responds to translation requests asynchronously, so translations may not yet be available right after new data is first loaded. A callback operation populates the database; from that moment the translation is available to SWR. The translation service uses 2-letter language codes, so 3-letter ISO codes (as used, for example, in iso19139:2007) must be converted to 2-letter codes. The EU translation service supports only a limited set of source-to-target language pairs and returns an error for any other pair.

- Retrieves all datasets (as iso19139 XML) from the database (records table joined with augmentations) which:
    - have a bounding box
    - have no spatial scope
    - are in iso19139 format
- For each record, compares the bounding box to country bounding boxes:
    - if bigger than continents > global
    - if it matches a continent > continental
    - if it matches a country > national
    - if smaller > regional
- The result is written as an augmentation in a dedicated table
The initial translation is triggered by a running harvester. The translations then become available once the record is ingested into the triplestore and catalogue database in a follow-up step of the harvester.
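
The translation flow described above (language-code conversion, a pending state, and a callback that fills the store) can be sketched as follows. This is a minimal illustration only: the `TranslationStore` class, its method names and the in-memory dictionary are hypothetical stand-ins for the SWR database and the EU translation service, and the code table is a small illustrative subset of ISO 639 codes.

```python
from typing import Dict, Optional, Tuple

# Illustrative subset of 3-letter (ISO 639-2) to 2-letter (ISO 639-1) codes.
ISO3_TO_ISO2: Dict[str, str] = {
    "eng": "en", "fre": "fr", "fra": "fr", "ger": "de", "deu": "de",
    "dut": "nl", "nld": "nl", "spa": "es", "ita": "it", "cze": "cs", "ces": "cs",
}


class TranslationStore:
    """Hypothetical in-memory stand-in for the translations database."""

    def __init__(self) -> None:
        # Key: (record id, field name); value: English text, or None while pending.
        self._rows: Dict[Tuple[str, str], Optional[str]] = {}

    def request(self, record_id: str, field: str, text: str, iso3_lang: str) -> Optional[str]:
        """Return the English text if available, else None (pending or unsupported)."""
        source = ISO3_TO_ISO2.get(iso3_lang.lower())
        if source is None:
            return None  # language code cannot be mapped to a 2-letter code
        if source == "en":
            return text  # already English, nothing to translate
        key = (record_id, field)
        if key not in self._rows:
            # A real implementation would submit the asynchronous request here.
            self._rows[key] = None
        return self._rows[key]

    def callback(self, record_id: str, field: str, translated: str) -> None:
        """Called when the asynchronous translation result arrives."""
        self._rows[(record_id, field)] = translated


store = TranslationStore()
first = store.request("rec-1", "title", "Bodemkaart", "dut")   # None: still pending
store.callback("rec-1", "title", "Soil map")
second = store.request("rec-1", "title", "Bodemkaart", "dut")  # "Soil map"
```

The first call returns `None` because the result has not arrived yet, which mirrors the situation where a translation is not available directly after the initial load.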

## Foreseen functionality

In the next iterations, the Metadata Augmentation component is foreseen to include the following additional functions:

- [Translation module](#translation-module)
- [Keyword matcher](#keyword-matcher)
- [Spatial Locator](#spatial-locator)
- [Spatial scope analyser](#spatial-scope-analyser)
- [EUSO-high-value dataset tagging](#euso-high-value-dataset-tagging)

### Translation module

Many records arrive in a local language. SWR translates the main properties of the record, title and abstract, into English to offer a single-language user experience. The translations are used in filtering and display of records.

The translation module builds on the EU translation service (API documentation at <https://language-tools.ec.europa.eu/>). Translations are stored in a database for reuse by the SWR.
The EU translation service responds to translation requests asynchronously, so translations may not yet be available right after new data is first loaded. A callback operation populates the database; from that moment the translation is available to SWR. The translation service uses 2-letter language codes, so 3-letter ISO codes (as used, for example, in iso19139:2007) must be converted to 2-letter codes. The EU translation service supports only a limited set of source-to-target language pairs and returns an error for any other pair.

The initial translation is triggered by a running harvester. The translations then become available once the record is ingested into the triplestore and catalogue database in a follow-up step of the harvester.


### Keyword matcher

@@ -81,6 +64,25 @@ For metadata records which have not been analysed yet (in that iteration), the m
Analyses existing keywords to find a relevant geography for the record. It then uses the [GeoNames](https://www.geonames.org/about.html){target=_blank} API to find spatial coordinates for the geography, which are inserted into the metadata record.
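
As a rough illustration of the lookup step, the sketch below builds a request URL for the GeoNames `searchJSON` endpoint and extracts the coordinates of the first hit. The helper names, the `demo` username and the sample payload are assumptions made for illustration; no network call is made, and how SWR selects the keyword or writes the coordinates back is not shown.

```python
from typing import Optional, Tuple
from urllib.parse import urlencode

GEONAMES_SEARCH = "http://api.geonames.org/searchJSON"


def build_search_url(place: str, username: str) -> str:
    """Build a GeoNames search URL that asks for the single best match."""
    query = urlencode({"q": place, "maxRows": 1, "username": username})
    return f"{GEONAMES_SEARCH}?{query}"


def first_coordinates(payload: dict) -> Optional[Tuple[float, float]]:
    """Return (lat, lon) of the first hit in a search response, or None."""
    hits = payload.get("geonames", [])
    if not hits:
        return None
    # GeoNames returns latitude and longitude as strings.
    return float(hits[0]["lat"]), float(hits[0]["lng"])


url = build_search_url("Wageningen", "demo")

# Hand-written sample payload in the shape of a search response (illustrative values).
sample = {"geonames": [{"name": "Wageningen", "lat": "51.97", "lng": "5.66667"}]}
coords = first_coordinates(sample)
```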


### Spatial scope analyser

A script that analyses the spatial scope of a resource. The bounding box is matched to country bounding boxes to determine whether the dataset has a global, continental, national or regional scope.

- Retrieves all datasets (as iso19139 XML) from the database (records table joined with augmentations) which:
    - have a bounding box
    - have no spatial scope
    - are in iso19139 format
- For each record, compares the bounding box to country bounding boxes:
    - if bigger than continents > global
    - if it matches a continent > continental
    - if it matches a country > national
    - if smaller > regional
- The result is written as an augmentation in a dedicated table
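
The comparison rules listed above could be sketched as below. The text does not define what "matches" means, so this sketch assumes an intersection-over-union ratio of at least 0.8, treats "smaller" as "fully inside a country box", and returns `None` when no rule applies. The reference boxes are rough illustrative values, not an authoritative dataset; areas are computed in plain degrees and boxes crossing the antimeridian are not handled.

```python
from typing import Dict, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (west, south, east, north) in degrees

# Rough, illustrative reference boxes (assumed values).
CONTINENTS: Dict[str, BBox] = {
    "Europe": (-25.0, 34.0, 45.0, 72.0),
    "Africa": (-18.0, -35.0, 52.0, 38.0),
}
COUNTRIES: Dict[str, BBox] = {
    "Netherlands": (3.3, 50.7, 7.3, 53.6),
    "France": (-5.2, 41.3, 9.6, 51.1),
}


def area(b: BBox) -> float:
    """Area of a box in square degrees."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def overlap_ratio(a: BBox, b: BBox) -> float:
    """Intersection over union of two boxes (0.0 when they do not overlap)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, width) * max(0.0, height)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def contains(outer: BBox, inner: BBox) -> bool:
    """True when inner lies completely within outer."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def classify_scope(bbox: BBox, threshold: float = 0.8) -> Optional[str]:
    """Apply the rules in order: global, continental, national, regional."""
    if all(area(bbox) > area(c) for c in CONTINENTS.values()):
        return "global"
    if any(overlap_ratio(bbox, c) >= threshold for c in CONTINENTS.values()):
        return "continental"
    if any(overlap_ratio(bbox, c) >= threshold for c in COUNTRIES.values()):
        return "national"
    if any(contains(c, bbox) for c in COUNTRIES.values()):
        return "regional"
    return None  # none of the listed rules applies


scopes = [
    classify_scope((-180.0, -90.0, 180.0, 90.0)),
    classify_scope((-24.0, 35.0, 44.0, 71.0)),
    classify_scope((3.4, 50.8, 7.2, 53.5)),
    classify_scope((4.7, 52.2, 5.1, 52.5)),
]
```

Checking the rules from largest to smallest scope keeps each test simple; a box that is larger than a country but matches no continent is left unclassified here, since the listed rules do not cover that case.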

### EUSO-high-value dataset tagging

The EUSO high-value datasets are those with substantial potential to assess soil health status, as detailed on the [EUSO dashboard](https://esdac.jrc.ec.europa.eu/esdacviewer/euso-dashboard/){target=_blank}. This framework includes the concept of [soil degradation indicator](https://esdac.jrc.ec.europa.eu/content/soil-degradation-indicators-eu){target=_blank} metadata-based identification and tagging. Each dataset (possibly only those with a supra-national spatial scope; this is under discussion) will be annotated with the potential soil degradation indicator for which it might be used. Users can then filter these datasets according to their specific needs.
