Integrating with EIDF
A key goal of the overarching battery data archive project is to make data available for large-scale analysis. This will be enabled by storing data within Edinburgh International Data Facility systems and providing Galvanalyser data provision services within the same systems.
There are three key components to EIDF integration:
- Sending data to EIDF
- Processing data within EIDF
- Providing access to data stored in EIDF systems
Galvanalyser will be responsible for aspects of all three of these components.
Requirements
Sending data to EIDF
Galvanalyser should:
- monitor battery data directories
- extract file and dataset metadata
- allow metadata to be checked and edited by authorised users
- send raw files and metadata to third parties (i.e. the EIDF's CKAN ingestion service), as sketched below
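As a concrete illustration of that last step, the sketch below pushes a harvested file and its metadata to a CKAN instance using CKAN's standard package_create and resource_create actions. The EIDF endpoint URL, API key handling and metadata fields are assumptions for illustration, not the confirmed EIDF ingestion interface.

```python
"""Sketch of the upload step using CKAN's standard package_create and
resource_create actions. The EIDF URL, API key handling and metadata
fields are assumptions for illustration."""
import requests

CKAN_URL = "https://eidf.example/ckan"  # placeholder EIDF CKAN endpoint
API_KEY = "..."                         # authorised user's CKAN API token


def send_to_ckan(file_path: str, metadata: dict) -> dict:
    """Create a dataset carrying the harvested metadata, then attach the
    raw file to it as a resource."""
    headers = {"Authorization": API_KEY}

    # Register the dataset and its checked/edited metadata as CKAN extras.
    dataset = requests.post(
        f"{CKAN_URL}/api/3/action/package_create",
        json={
            "name": metadata["dataset_name"],  # assumed metadata field
            "extras": [{"key": k, "value": str(v)}
                       for k, v in metadata.items()],
        },
        headers=headers,
    ).json()["result"]

    # Upload the raw file itself, linked to the dataset just created.
    with open(file_path, "rb") as f:
        return requests.post(
            f"{CKAN_URL}/api/3/action/resource_create",
            data={"package_id": dataset["id"]},
            files={"upload": f},
            headers=headers,
        ).json()["result"]
```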
Processing data within EIDF
Galvanalyser should:
- search for unprocessed uploaded files on CKAN
- 'check out' a file by registering it in a database (see the sketch after this list)
- process a file to extract its data into the generic Galvanalyser format
- send processed data to CKAN, linked to the raw data as a related record
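A minimal sketch of this checkout-and-process loop follows. CKAN's package_search action is real, but the 'processed' flag, the checkout table and process_file are hypothetical names for this sketch, not existing Galvanalyser or EIDF interfaces.

```python
"""Illustrative checkout-and-process loop. CKAN's package_search action is
real, but the 'processed' flag, the checkout table and process_file are
hypothetical names used for this sketch only."""
import psycopg2
from psycopg2 import errors as pg_errors
import requests

CKAN_URL = "https://eidf.example/ckan"          # placeholder EIDF endpoint
conn = psycopg2.connect("dbname=galvanalyser")  # placeholder DSN


def find_unprocessed() -> list[dict]:
    """Ask CKAN for datasets not yet flagged as processed (the flag is an
    assumed convention, not a built-in CKAN field)."""
    resp = requests.get(
        f"{CKAN_URL}/api/3/action/package_search",
        params={"fq": "-processed:true"},
    ).json()
    return resp["result"]["results"]


def checkout(ckan_id: str) -> bool:
    """Claim a file by inserting it into a checkout table; a UNIQUE
    constraint on ckan_id guarantees only one worker claims each file."""
    try:
        with conn, conn.cursor() as cur:
            cur.execute("INSERT INTO checkout (ckan_id) VALUES (%s)",
                        (ckan_id,))
        return True
    except pg_errors.UniqueViolation:
        return False


def process_file(dataset: dict) -> None:
    """Placeholder for the real work: download the raw file, extract it
    into the generic Galvanalyser format, write it to the EIDF database,
    then push the processed record back to CKAN linked to the raw data."""
    ...


if __name__ == "__main__":
    for ds in find_unprocessed():
        if checkout(ds["id"]):
            process_file(ds)
```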
Providing access to data
Galvanalyser should:
- allow users to log in via the CKAN authorisation service (see the sketch below)
- provide a data overview to authorised users via a web interface
- provide data to authorised users via a Python API
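One way to satisfy these requirements together is a thin read-only service that delegates authorisation to CKAN by forwarding the caller's API token. In the sketch below, the endpoint path, the database schema and the use of package_show as the authorisation check are all assumptions, not the real Galvanalyser API.

```python
"""Sketch of a read-only access endpoint that delegates authorisation to
CKAN by forwarding the caller's API token. Endpoint path, table and column
names are illustrative, not the real Galvanalyser schema."""
from flask import Flask, abort, jsonify, request
import psycopg2
import requests

CKAN_URL = "https://eidf.example/ckan"          # placeholder
app = Flask(__name__)
conn = psycopg2.connect("dbname=galvanalyser")  # placeholder DSN


@app.get("/datasets/<ckan_id>")
def dataset(ckan_id: str):
    # Authorisation check: proceed only if CKAN says the caller's token
    # is allowed to see this dataset.
    resp = requests.get(
        f"{CKAN_URL}/api/3/action/package_show",
        params={"id": ckan_id},
        headers={"Authorization": request.headers.get("Authorization", "")},
    )
    if resp.status_code != 200:
        abort(403)

    # Read-only query against the EIDF Galvanalyser database.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT name, sample_count FROM datasets WHERE ckan_id = %s",
            (ckan_id,),
        )
        row = cur.fetchone()
    if row is None:
        abort(404)
    return jsonify({"name": row[0], "samples": row[1]})


if __name__ == "__main__":
    app.run()
```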
Restructure
To perform the various roles it is called upon to play in the EIDF-integrated setup, Galvanalyser will have to be restructured.
- The current implementation of Galvanalyser is suitable for lab environments where tens of directories are being monitored. For sending data to the EIDF, this complete version of Galvanalyser is more than sufficient; its data processing functionality is optional (and will in fact slow the pipeline down, because datasets are not available for metadata editing until their content has been processed).
- For processing data on the EIDF, Galvanalyser's harvesters should have a parallel implementation whose inputs and outputs are not monitored directories and the Galvanalyser backend, but the CKAN data lake and a Galvanalyser-style database to which they can write directly (see the sketch after this list). Once files are completely processed, records should be sent to the CKAN data lake and linked to the raw data.
- Note: the EIDF harvesters will still need some of Galvanalyser's core information about column types, units, etc. When sending processed records back to the CKAN data lake, it may be necessary to expand some of that generic information to provide complete files.
- To provide access to data, a READ ONLY version of the Galvanalyser REST API, together with its Python API and web frontend counterparts, should be plugged into the EIDF Galvanalyser database.
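The sketch below illustrates the parallel-implementation idea from the second bullet: the harvester core is written against abstract input/output backends, so the same parsing code can run against monitored directories in the lab and against the CKAN data lake on the EIDF. All class and function names here are hypothetical.

```python
"""Sketch of the shared-core restructure: one parsing core behind abstract
input/output backends. All class and function names are hypothetical."""
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Callable, Iterable


class InputBackend(ABC):
    @abstractmethod
    def pending_files(self) -> Iterable[Path]:
        """Yield raw files that have not yet been processed."""


class OutputBackend(ABC):
    @abstractmethod
    def write(self, dataset: dict) -> None:
        """Persist a dataset in the generic Galvanalyser format."""


class DirectoryInput(InputBackend):
    """Lab setup: read from a monitored directory."""

    def __init__(self, root: Path):
        self.root = root

    def pending_files(self) -> Iterable[Path]:
        # The real harvester tracks per-file state; this just lists files.
        return (p for p in self.root.rglob("*") if p.is_file())


class CKANDataLakeInput(InputBackend):
    """EIDF setup: pull unprocessed files from the CKAN data lake."""

    def pending_files(self) -> Iterable[Path]:
        ...  # package_search + download, as in the earlier sketch


def harvest(source: InputBackend, sink: OutputBackend,
            parse: Callable[[Path], dict]) -> None:
    """The shared core: parsing is identical in both environments."""
    for path in source.pending_files():
        sink.write(parse(path))
```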