This repository contains a library for Extract, Transform and Load processes for ResourceProjects.org.
You can report issues with current transformations, or suggest sources which should be added to this library using the GitHub issue tracker.
Each process, located in the `process` folder, consists of a collection of files that either (a) document a manual transformation of the data, or (b) perform an automated transformation. Folders may contain (see the sketch after this list):
- A `README.md` file describing the transformation
- An `extract.sh` or `extract.py` file to fetch the source data
- A `data/` subfolder where the extracted data is stored during conversion (ignored by git)
- A `transform.py` file which runs the transformations
- A `meta.json` file containing the metadata which `transform.py` will use
- A `prov.ttl` file containing provenance information (using PROV-O) to be merged into the final graph
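As a rough sketch, an `extract.sh` for a simple automated process might look like the following (the URL and filename are hypothetical, for illustration only):

```bash
#!/bin/bash
# Hypothetical example: fetch a source spreadsheet into this process's
# local data/ folder (ignored by git), ready for transform.py to use.
set -e
mkdir -p data
curl -L -o data/source.xlsx "http://example.org/some-dataset.xlsx"
```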
The output of each process should be written to the root data/ folder, from where it can be loaded onto the ResourceProjects.org platform.
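To run a single process by hand, the flow is roughly as follows (the folder name here is hypothetical):

```bash
cd process/example-source    # hypothetical process folder
./extract.sh                 # fetch the source data into the local data/ subfolder
python transform.py          # write the converted output (conventionally to the root data/ folder)
```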
The project acts as an extension module for https://github.com/OpenDataServices/cove. First clone and test that project before using this one.
Then, assuming you have a common folder for the two git clones (i.e. this repo can be found at `../resource-projects-etl` relative to cove), perform these steps from within the cove folder:
cp ../resource-projects-etl/requirements_taglifter.txt ./
pip install -r requirements_taglifter.txt
cp ../resource-projects-etl/requirements.txt ./
pip install -r requirements.txt
cp -R ../resource-projects-etl/modules ./
cp ../resource-projects-etl/setup.py ./
python setup.py install
mkdir db
export DB_NAME=./db.sqlite
export DJANGO_SETTINGS_MODULE=settings
python manage.py migrate --noinput
python manage.py compilemessages
python manage.py collectstatic --noinput
cp -R ../resource-projects-etl/ontology ./ontology
gunicorn cove.wsgi -b 0.0.0.0:8000 --timeout 600 -w 3 -k eventlet
You will need a Virtuoso container running.
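If you don't already have one, something along these lines should work (this mirrors the live-deploy command later in this document; adjust the tag and port mapping as needed):

```bash
# Start a Virtuoso container named "virtuoso", detached, bound to localhost
docker run -d --name virtuoso -p 127.0.0.1:8890:8890 opendataservices/virtuoso:live
```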
docker rm -f rp-etl
docker run --name rp-etl --link virtuoso:virtuoso -p 127.0.0.1:8000:80 -e DBA_PASS=dba opendataservices/resource-projects-etl
Update DBA_PASS as appropriate.
Then visit http://localhost:8000/
The OpenDataServices dev deploy can be found at https://github.com/OpenDataServices/opendataservices-deploy/blob/master/salt/resource-projects.sls (this is a SaltStack state file).
For a live deploy running Docker directly (you probably don't want to do this, but the commands below should translate to your preferred deployment approach), you could do:
# Create the volume containers
docker create --name virtuoso-data -v /usr/local/var/lib/virtuoso/db opendataservices/virtuoso:live
docker create --name etl-data -v /usr/src/resource-projects-etl/db -v /usr/src/resource-projects-etl/src/cove/media opendataservices/resource-projects-etl:live
# Run the containers
# Virtuoso
docker run -p 127.0.0.1:8890:8890 --volumes-from virtuoso-data --name virtuoso opendataservices/virtuoso:live
# ETL
docker run -p 127.0.0.1:8001:80 --link virtuoso:virtuoso -e "DBA_PASS=dba" -e FRONTEND_LIVE_URL=http://resourceprojects.org/ -e FRONTEND_STAGING_URL=http://staging.resourceprojects.org/ --volumes-from etl-data opendataservices/resource-projects-etl:live
# Frontend (Live)
docker run -p 127.0.0.1:8080:80 --link virtuoso:virtuoso-live -e BASE_URL=http://resourceprojects.org/ -e SPARQL_ENDPOINT=http://virtuoso-live:8890/sparql -e DEFAULT_GRAPH_URI=http://resourceprojects.org/data/ opendataservices/resourceprojects.org-frontend:live
# Frontend (Staging)
docker run -p 127.0.0.1:8081:80 --link virtuoso:virtuoso-staging -e BASE_URL=http://staging.resourceprojects.org/ -e SPARQL_ENDPOINT=http://virtuoso-staging:8890/sparql -e DEFAULT_GRAPH_URI=http://staging.resourceprojects.org/data/ opendataservices/resourceprojects.org-frontend:live
# Perform initial virtuoso setup
# (this needs running from the directory containing `virtuoso_setup.sql`)
cat virtuoso_setup.sql | docker run --link virtuoso:virtuoso -i --rm opendataservices/virtuoso:live isql virtuoso
If `BASE_URL` does not match the URL the sites are exposed at, site navigation won't work correctly. Similarly, for the ETL container, `FRONTEND_LIVE_URL` and `FRONTEND_STAGING_URL` should be set to the relevant deployed URLs. On the other hand, `SPARQL_ENDPOINT`, `DEFAULT_GRAPH_URI` and the contents of `virtuoso_setup.sql` should be left exactly as they are here. (`SPARQL_ENDPOINT` refers to hostnames that are wired up inside the Docker container by `--link`, whereas `DEFAULT_GRAPH_URI` and the contents of `virtuoso_setup.sql` are Virtuoso's internal URIs, and don't relate to the URL the site is actually accessible at.)
The above commands expose ports 8890, 8001, 8080 and 8081 on localhost. Edit these to match your needs, or place a reverse proxy in front of them.
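For example, to expose the live frontend on local port 9080 instead (an arbitrary choice), only the `-p` mapping changes from the command above:

```bash
# Same as the live frontend command above, but bound to localhost:9080
docker run -p 127.0.0.1:9080:80 --link virtuoso:virtuoso-live -e BASE_URL=http://resourceprojects.org/ -e SPARQL_ENDPOINT=http://virtuoso-live:8890/sparql -e DEFAULT_GRAPH_URI=http://resourceprojects.org/data/ opendataservices/resourceprojects.org-frontend:live
```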
You should update the Virtuoso admin password, first through the Virtuoso HTTP user interface, and then in the `DBA_PASS` environment variable passed to the ETL container.
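After changing the password via the Virtuoso interface, recreate the ETL container with the matching value; for example (the container name and password here are placeholders):

```bash
# Remove the old ETL container and start a new one with the updated password
docker rm -f etl
docker run -p 127.0.0.1:8001:80 --link virtuoso:virtuoso -e "DBA_PASS=new-password-here" -e FRONTEND_LIVE_URL=http://resourceprojects.org/ -e FRONTEND_STAGING_URL=http://staging.resourceprojects.org/ --volumes-from etl-data opendataservices/resource-projects-etl:live
```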
To get more recent builds than live, replace `:live` with `:master` in the commands above.
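For example, to fetch the latest master build of the ETL image:

```bash
docker pull opendataservices/resource-projects-etl:master
```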
Run this against the ETL container (you will need to replace `etl` with the name of your container):
docker exec etl manage.py migrate
Backup:
docker run --volumes-from etl-data -v $(pwd):/backup opendataservices/virtuoso:master tar cvzf /backup/etl-data.tar.gz /usr/src/resource-projects-etl/db /usr/src/resource-projects-etl/src/cove/media
Restore:
docker run -it --volumes-from etl-data -v $(pwd):/backup opendataservices/virtuoso:master tar xvzf /backup/etl-data.tar.gz -C /
To build the Docker image yourself, from the repository root:
docker build -t opendataservices/resource-projects-etl .
Then run it as described above. (You may want to use a different name for your own image, so as not to confuse it with the images actually from Docker Hub.)
To run the transformations locally you will need:
- Python 3
- Bash
Set up a Python 3 virtual environment, install the dependencies, and run all the transformations:
virtualenv .ve --python=/usr/bin/python3
source .ve/bin/activate
pip install -r requirements.txt
./transform_all.sh
You will then have some data as Turtle in the data/ directory.
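If you want a quick sanity check of the output, any Turtle parser will do; for example, with `rapper` from the raptor2-utils package (the filename below is illustrative):

```bash
# Parse a generated file and print a triple count; fails on syntax errors
rapper -c -i turtle data/example.ttl
```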
Copyright (c) 2015 Natural Resource Governance Institute
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
Note that some parts of the ETL tooling depend on CoVE, which is licensed under the AGPLv3 and must therefore be used in accordance with that license.