
Commit

Merge pull request #24 from NRGI/taglifter
Merge in work based on Tim's taglifter library.

Includes multiple transform scripts, and a transform/load workflow that works entirely within docker containers.
Bjwebb committed Jul 2, 2015
2 parents f2d6395 + 3df96e4 commit 0e9f3eb
Showing 52 changed files with 10,046 additions and 21,711 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -1,6 +1,8 @@
__pycache__
data
data/*
*.swp
*~
.ve
.ipynb_checkpoints
process/*/data
ontology/catalog-v001.xml
2 changes: 2 additions & 0 deletions Dockerfile
@@ -0,0 +1,2 @@
FROM python:3.4-onbuild
CMD "./transform_all.sh"
39 changes: 35 additions & 4 deletions README.md
@@ -1,14 +1,45 @@
# resource-projects-etl
Extract, Transform and Load processes for rp.org

# Requirements
This repository contains a library for Extract, Transform and Load processes for ResourceProjects.org.

Python 3
You can report issues with current transformations, or suggest sources which should be added to this library using the GitHub issue tracker.

# Getting started

## Processes
Each process, located in the **process** folder, consists of a collection of files that either (a) document a manual transformation of the data; or (b) perform an automated transformation.

Folders may contain:

* A README.md file describing the transformation
* An extract.sh or extract.py file to fetch the file
* A data/ subfolder where the extracted data is stored during conversion (ignored by git)
* A transform.py file which runs the transformations
* A meta.json file, containing the meta-data which transform.py will use
* A prov.ttl file containing provenance information (using [PROV-O](https://www.w3.org/TR/prov-o/)) to be merged into the final graph

The output of each process should be written to the root /data/ folder, from where it can be loaded onto the ResourceProjects.org platform.
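
The folder conventions listed above could be checked with a short script. The following is a minimal sketch and not part of this repository: the helper names `describe_process` and `describe_all` are assumptions made for illustration, and only the file names come from the list above. It reports which of the conventional entries each process folder actually contains.

```python
from pathlib import Path

# File and folder names taken from the list above; every one is optional.
CONVENTIONAL_ENTRIES = [
    "README.md",
    "extract.sh",
    "extract.py",
    "data",
    "transform.py",
    "meta.json",
    "prov.ttl",
]


def describe_process(folder):
    """Return {entry name: True/False} for one process folder (hypothetical helper)."""
    folder = Path(folder)
    return {name: (folder / name).exists() for name in CONVENTIONAL_ENTRIES}


def describe_all(process_root="process"):
    """Map each subfolder of the process folder to its describe_process() report."""
    root = Path(process_root)
    if not root.is_dir():
        return {}
    return {
        child.name: describe_process(child)
        for child in sorted(root.iterdir())
        if child.is_dir()
    }


if __name__ == "__main__":
    for name, report in describe_all().items():
        present = [entry for entry, found in report.items() if found]
        print(name, "->", ", ".join(present) or "(no conventional files)")
```

Run from the repository root, this prints one line per process folder, which may help spot a process that lacks a README.md or meta.json.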



## Requirements

* Python 3
* Bash

### Getting started

```
virtualenv .ve --python=/usr/bin/python3
source .ve/bin/activate
pip install -r requirements.txt
```

### Running with docker

```
docker rm -f rp-etl rp-load
docker run --name rp-etl -v /usr/src/app/data -v /usr/src/app/ontology bjwebb/resource-projects-etl
docker run --name rp-load --link virtuoso:virtuoso --volumes-from virtuoso --volumes-from rp-etl --rm bjwebb/resource-projects-etl-load
```

To run the last command you will need a [virtuoso container running](https://github.com/NRGI/resourceprojects.org-frontend/#pre-requisites).
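
The same sequence can also be driven from Python. This is a sketch only: the helper names `docker_commands` and `run_workflow` are hypothetical, and the arguments are copied verbatim from the commands above. It assumes the `docker` CLI is on the PATH and that a container named `virtuoso` is already running.

```python
import subprocess


def docker_commands():
    """Return the three commands from the README as argument lists."""
    return [
        ["docker", "rm", "-f", "rp-etl", "rp-load"],
        ["docker", "run", "--name", "rp-etl",
         "-v", "/usr/src/app/data", "-v", "/usr/src/app/ontology",
         "bjwebb/resource-projects-etl"],
        ["docker", "run", "--name", "rp-load",
         "--link", "virtuoso:virtuoso",
         "--volumes-from", "virtuoso", "--volumes-from", "rp-etl",
         "--rm", "bjwebb/resource-projects-etl-load"],
    ]


def run_workflow(runner=subprocess.run):
    """Run cleanup, transform and load in order (hypothetical wrapper)."""
    cleanup, transform, load = docker_commands()
    # Cleanup may report an error when the containers do not exist yet
    # (assumption: that is safe to ignore), so its exit status is not checked.
    runner(cleanup, check=False)
    # A failed transform should stop the workflow before the load step.
    runner(transform, check=True)
    runner(load, check=True)
```

Calling `run_workflow()` runs the three steps in order. Passing a different `runner` makes it possible to inspect the commands without starting any containers.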
61 changes: 0 additions & 61 deletions create_rdf.py

This file was deleted.

4 changes: 4 additions & 0 deletions data/.gitignore
@@ -0,0 +1,4 @@
# Ignore everything in this directory
*
# Except this file
!.gitignore
1 change: 0 additions & 1 deletion data/indonesia/3-openoil-concessions-indonesia.csv

This file was deleted.

57 changes: 0 additions & 57 deletions data_to_pandas.py

This file was deleted.

82 changes: 0 additions & 82 deletions disambig.py

This file was deleted.

1 change: 0 additions & 1 deletion eiti-project-level.csv

This file was deleted.

4 changes: 4 additions & 0 deletions load/Dockerfile
@@ -0,0 +1,4 @@
FROM caprenter/automated-build-virtuoso
ADD load.sh /load.sh
ADD import.sql /import.sql
CMD /load.sh
11 changes: 11 additions & 0 deletions load/README.md
@@ -0,0 +1,11 @@
Docker container definition for loading data into virtuoso.

You will need a [virtuoso container running](https://github.com/NRGI/resourceprojects.org-frontend/#pre-requisites).

Then from this directory:

```
docker build -t rp-load .
cd ..
docker run --name rp-load --link virtuoso:virtuoso --volumes-from virtuoso -v `pwd`/data:/data --rm rp-load
```
4 changes: 4 additions & 0 deletions load/import.sql
@@ -0,0 +1,4 @@
SPARQL CLEAR GRAPH <http://resourceprojects.org/>;
delete from db.dba.load_list;
ld_dir_all('/usr/local/var/lib/virtuoso/db/import', '*', 'http://resourceprojects.org/');
rdf_loader_run();
6 changes: 6 additions & 0 deletions load/load.sh
@@ -0,0 +1,6 @@
cd /usr/local/var/lib/virtuoso/db/
rm -r import
mkdir import
cp /usr/src/app/data/* import
cp /usr/src/app/ontology/*.rdf import
isql virtuoso dba dba /import.sql
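
For readers more at home in Python, the staging steps of `load.sh` (recreate `import`, then copy the data files and the ontology `.rdf` files into it) might look like the sketch below. The function name `stage_import` and its parameters are assumptions for illustration, and the final `isql` call is left out. Unlike `rm -r`, the sketch does not complain when `import` is missing. Hidden files are skipped to mirror how the shell's `*` glob ignores dotfiles.

```python
import shutil
from pathlib import Path


def stage_import(data_dir, ontology_dir, import_dir):
    """Recreate import_dir and copy in data files and ontology *.rdf files."""
    data_dir, ontology_dir, import_dir = map(
        Path, (data_dir, ontology_dir, import_dir)
    )

    # Equivalent of `rm -r import; mkdir import`, tolerating a missing directory.
    shutil.rmtree(import_dir, ignore_errors=True)
    import_dir.mkdir(parents=True)

    copied = []
    sources = sorted(data_dir.glob("*")) + sorted(ontology_dir.glob("*.rdf"))
    for src in sources:
        # Plain `cp` without -r skips directories, and the shell's `*`
        # does not match names starting with a dot.
        if src.is_file() and not src.name.startswith("."):
            shutil.copy(src, import_dir / src.name)
            copied.append(src.name)
    return copied
```

With the paths used in `load.sh`, the call would be `stage_import("/usr/src/app/data", "/usr/src/app/ontology", "/usr/local/var/lib/virtuoso/db/import")`.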