The "geohub-data-pipeline" is a service that receives the filepath of an Azure blob from a service bus queue from uploads to the Geohub platform and converts them to Cloud-Optimized GeoTIFF (COG) or PMTiles format and stores them in an Azure Blob storage container. This project is designed to be run in a Docker container on AKS as a deployment.
To use this API, you'll need:
- An Azure account with access to an Azure Blob storage container for the UNDP Data Geohub
- Docker installed on your local machine
- An AKS cluster set up, with kubectl configured to connect to it

To get started:
- Clone this repository to your local machine.
- Navigate to the project directory in your terminal or command prompt.

Any changes pushed to this repo will automatically trigger a build on the UNDP Data Azure Container Registry.
To configure the API, you'll need to create a `.env` file in the `deployments/scripts` directory.
Here are the configuration options:
- AZURE_STORAGE_CONNECTION_STRING
- SERVICE_BUS_CONNECTION_STRING
- SERVICE_BUS_QUEUE_NAME
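A `.env` file for the deployment might look like the following; the values below are placeholders, so substitute the connection strings and queue name from your own Azure resources:

```ini
# Placeholder values -- replace with the connection strings from your Azure resources.
AZURE_STORAGE_CONNECTION_STRING=DefaultEndpointsProtocol=https;AccountName=<storage-account>;AccountKey=<key>;EndpointSuffix=core.windows.net
SERVICE_BUS_CONNECTION_STRING=Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<policy>;SharedAccessKey=<key>
SERVICE_BUS_QUEUE_NAME=<queue-name>
```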
To use the API, follow these steps:
- Deploy the Docker container to your AKS cluster by running `deployments/scripts/install.sh`.
- Upload a file on the Geohub dev app.
- A message will be sent to the service bus queue with the filepath of the uploaded file and the user token for authentication.
- The API will determine whether the file is a raster or vector file, convert it to COG or PMTiles format, and store it in the user's "datasets" directory in Azure.
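For a rough picture of the message flow described above, the sketch below shows how such a queue could be polled with the `azure-servicebus` package. It is only an illustration: `ingest_file` is a hypothetical placeholder, and the real message format and conversion logic live in the `ingest` package.

```python
# Minimal sketch of a Service Bus queue consumer (illustrative, not the pipeline's actual code).
import os

from azure.servicebus import ServiceBusClient


def ingest_file(message_body: str) -> None:
    """Hypothetical placeholder: parse the blob path and user token, then convert and upload."""
    print(f"Would ingest: {message_body}")


def main() -> None:
    conn_str = os.environ["SERVICE_BUS_CONNECTION_STRING"]
    queue_name = os.environ["SERVICE_BUS_QUEUE_NAME"]

    with ServiceBusClient.from_connection_string(conn_str) as client:
        with client.get_queue_receiver(queue_name=queue_name) as receiver:
            for message in receiver:  # blocks and yields messages as they arrive
                ingest_file(str(message))  # body carries the uploaded file's path and the user token
                receiver.complete_message(message)  # remove the processed message from the queue


if __name__ == "__main__":
    main()
```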
To test the app locally, you can use Docker Compose to build and run the app in a local development environment.
- Clone the repository to your local machine:
  git clone https://github.com/UNDP-Data/geohub-data-pipeline.git
- Navigate to the cloned repository:
  cd geohub-data-pipeline
- Create your `.env` file in the root directory of the repository by copying `.env.example`. This file contains the environment variables used by the app. The following variables are required:
  - AZURE_STORAGE_CONNECTION_STRING
  - SERVICE_BUS_CONNECTION_STRING
  - AZURE_WEBPUBSUB_CONNECTION_STRING
- Build and run the app using Docker Compose:
  docker-compose -f ./dockerfiles/docker-compose.yaml up --build
  This command builds the app's Docker image and starts the app in a Docker container. The `--build` flag ensures that the image is rebuilt whenever there are changes to the app's code.
- Once the container is running, it will automatically begin scanning for messages in the service bus queue. When a message is found, it is processed: the file is converted to COG or PMTiles format and stored in the user's "datasets" directory in Azure.
- To stop the app, use the following command:
  docker-compose down
  This command stops and removes the Docker container.
Note: This is only a local development environment. For production deployment, you should use a production-grade Docker image and deploy it to a production environment.
While the pipeline was designed to run in, and leverage, a cloud environment, it can sometimes be desirable to run it as a command-line tool. In general, the best way to run GDAL-dependent Python software is through Docker, which avoids potential issues arising from GDAL dependencies and version mismatches.
A viable alternative would be to create a setup.py/pyproject.toml, but that assumes the user has a decent understanding of GDAL internals.
For the sake of simplicity, we choose to rely on Docker.
- Clone the repository to your local machine:
  git clone https://github.com/UNDP-Data/geohub-data-pipeline.git
- Navigate to the cloned repository:
  cd geohub-data-pipeline
- Build the Docker image and assign it a meaningful tag:
  docker build . -t geohub-data-pipeline
- Run the command-line version:
  docker run --rm -it -v /home/janf/Downloads/data:/data geohub-data-pipeline python -m ingest.cli.main -src /data/Sample.gpkg -dst /data/out/
  The above command runs the previously built image in a new container (which is removed afterwards), maps the "/home/janf/Downloads/data" folder to the /data folder inside the container, and calls the command-line version of the pipeline to convert all vector layers to PMTiles and all raster bands to COG format.
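Optionally, you can check which GDAL version is baked into the image. This assumes the base image ships the standard GDAL command-line utilities, which the FAQ below suggests:

```bash
docker run --rm -it geohub-data-pipeline gdalinfo --version
```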
If the command-line script is called without any arguments, the help is displayed:
docker run --rm -it -v /home/janf/Downloads/data:/data geohub-data-pipeline python -m ingest.cli.main
/usr/local/lib/python3.10/dist-packages/rio_cogeo/profiles.py:182: UserWarning: Non-standard compression schema: zstd. The output COG might not be fully supported by software not build against latest libtiff.
warnings.warn(
usage: main.py [-h] [-src SOURCE_FILE] [-dst DESTINATION_DIRECTORY] [-j] [-d]
Convert layers/bands from GDAL supported geospatial data files to COGs/PMtiles.
options:
-h, --help show this help message and exit
-src SOURCE_FILE, --source-file SOURCE_FILE
A full absolute path to a geospatial data file. (default: None)
-dst DESTINATION_DIRECTORY, --destination-directory DESTINATION_DIRECTORY
A full absolute path to a folder where the files will be written. (default: None)
-j, --join-vector-tiles
Boolean flag to specify whether to create a multilayer PMtiles file in case the source file contains more than one vector layer (default: False)
-d, --debug Set log level to debug (default: False)
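For example, to merge all vector layers from the source into a single multilayer PMTiles file and turn on debug logging, the `-j` and `-d` flags documented above can be added to the earlier command (same volume mapping and sample file):

```bash
docker run --rm -it -v /home/janf/Downloads/data:/data geohub-data-pipeline python -m ingest.cli.main -src /data/Sample.gpkg -dst /data/out/ -j -d
```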
- What geospatial formats are supported? ANSWER: anything GDAL supports. Note that the image uses the latest GDAL base image (3.7).
- What does the tool do?
  ANSWER:
  - converts all vector layers to PMTiles format
  - optionally creates a multilayer PMTiles file in case the join argument is set to True
  - converts all raster bands or RGB rasters into COG, featuring zstd compression and Web Mercator projection (a sketch of this kind of conversion is shown below)
- Are any environment variables needed? ANSWER: NO
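The rio-cogeo warning shown in the help output above indicates that the raster path relies on rio-cogeo. As a rough, standalone illustration of that kind of conversion (not the pipeline's actual code, and with example paths), a web-optimized COG with zstd compression can be produced like this:

```python
# Sketch of a COG conversion with ZSTD compression and Web Mercator tiling using rio-cogeo.
# Paths and the function name are illustrative only.
from rio_cogeo.cogeo import cog_translate
from rio_cogeo.profiles import cog_profiles


def to_cog(src_path: str, dst_path: str) -> None:
    profile = cog_profiles.get("zstd")  # GTiff creation options with ZSTD compression
    cog_translate(
        src_path,
        dst_path,
        profile,
        web_optimized=True,  # tile/reproject to the Web Mercator (WebMercatorQuad) grid
    )


if __name__ == "__main__":
    to_cog("/data/Sample.tif", "/data/out/Sample_cog.tif")
```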