This repository holds source code and test materials for developing ETL components that power data and metadata extraction, transformation, and loading for the DARPA SD2 program. ETL components are developed and operated in TACC's Cloud API platform, which features the Agave and Reactors application runtimes.
- Agave is a platform-as-a-service that is the foundation of several large cyberinfrastructure platforms, including CyVerse and DesignSafe. Designed from the ground up to support reproducible and collaborative science, it provides data management and marshalling, full application lifecycle support, identity management and access control, document store functionality, callback-driven programming, and integration with diverse cloud, HPC, and HTC resource types.
- Reactors is a REST-based web service that brings functions-as-a-service to analytical computing. The system is in active development and will be integrated into the ETL process after the Q0 working meeting. More information about Reactors will be available soon.
- The software assets powering each application and ETL process are packaged into a versioned Docker container.
- These containers are either derived from SD2E's base images or constructed to align with the operational requirements of the SD2E platform.
- Each application is deployed as an Agave application, a Reactor, or both. The process for doing so is documented in tutorial materials and in the working code found in this repository.
- Applications can be used in the TACC Cloud API via an interactive web workspace, inside Jupyter notebooks (SD2E-hosted or third-party), within Python scripts and programs using the AgavePy library, or via an interactive CLI.
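
As an illustration of the Python route, the sketch below assembles a job request and hands it to an AgavePy client. It is a minimal sketch under stated assumptions, not the documented SD2E workflow: the app ID `my-etl-app-0.1.0`, the input key `input_file`, and the `agave://example-storage-system/...` URI are hypothetical placeholders, and the commented-out `Agave.restore()` call assumes that credentials have already been cached locally.

```python
import json


def build_job_request(app_id, input_uri, name="etl-example"):
    """Assemble a job request document for an Agave application.

    The input key "input_file" is a hypothetical name; use the input IDs
    declared in your own app definition.
    """
    return {
        "name": name,
        "appId": app_id,
        "archive": True,
        "inputs": {"input_file": input_uri},
        "parameters": {},
    }


def submit_job(client, job):
    """Submit the job through an AgavePy client and return its response."""
    return client.jobs.submit(body=job)


if __name__ == "__main__":
    job = build_job_request(
        app_id="my-etl-app-0.1.0",  # hypothetical app ID
        input_uri="agave://example-storage-system/sample/input.csv",  # hypothetical
    )
    print(json.dumps(job, indent=2, sort_keys=True))
    # To submit for real (requires agavepy and cached credentials):
    #   from agavepy.agave import Agave
    #   response = submit_job(Agave.restore(), job)
```

Keeping the request builder separate from the submission call means the job document can be inspected or tested without network access.
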
- Docker 17.X.X-ce
- Python 2.7.10+
- Bash 3.2.57+
- Git 2.12+
- jq 1.5+
- A GitHub account
- A Docker Hub account
- An active SD2E account
- SD2E API User's Guide
- Agave API Developer Docs
- SD2 App ETL Example
- SD2 App ETL Jupyter Notebook
- sd2e/base
  - ubuntu16 (recommended)
  - ubuntu14
  - alpine36
- sd2e/python2
  - ubuntu16 (recommended)
  - ubuntu14
- sd2e/python3
  - ubuntu16 (recommended)
  - ubuntu14
- Write a clean Dockerfile so there is no question of source code or version provenance. Minimize image size where possible by removing, for example, source code tarballs and installation directories.
- Design a robust but small and portable test case to package with the app bundle. Make liberal use of error checking in `tester.sh` and `runner_template.sh`.
- Use only command line arguments when calling the containerized executable (with the `container_exec` function). If the executable requires a configuration file, use a wrapper script inside the container to parse inputs from the command line and generate the appropriate configuration file.
- Explicitly declare all inputs, and explicitly write all outputs. This includes file name and full path.
- Package and curate outputs into a user-friendly format. Some use cases may benefit from a tarball of all output files; some use cases may benefit from individual files.
- Make output file names deterministic and predictable to facilitate scripting and job chaining.
- Document all expected outputs in the `tester.sh` and `runner-template.sh` wrapper scripts. Where appropriate, validate output and provide helpful error messaging.
- Share your Docker images and app bundles with the SD2E community to benefit others and elicit feedback.
Best practices were adapted from the Computational Genomics Lab.
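
Several of the practices above (command-line-only invocation, a wrapper that generates the configuration file, deterministic output names, and output validation) can be sketched together in a small Python wrapper. This is an illustrative sketch for Python 3, not the SD2E implementation: the option names, the INI layout, the `run.ini` file name, and the `<sample>.results.csv` naming scheme are all assumptions chosen for the example.

```python
import argparse
import configparser
import os
import sys
import tempfile


def parse_args(argv):
    """Parse every input and output location from the command line."""
    parser = argparse.ArgumentParser(description="Hypothetical ETL wrapper")
    parser.add_argument("--input", required=True, help="full path to input file")
    parser.add_argument("--outdir", required=True, help="directory for outputs")
    parser.add_argument("--threshold", type=float, default=0.5)
    return parser.parse_args(argv)


def output_path(input_path, outdir):
    """Derive a deterministic output name from the input file name."""
    stem = os.path.splitext(os.path.basename(input_path))[0]
    return os.path.join(outdir, stem + ".results.csv")


def write_config(args, config_path):
    """Generate the config file that the wrapped executable would read."""
    config = configparser.ConfigParser()
    config["run"] = {
        "input": os.path.abspath(args.input),
        "output": os.path.abspath(output_path(args.input, args.outdir)),
        "threshold": str(args.threshold),
    }
    with open(config_path, "w") as handle:
        config.write(handle)
    return config_path


def validate_output(path):
    """Exit with an error message if an expected output is missing or empty."""
    if not os.path.isfile(path):
        sys.exit("ERROR: expected output not found: " + path)
    if os.path.getsize(path) == 0:
        sys.exit("ERROR: expected output is empty: " + path)


def main(argv):
    """Check the input, create the output directory, and write the config."""
    args = parse_args(argv)
    if not os.path.isfile(args.input):
        sys.exit("ERROR: input file not found: " + args.input)
    os.makedirs(args.outdir, exist_ok=True)
    return write_config(args, os.path.join(args.outdir, "run.ini"))


if __name__ == "__main__":
    # Demonstration run in a scratch directory; a real wrapper would call
    # main(sys.argv[1:]) instead.
    scratch = tempfile.mkdtemp()
    sample = os.path.join(scratch, "sample01.csv")
    with open(sample, "w") as handle:
        handle.write("a,b\n1,2\n")
    print(main(["--input", sample, "--outdir", scratch]))
```

Because the output name is derived only from the input name and the output directory, a downstream job can compute the path it needs without listing files, and `validate_output` can be called on that path once the executable has finished.
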