The Velexi Dataset Cookiecutter is intended to streamline the process of creating a dataset that

- improves the reproducibility of research analysis (including machine learning experiments, data science analysis, and traditional scientific and engineering studies) by applying version control principles to datasets,
- facilitates efficient exploration of datasets by standardizing the directory structure used to organize data,
- increases reuse of datasets across projects by decoupling datasets from data analysis code, and
- simplifies dataset maintenance by keeping dataset management code (e.g., cleanup scripts) with the dataset.
Features

- A simple, consistent dataset directory structure
- Quick references for dataset maintenance tools (e.g., FastDS)
- Pre-configured to support development of Python software tools
- Integration with code and data quality tools (e.g., pre-commit)
- 2.1. License
- 2.2. Repository Contents
- 2.4. Setting Up to Develop the Cookiecutter
- 2.5. Additional Notes
The cookiecutter prompts for the following template parameters.

- `dataset_name`: dataset name
- `author`: dataset's primary author (or maintainer)
- `email`: primary author's (or maintainer's) email
- `dataset_license`: type of license to use for the dataset
- `software_license`: type of license to use for supporting software
- `python_version`: Python versions compatible with the project. See the "Dependency specification" section of the Poetry documentation for version specifier semantics.
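These parameters correspond to the template variables defined in the cookiecutter's `cookiecutter.json`. The sketch below is purely illustrative: the field names follow the list above, but the default values and the license/storage choices shown are assumptions, not the cookiecutter's actual configuration.

```json
{
    "dataset_name": "my-dataset",
    "author": "Author Name",
    "email": "author@example.com",
    "dataset_license": ["CC-BY-4.0", "CC0-1.0"],
    "software_license": ["Apache License 2.0", "BSD-3-Clause License", "MIT License"],
    "python_version": "^3.9",
    "dvc_remote_storage_provider": ["AWS S3", "Local filesystem", "None"]
}
```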
1. Prerequisites

   - Install Git.

   - Install Python 3.9 (or greater).

   - Install Poetry 1.2 (or greater).

     Note: the dataset template uses `poetry` instead of `pip` for management of Python package dependencies.

   - Install the Cookiecutter Python package.

   - Optional. Install direnv.

2. Use `cookiecutter` to create a new Python project.

   ```shell
   $ cookiecutter https://github.com/velexi-research/VLXI-Cookiecutter-Dataset.git
   ```
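   Cookiecutter will prompt for the template parameters listed above. If you prefer to skip the prompts, cookiecutter's standard `--no-input` option can be combined with `key=value` arguments; the values below are placeholders.

   ```shell
   $ cookiecutter --no-input https://github.com/velexi-research/VLXI-Cookiecutter-Dataset.git \
         dataset_name=my-dataset author="Author Name" email=author@example.com
   ```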
3. Set up a dedicated virtual environment for the project. Any of the common virtual environment options (e.g., `venv`, `direnv`, `conda`) should work. Below are instructions for setting up a `direnv` or `poetry` environment.

   Note: to avoid conflicts between virtual environments, only one method should be used to manage the virtual environment.

   - `direnv` Environment. Note: `direnv` manages the environment for both Python and the shell.

     - Prerequisite. Install `direnv`.

     - Copy `extras/dot-envrc` to the project root directory, and rename it to `.envrc`.

       ```shell
       $ cd $PROJECT_ROOT_DIR
       $ cp extras/dot-envrc .envrc
       ```

     - Grant permission to direnv to execute the `.envrc` file.

       ```shell
       $ direnv allow
       ```
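     Before running `direnv allow`, you may want to review what `.envrc` does. The snippet below is only an illustrative sketch built from direnv's standard `layout python3` helper; the actual `extras/dot-envrc` shipped with the cookiecutter may differ, so treat the file in the repository as authoritative.

     ```shell
     # .envrc (illustrative sketch; see extras/dot-envrc for the real file)

     # Create and activate a project-local Python virtual environment
     layout python3

     # Hypothetical example of a project-specific environment variable
     export DATA_DIR=$PWD/data
     ```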
   - `poetry` Environment. Note: `poetry` only manages the Python environment (it does not manage the shell environment).

     - Create a `poetry` environment that uses a specific Python executable. For instance, if `python3` is on your `PATH`, the following command creates (or activates, if it already exists) a Python virtual environment that uses `python3` for the project.

       ```shell
       $ poetry env use python3
       ```

       For commands to use other Python executables for the virtual environment, see the Poetry Quick Reference.
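       For example, Poetry also accepts a version number or a full path to a Python interpreter (the path below is a placeholder):

       ```shell
       $ poetry env use 3.10
       $ poetry env use /usr/local/bin/python3.10
       ```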
4. Install the Python package dependencies (e.g., pre-commit, DVC, FastDS).

   ```shell
   $ poetry install
   ```

5. Configure Git.

   - Install the Git pre-commit hooks.

     ```shell
     $ pre-commit install
     ```
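     Once installed, the hooks run automatically on each `git commit`. To exercise them immediately against the whole repository, pre-commit's standard `run` command can be used:

     ```shell
     $ pre-commit run --all-files
     ```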
   - Optional. Set up a remote Git repository (e.g., GitHub repository).

     - Create a remote Git repository.

     - Configure the remote Git repository.

       ```shell
       $ git remote add origin GIT_REMOTE
       ```

       where `GIT_REMOTE` is the URL of the remote Git repository.

     - Push the `main` branch to the remote Git repository.

       ```shell
       $ git checkout main
       $ git push -u origin main
       ```
6. Optional. Configure remote storage for DVC (e.g., an AWS S3 bucket).

   - Create remote storage for the dataset. Below are instructions for setting up storage on the local file system or AWS S3.

     - Local File System. Create a directory that DVC can use to store a copy of the dataset (outside of the working dataset directory).

     - AWS S3. Create an S3 bucket that DVC can use for remote storage.

   - Configure the remote DVC storage for the dataset.

     ```shell
     $ dvc remote add -d origin DVC_REMOTE
     $ fds commit "Add DVC remote storage."
     ```

     where `DVC_REMOTE` is the URL of the remote storage for the dataset (e.g., the path to a directory on the local file system or the URL of the S3 bucket).

     Important Note. The name of the remote storage must be set to "origin" if `fds` is used to push data to remote storage. If the name is not set to "origin", `dvc` must be used to push data to remote storage.

   - Configure the credentials that DVC should use to connect to remote storage.

     Note: the `--local` option ensures that these DVC configurations are stored in a local configuration file (`.dvc/config.local`) that should not be committed to the Git repository.

     - AWS S3. Set the AWS profile, AWS access keys, or credentials file. See DVC Documentation: Amazon S3 and Compatible Servers for more options. For example, to set the AWS profile (to use for accessing the "origin" remote storage), use the following command.

       ```shell
       $ dvc remote modify --local origin profile AWS_PROFILE
       ```

       where `AWS_PROFILE` is the AWS profile that should be used to access S3.
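   Once the remote storage is configured, getting data into it typically looks like the following. This is a generic FastDS/DVC usage sketch rather than a required step, and `data/raw` is a placeholder path.

   ```shell
   $ fds add data/raw              # track the data with DVC/Git
   $ fds commit "Add raw data."    # record the change in both DVC and Git
   $ fds push                      # push Git commits and DVC-tracked data to the remotes
   ```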
7. Finish setting up the new dataset.

   - Verify the copyright year and owner in the copyright notice.

     Note: if the software components of the dataset are licensed under the Apache License 2.0, the software copyright notice is located in the `NOTICE` file. Otherwise, the software copyright notice is located in the `LICENSE` file.

   - Update the Python package dependencies to the latest available versions.

     ```shell
     $ poetry update
     ```

   - Fill in any empty fields in `pyproject.toml`.

   - Customize the `README.md` file to reflect the specifics of the dataset.

   - Commit all updated files (e.g., `poetry.lock`) to the dataset Git repository.
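     For example (the file list and commit message are placeholders; include whichever files you changed):

     ```shell
     $ git add pyproject.toml poetry.lock README.md
     $ git commit -m "Finish initial dataset setup"
     ```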
When the dataset includes data from third-party sources, be sure to include a reference to the source and license information (if available) in the `DATASET-NOTICE` file.
The cookiecutter currently only supports two DVC remote storage providers: (1) AWS S3 and (2) the local file system. To use one of the other remote storage providers supported by DVC, use the following steps.
1. Select `None` when `cookiecutter` prompts you for the `dvc_remote_storage_provider`.

2. Add the optional dependencies of the `dvc` Python package that are required for the DVC remote storage type. For instance, to install the packages for supporting Microsoft Azure, use

   ```shell
   $ poetry add dvc[azure]
   ```

3. Follow Step #6 from Section #1.2 using your choice of DVC remote storage.
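   For example, configuring an Azure Blob Storage remote follows the same pattern as Step #6. The container and path below are placeholders, and `connection_string` is just one of the credential options that DVC supports for Azure.

   ```shell
   $ dvc remote add -d origin azure://CONTAINER/PATH
   $ dvc remote modify --local origin connection_string AZURE_CONNECTION_STRING
   $ fds commit "Add DVC remote storage."
   ```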
The contents of this cookiecutter are covered under the Apache License 2.0 (included in the `LICENSE` file). The copyright for this cookiecutter is contained in the `NOTICE` file.
```
├── README.md            <- this file
├── RELEASE-NOTES.md     <- cookiecutter release notes
├── LICENSE              <- cookiecutter license
├── NOTICE               <- cookiecutter copyright notice
├── cookiecutter.json    <- cookiecutter configuration file
├── pyproject.toml       <- Python project metadata file for cookiecutter
│                           development
├── poetry.lock          <- Poetry lockfile
├── docs/                <- cookiecutter documentation
├── extras/              <- additional files that may be useful for
│                           cookiecutter development
├── hooks/               <- cookiecutter scripts that run before and/or
│                           after project generation
├── spikes/              <- experimental code
└── {{cookiecutter.__project_name}}/  <- cookiecutter template
```
See the `[tool.poetry.dependencies]` section of `pyproject.toml`.
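To inspect the dependencies that are actually installed in the development environment (including transitive ones), Poetry's standard `show` command can be used:

```shell
$ poetry show --tree
```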
1. Set up a dedicated virtual environment for cookiecutter development. See Step 3 from Section 2.1 for instructions on how to set up `direnv` and `poetry` environments.

2. Install the Python packages required for development.

   ```shell
   $ poetry install
   ```

3. Install the Git pre-commit hooks.

   ```shell
   $ pre-commit install
   ```

4. Make the cookiecutter better!
To update the Python dependencies for the template (contained in the `{{cookiecutter.__project_name}}` directory), use the following procedure to ensure that package dependencies for developing the non-template components of the cookiecutter do not interfere with package dependencies for the template.
1. Create a local clone of the cookiecutter Git repository to use for cookiecutter development.

   ```shell
   $ git clone git@github.com:velexi-research/VLXI-Cookiecutter-Dataset.git
   ```

2. Use `cookiecutter` from the local cookiecutter Git repository to create a clean project for template dependency updates.

   ```shell
   $ cookiecutter PATH/TO/LOCAL/REPO
   ```

3. In the pristine project, perform the following steps to update the template's package dependencies.

   - Set up a virtual environment for developing the template (e.g., a direnv environment).

   - Use `poetry` or manually edit `pyproject.toml` to (1) make changes to the package dependency list and (2) update the package dependency versions.

   - Use `poetry` to update the package dependencies and versions recorded in the `poetry.lock` file (see the example below).
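   For example, these updates can be made in the pristine project with Poetry's standard commands (`SOME_PACKAGE` is a placeholder):

   ```shell
   $ poetry add SOME_PACKAGE    # add or change an entry in the dependency list
   $ poetry update              # update dependency versions and refresh poetry.lock
   $ poetry lock                # regenerate poetry.lock without installing packages
   ```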
4. Update `{{cookiecutter.__project_name}}/pyproject.toml`.

   - Copy `pyproject.toml` from the pristine project to `{{cookiecutter.__project_name}}/pyproject.toml`.

   - Restore the templated values in the `[tool.poetry]` section to the following:

     ```toml
     [tool.poetry]
     name = "{{ cookiecutter.__project_name }}"
     version = "0.1.0"
     description = ""
     license = "{% if cookiecutter.license == 'Apache License 2.0' %}Apache-2.0{% elif cookiecutter.license == 'BSD-3-Clause License' %}BSD-3-Clause{% elif cookiecutter.license == 'MIT License' %}MIT{% endif %}"
     readme = "README.md"
     authors = ["{{ cookiecutter.author }} <{{ cookiecutter.email }}>"]
     ```

5. Update `{{cookiecutter.__project_name}}/poetry.lock`.

   - Copy `poetry.lock` from the pristine project to `{{cookiecutter.__project_name}}/poetry.lock`.

6. Commit the updated `pyproject.toml` and `poetry.lock` files to the Git repository.