## Authors

- Kevin T. Chu <[email protected]>

## Contents

- 1.2. Directory Structure
- 2.1. Setting Up
- 2.2. Conventions
- 2.3. Environment
- 2.4. Using JupyterLab
This project template is intended to support data science projects that utilize Jupyter notebooks for experimentation and reporting. The design of the template is based on the blog article "Jupyter Notebook Best Practices for Data Science" by Jonathan Whitmore.
Features include:

- compatible with standard version control software;
- automatically saves HTML and `*.py` versions of Jupyter notebooks to facilitate review of both (1) data science results and (2) implementation code;
- supports common data science workflows (for both individuals and teams); and
- encourages separation and decoupling of datasets, R&D work (i.e., Jupyter notebooks), deliverables (i.e., reports), and Python functions and modules refactored from R&D code.
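The HTML/`*.py` auto-save feature is typically implemented with a Jupyter post-save hook (the approach described in Whitmore's article). Below is a minimal sketch of that mechanism, assuming the hook is registered in `jupyter_notebook_config.py`; the hook actually shipped with this template may differ.

```python
# Sketch of a Jupyter post-save hook that exports each saved notebook to
# HTML and to a *.py script. Place in ~/.jupyter/jupyter_notebook_config.py.
import os
import subprocess

def post_save(model, os_path, contents_manager):
    """After each save, export the notebook to HTML and a *.py script."""
    if model['type'] != 'notebook':
        return  # only convert notebooks, not other file types
    directory, filename = os.path.split(os_path)
    # nbconvert writes <name>.py and <name>.html next to the notebook
    subprocess.check_call(
        ['jupyter', 'nbconvert', '--to', 'script', filename], cwd=directory)
    subprocess.check_call(
        ['jupyter', 'nbconvert', '--to', 'html', filename], cwd=directory)

c.FileContentsManager.post_save_hook = post_save
```

Because the exports run on every save, the `*.py` and HTML copies stay in sync with the notebook and can be reviewed with ordinary diff tools.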
- Python (>=3.7)
- Miniconda
  - Required for MLflow Projects and MLflow Models
- Julia (>=1.6)
- `direnv`
## 1.2. Directory Structure

```
README.md
LICENSE
README.md.template
RELEASE-NOTES.md.template
LICENSE.template
requirements.txt
Project.toml
Manifest.toml
bin/
data/
lib/
reports/
research/
template-docs/
template-docs/extras/
```
- `README.md`: this file (same as `README-Data-Science-Project-Template.md` in the `template-docs` directory)

- `LICENSE`: license for Data Science Project Template (same as `LICENSE-Data-Science-Project-Template.md` in the `template-docs` directory)

- `*.template`: template files for the package

  - Template files are indicated by the `template` suffix and contain template parameters denoted by double braces (e.g., `{{ PKG_NAME }}`). Template files are intended to simplify the setup of the package. When used, they should be renamed to remove the `template` suffix.

- `requirements.txt`: `pip` requirements file containing Python packages for the project (e.g., data science, testing, and code quality packages)

- `Project.toml`: Julia package management file containing Julia package dependencies. It is updated whenever new Julia packages are added via the REPL. This file may be safely removed if Julia is not required for the project.

- `Manifest.toml` (generated by Julia): Julia package management file that Julia uses to maintain a record of the state of the Julia environment. This file should not be edited.

- `bin`: directory where scripts and programs should be placed

- `data`: directory where project data should be placed

  - Recommendation: data placed in the `data` directory should be managed using DVC (or a similar tool) rather than being included in the `git` repository. This is especially important for projects with large datasets or datasets containing sensitive information. For projects with small datasets that do not contain sensitive information, it may be reasonable to have the data in the `data` directory managed directly by `git`.

- `lib`: directory containing source code to support the project (e.g., custom code developed for the project, utility modules, etc.)

- `reports`: directory containing reports (in any format) that summarize research results. When a report is prepared as a Jupyter notebook, the notebook should be polished, contain final analysis results (not preliminary results), and is usually the work product of the entire data science team.

- `research`: directory containing Jupyter notebooks used for research-phase work (e.g., exploration and development of ideas, DS/ML experiments). Each Jupyter notebook in this directory should (1) be dated and (2) have the initials of the person who last modified it. When an existing notebook is modified, it should be saved to a new file with a name based on the modification date and initialed by the person who modified it.

- `template-docs`: directory containing documentation for this package template

- `template-docs/extras`: directory containing example and template files
## 2.1. Setting Up

1. Set up the environment for the project using only one of the following approaches.

   - `direnv`-based setup

     1. Copy `template-docs/extras/envrc.template` to `.envrc` in the project root directory.

     2. Grant `direnv` permission to execute the `.envrc` file.

        ```shell
        $ direnv allow
        ```

     3. If needed, edit the "User-Specified Configuration Parameters" section of `.envrc`.
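For orientation, here is a hypothetical minimal `.envrc` in the spirit of the template; the actual contents of `envrc.template` may differ, and the values below are illustrative. `layout python3` and `PATH_add` are standard `direnv` stdlib functions.

```shell
# Hypothetical .envrc sketch -- see template-docs/extras/envrc.template
# for the real template shipped with this project.

# --- User-Specified Configuration Parameters (edit as needed) ---

# Create/activate a project-local Python virtual environment
layout python3

# Export the absolute path to the data directory
export DATA_DIR=$PWD/data

# Put project scripts on PATH
PATH_add bin
```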
   - `autoenv`-based setup

     1. Create a Python virtual environment.

        ```shell
        $ python3 -m venv .venv
        ```

     2. Copy `template-docs/extras/env.template` to `.env` in the project root directory.

     3. If needed, edit the "User-Specified Configuration Parameters" section of `.env`.
2. Install required Python packages.

   1. If using cloud-based storage for DVC, modify the `dvc` line in `requirements.txt` to include the extra packages required to support the cloud-based storage.

      ```
      # DVC with S3 for remote storage
      dvc[s3]

      # DVC with Azure for remote storage
      dvc[azure]
      ```

   2. Use `pip` to install Python packages.

      ```shell
      $ pip install -r requirements.txt
      ```
3. (OPTIONAL) Set up the Julia environment.

   ```shell
   $ julia

   julia> ]

   (...) pkg> instantiate
   ```
4. (OPTIONAL) Set up DVC.

   1. Initialize DVC.

      ```shell
      $ dvc init
      ```

   2. Stop tracking the `data` directory with `git`.

      ```shell
      $ git rm -r --cached 'data'
      $ git commit -m "Stop tracking 'data' directory"
      $ rm data/.git-keep-dir
      ```
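Once DVC is initialized, datasets under `data` can be tracked with DVC and, optionally, pushed to a remote. A sketch of the typical workflow follows; the dataset filename and S3 bucket name are placeholders, not part of the template.

```shell
# Track a dataset with DVC (filename is a placeholder). This writes a small
# data/raw-measurements.csv.dvc pointer file and adds the data file itself
# to .gitignore.
$ dvc add data/raw-measurements.csv

# Commit the pointer file (not the data) to git
$ git add data/raw-measurements.csv.dvc data/.gitignore
$ git commit -m "Track raw-measurements.csv with DVC"

# Optional: configure a default remote (bucket name is a placeholder)
# and push the data to it
$ dvc remote add -d storage s3://my-bucket/dvc-store
$ dvc push
```

With this setup, `git` records only the lightweight `.dvc` pointer files, while DVC stores and versions the data itself.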
5. Rename all of the template files so that the `template` suffix is removed (overwriting the original `README.md` and `LICENSE` files), and replace all template parameters with package-appropriate values.

6. Clean up the project.

   - If Julia is not required for the project, remove `Project.toml` from the project.
## 2.2. Conventions

- Jupyter notebooks in the `research` directory should be named using the following convention: `YYYY-MM-DD-AUTHOR_INITIALS-BRIEF_DESCRIPTION.ipynb`.

  - Example: `2019-01-17-KC-information_theory_analysis.ipynb`

- Depending on the nature of the project, it may be useful to organize notebooks into sub-directories (e.g., by team member, by sub-project).
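The naming convention above is easy to enforce programmatically. A small illustrative helper (not part of the template; the initials-length rule of 2-4 uppercase letters is an assumption) for building and checking conforming filenames:

```python
# Illustrative helper for the YYYY-MM-DD-INITIALS-description.ipynb
# naming convention. Not part of the template.
import datetime
import re
from typing import Optional

# Assumes initials are 2-4 uppercase letters and descriptions use
# letters, digits, and underscores.
NOTEBOOK_NAME_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2}-[A-Z]{2,4}-[A-Za-z0-9_]+\.ipynb$"
)

def notebook_name(initials: str, description: str,
                  date: Optional[datetime.date] = None) -> str:
    """Build a notebook filename following the convention (default: today)."""
    date = date or datetime.date.today()
    return f"{date.isoformat()}-{initials}-{description}.ipynb"

def follows_convention(filename: str) -> bool:
    """Return True if `filename` matches the naming convention."""
    return NOTEBOOK_NAME_RE.fullmatch(filename) is not None
```

A check like `follows_convention` could be wired into a pre-commit hook to keep the `research` directory consistent.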
## 2.3. Environment

If `direnv` or `autoenv` is enabled, the following environment variables are automatically set.

- `DATA_DIR`: absolute path to the `data` directory
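Notebooks and scripts can use `DATA_DIR` to locate data without hard-coded paths. A sketch (not part of the template; the fallback to `./data` and the example filename are assumptions):

```python
# Illustrative use of the DATA_DIR environment variable set by
# direnv/autoenv; falls back to ./data when the variable is unset.
import os
from pathlib import Path

def data_dir() -> Path:
    """Return the project data directory as an absolute path."""
    return Path(os.environ.get("DATA_DIR", "data")).resolve()

# Hypothetical example: build a path to a file inside the data directory
dataset_path = data_dir() / "raw" / "measurements.csv"
```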
## 2.4. Using JupyterLab

- Launch JupyterLab.

  ```shell
  $ jupyter-lab
  ```

- Use the GUI to create Jupyter notebooks, edit and run Jupyter notebooks, manage files in the file system, etc.
## References

- J. Whitmore. "Jupyter Notebook Best Practices for Data Science" (2016/09).