## Authors

- Kevin T. Chu <[email protected]>

## Contents

- 1.2. Directory Structure
- 2.1. Setting Up
- 2.2. Conventions
- 2.3. Environment
- 2.4. Using JupyterLab
This project template is intended to support data science projects that utilize Jupyter notebooks for experimentation and reporting. The design of the template is based on the blog article "Jupyter Notebook Best Practices for Data Science" by Jonathan Whitmore.
Features include:

- compatible with standard version control software;
- automatically saves HTML and `*.py` versions of Jupyter notebooks to facilitate review of both (1) data science results and (2) implementation code;
- supports common data science workflows (for both individuals and teams); and
- encourages separation and decoupling of datasets, R&D work (i.e., Jupyter notebooks), deliverables (i.e., reports), and Python functions and modules refactored from R&D code.
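The HTML/`*.py` auto-save feature is typically implemented with a Jupyter post-save hook (the approach described in Whitmore's article). Below is a minimal sketch of that mechanism, assuming the hook is registered in `jupyter_notebook_config.py`; the hook actually shipped with this template may differ.

```python
# Sketch of a Jupyter post-save hook that exports each saved notebook to
# HTML and to a *.py script. Place in ~/.jupyter/jupyter_notebook_config.py.
import os
import subprocess

def post_save(model, os_path, contents_manager):
    """After each save, export the notebook to HTML and a *.py script."""
    if model['type'] != 'notebook':
        return  # only convert notebooks, not other file types
    directory, filename = os.path.split(os_path)
    # nbconvert writes <name>.py and <name>.html next to the notebook
    subprocess.check_call(
        ['jupyter', 'nbconvert', '--to', 'script', filename], cwd=directory)
    subprocess.check_call(
        ['jupyter', 'nbconvert', '--to', 'html', filename], cwd=directory)

c.FileContentsManager.post_save_hook = post_save
```

Because the exports run on every save, the `*.py` and HTML copies stay in sync with the notebook and can be reviewed with ordinary diff tools.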
- Python (>=3.7)
- Miniconda
  - Required for MLflow Projects and MLflow Models
- Julia (>=1.6)
- `direnv`
## 1.2. Directory Structure

```
README.md
LICENSE
README.md.template
RELEASE-NOTES.md.template
LICENSE.template
requirements.txt
Project.toml
Manifest.toml
bin/
data/
lib/
reports/
research/
template-docs/
template-docs/extras/
```
- `README.md`: this file (same as `README-Data-Science-Project-Template.md` in the `template-docs` directory)

- `LICENSE`: license for Data Science Project Template (same as `LICENSE-Data-Science-Project-Template.md` in the `template-docs` directory)

- `*.template`: template files for the package

  - Template files are indicated by the `template` suffix and contain template parameters denoted by double braces (e.g., `{{ PKG_NAME }}`). Template files are intended to simplify the setup of the package. When used, they should be renamed to remove the `template` suffix.

- `requirements.txt`: `pip` requirements file containing Python packages for the project (e.g., data science, testing, and code quality packages)

- `Project.toml`: Julia package management file containing Julia package dependencies. It is updated whenever new Julia packages are added via the REPL. This file may be safely removed if Julia is not required for the project.

- `Manifest.toml` (generated by Julia): Julia package management file that Julia uses to maintain a record of the state of the Julia environment. This file should not be edited.

- `bin`: directory where scripts and programs should be placed

- `data`: directory where project data should be placed

  - Recommendation: data placed in the `data` directory should be managed using DVC (or a similar tool) rather than being included in the `git` repository. This is especially important for projects with large datasets or datasets containing sensitive information. For projects with small datasets that do not contain sensitive information, it may be reasonable to have the data in the `data` directory managed directly by `git`.

- `lib`: directory containing source code to support the project (e.g., custom code developed for the project, utility modules, etc.)

- `reports`: directory containing reports (in any format) that summarize research results. When a report is prepared as a Jupyter notebook, the notebook should be polished, contain final analysis results (not preliminary results), and is usually the work product of the entire data science team.

- `research`: directory containing Jupyter notebooks used for research-phase work (e.g., exploration and development of ideas, DS/ML experiments). Each Jupyter notebook in this directory should (1) be dated and (2) have the initials of the person who last modified it. When an existing notebook is modified, it should be saved to a new file with a name based on the modification date and initialed by the person who modified it.

- `template-docs`: directory containing documentation for this package template

- `template-docs/extras`: directory containing example and template files
## 2.1. Setting Up

1. Set up the environment for the project using only one of the following approaches.

   - `direnv`-based setup

     1. Copy `template-docs/extras/envrc.template` to `.envrc` in the project root directory.

     2. Grant `direnv` permission to execute the `.envrc` file.

        ```shell
        $ direnv allow
        ```

     3. If needed, edit the "User-Specified Configuration Parameters" section of `.envrc`.
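For orientation, here is a hypothetical minimal `.envrc` in the spirit of the template; the actual contents of `envrc.template` may differ, and the values below are illustrative. `layout python3` and `PATH_add` are standard `direnv` stdlib functions.

```shell
# Hypothetical .envrc sketch -- see template-docs/extras/envrc.template
# for the real template shipped with this project.

# --- User-Specified Configuration Parameters (edit as needed) ---

# Create/activate a project-local Python virtual environment
layout python3

# Export the absolute path to the data directory
export DATA_DIR=$PWD/data

# Put project scripts on PATH
PATH_add bin
```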
   - `autoenv`-based setup

     1. Create a Python virtual environment.

        ```shell
        $ python3 -m venv .venv
        ```

     2. Copy `template-docs/extras/env.template` to `.env` in the project root directory.

     3. If needed, edit the "User-Specified Configuration Parameters" section of `.env`.
2. Install required Python packages.

   1. If using cloud-based storage for DVC, modify the `dvc` line in `requirements.txt` to include the extra packages required to support the cloud-based storage.

      ```
      # DVC with S3 for remote storage
      dvc[s3]

      # DVC with Azure for remote storage
      dvc[azure]
      ```

   2. Use `pip` to install Python packages.

      ```shell
      $ pip install -r requirements.txt
      ```
3. (OPTIONAL) Set up the Julia environment.

   ```shell
   $ julia

   julia> ]

   (...) pkg> instantiate
   ```
4. (OPTIONAL) Set up DVC.

   1. Initialize DVC.

      ```shell
      $ dvc init
      ```

   2. Stop tracking the `data` directory with `git`.

      ```shell
      $ git rm -r --cached 'data'
      $ git commit -m "Stop tracking 'data' directory"
      $ rm data/.git-keep-dir
      ```
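Once DVC is initialized, datasets under `data` can be tracked with DVC and, optionally, pushed to a remote. A sketch of the typical workflow follows; the dataset filename and S3 bucket name are placeholders, not part of the template.

```shell
# Track a dataset with DVC (filename is a placeholder). This writes a small
# data/raw-measurements.csv.dvc pointer file and adds the data file itself
# to .gitignore.
$ dvc add data/raw-measurements.csv

# Commit the pointer file (not the data) to git
$ git add data/raw-measurements.csv.dvc data/.gitignore
$ git commit -m "Track raw-measurements.csv with DVC"

# Optional: configure a default remote (bucket name is a placeholder)
# and push the data to it
$ dvc remote add -d storage s3://my-bucket/dvc-store
$ dvc push
```

With this setup, `git` records only the lightweight `.dvc` pointer files, while DVC stores and versions the data itself.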
5. Rename all of the template files so that the `template` suffix is removed (overwriting the original `README.md` and `LICENSE` files), and replace all template parameters with package-appropriate values.

6. Clean up the project.

   - If Julia is not required for the project, remove `Project.toml` from the project.
## 2.2. Conventions

- Jupyter notebooks in the `research` directory should be named using the following convention: `YYYY-MM-DD-AUTHOR_INITIALS-BRIEF_DESCRIPTION.ipynb`.

  - Example: `2019-01-17-KC-information_theory_analysis.ipynb`

- Depending on the nature of the project, it may be useful to organize notebooks into sub-directories (e.g., by team member, by sub-project).
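The naming convention above is easy to enforce programmatically. A small illustrative helper (not part of the template; the initials-length rule of 2-4 uppercase letters is an assumption) for building and checking conforming filenames:

```python
# Illustrative helper for the YYYY-MM-DD-INITIALS-description.ipynb
# naming convention. Not part of the template.
import datetime
import re
from typing import Optional

# Assumes initials are 2-4 uppercase letters and descriptions use
# letters, digits, and underscores.
NOTEBOOK_NAME_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2}-[A-Z]{2,4}-[A-Za-z0-9_]+\.ipynb$"
)

def notebook_name(initials: str, description: str,
                  date: Optional[datetime.date] = None) -> str:
    """Build a notebook filename following the convention (default: today)."""
    date = date or datetime.date.today()
    return f"{date.isoformat()}-{initials}-{description}.ipynb"

def follows_convention(filename: str) -> bool:
    """Return True if `filename` matches the naming convention."""
    return NOTEBOOK_NAME_RE.fullmatch(filename) is not None
```

A check like `follows_convention` could be wired into a pre-commit hook to keep the `research` directory consistent.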
## 2.3. Environment

If `direnv` or `autoenv` is enabled, the following environment variables are automatically set.

- `DATA_DIR`: absolute path to the `data` directory
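Notebooks and scripts can use `DATA_DIR` to locate data without hard-coded paths. A sketch (not part of the template; the fallback to `./data` and the example filename are assumptions):

```python
# Illustrative use of the DATA_DIR environment variable set by
# direnv/autoenv; falls back to ./data when the variable is unset.
import os
from pathlib import Path

def data_dir() -> Path:
    """Return the project data directory as an absolute path."""
    return Path(os.environ.get("DATA_DIR", "data")).resolve()

# Hypothetical example: build a path to a file inside the data directory
dataset_path = data_dir() / "raw" / "measurements.csv"
```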
## 2.4. Using JupyterLab

- Launch JupyterLab.

  ```shell
  $ jupyter-lab
  ```

- Use the GUI to create Jupyter notebooks, edit and run Jupyter notebooks, manage files in the file system, etc.
## References

- J. Whitmore. "Jupyter Notebook Best Practices for Data Science" (2016/09).