Interactive and visual workflow of active learning

Supervised machine learning, while powerful, needs labeled data to be effective. Active learning reduces the number of labeled examples needed to train a model, saving time and money while obtaining comparable performance to models trained with much more data.

This application is meant to serve as a complement to the prototype for the report we released on Learning with Limited Labeled Data. To build an intuition for why active learning works, please see the report or our blogpost - A guide to learning with limited labeled data

What is Active Learning?

Active learning is an iterative process that relies on human input to build up a smartly labeled dataset. The process typically looks like this:

Begin with a small set of labeled data (that is all we have)
Train a model
Use the trained model to run inference on a much larger pool of unlabeled data available to us
Use a selection strategy to pick out points that are difficult for the machine to predict correctly
Request labels from human for those difficult points
Add these examples back to the labeled dataset and retrain on the expanded dataset
Repeat the training and labeling process until you have achieved your desired model performance

We use the MNIST dataset to illustrate the active learning workflow. To start, we train a convolutional neural network using only a few labeled datapoints. You can kick off training by selecting appropriate hyperparameters and clicking on the "TRAIN" button. Once the model has completed training, you can visualize the embeddings of the convolutional neural network by using UMAP to project the high dimensional representation down to 2D.

Structure

.
├── activelearning        # active learning scripts
├── apps                  # Dash application
├── assets                # Dash related stylesheets
├── cml                   # contains scripts that facilitate the project launch on CML
├── data                  # contains MNIST data.
├── docs                  # images/ snapshots for README
├── experiments           # contains a script that demonstrates the use of AL functions

The assets, data, and docs directories are unimportant and can be ignored.

`activelearning`

activelearning
├── data.py               # loads train and validation data
├── dataset.py            # MNIST dataset handler
├── model.py              # Neural network architecture
├── sample.py             # defines random, entropy and entropy dropout selection strategies
└── train.py              # helper functions for model training, generating embeddings and predictions, computing metrics, checkpointing, saving results in text file and so on

`apps`

apps
├── app.py
├── demo.py 
├── demo_description.md
├── demo_intro.md

These scripts leverage the stylesheets from the assets folder to provide a UI for:

training a model based on the user selected hyperparameters,
visualize the trained embeddings using UMAP
visualize model performance metrics on trains and validation sets
and request labels for 10 selected datapoints from the user to retrain the model
the final model is saved in the "/models" directory for future use

`experiments`

experiments
├── main.py               # code to experiment and test data and model functions w/o UI

Note: You still need to go through the installation process below to be able to run this code

`data`

data
├── MNIST

The project uses the MNIST dataset to illustrate the AL workflow. It uses only the 10k testing dataset as the entire dataset.

First set aside 2,000 datapoints for validation/testing
Out of the remaining 8,000 datapoints, allow the user to select 100, 500 or 1,000 as initial labeled examples while the rest are unlabeled.
The user can then provide labels for the 10 shortlisted examples (based on the selection strategy) from the remaining training examples and continue to train a model with the additional data points.
In the long run the model performance differs based on the selection strategy employed.

Deploying on CML

There are three ways to launch this project on CML:

From Prototype Catalog - Navigate to the Prototype Catalog on a CML workspace, select the "Active Learning" tile, click "Launch as Project", click "Configure Project"
As ML Prototype - In a CML workspace, click "New Project", add a Project Name, select "ML Prototype" as the Initial Setup option, copy in the repo URL, click "Create Project", click "Configure Project"
Manual Setup - In a CML workspace, click "New Project", add a Project Name, select "Git" as the Initial Setup option, copy in the repo URL, click "Create Project". Then, follow the installation instructions below.

Installation

The code and applications were developed using Python 3.6.9, and are likely also to function with more recent versions of Python.

To install dependencies, first create and activate a new virtual environment through your preferred means, then pip install from the requirements file. We recommend:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

In CML or CDSW, no virtual env is necessary. Instead, inside a Python 3 session (with at least 4 vCPU / 6 GiB Memory), simply run

!pip3 install -r requirements.txt     # notice `pip3`, not `pip`

Starting the application as a normal python session

First, specify the port in apps/app.py by uncommenting the lines for normal python session

# for normal python session uncomment below
server = app.server

# Running server
if __name__ == '__main__':
  # for running on CDSW / CML uncomment below
  # app.run_server(port=os.getenv("CDSW_APP_PORT"))
  # OR 
  # for normal python session uncomment below
  app.run_server(debug=True)

Run
```
python apps/app.py
```

Starting the application within CML or CDSW

First, specify the port in apps/app.py by commenting the lines for normal python session

# for normal python session uncomment below
# server = app.server

# Running server
if __name__ == '__main__':
  # for running on CDSW / CML uncomment below
  app.run_server(port=os.getenv("CDSW_APP_PORT"))
  # OR 
  # for normal python session uncomment below
  # app.run_server(debug=True)

Second, set the subdomain in CDSW's Applications tab.
Third, enter apps/app.py in the Script field in CDSW's Applications tab.
Fourth, start the application within CDSW.
Finally, access demo at subdomain.ffl-4.cdsw.eng.cloudera.com

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
activelearning		activelearning
apps		apps
assets		assets
cml		cml
data/MNIST/MNIST		data/MNIST/MNIST
docs/images		docs/images
experiments		experiments
.gitignore		.gitignore
.project-metadata.yaml		.project-metadata.yaml
LICENSE.txt		LICENSE.txt
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Interactive and visual workflow of active learning

What is Active Learning?

Structure

`activelearning`

`apps`

`experiments`

`data`

Deploying on CML

Installation

Starting the application as a normal python session

Starting the application within CML or CDSW

About

Releases

Packages

Contributors 6

Languages

License

cloudera/CML_AMP_Active_Learning

Folders and files

Latest commit

History

Repository files navigation

Interactive and visual workflow of active learning

What is Active Learning?

Structure

activelearning

apps

experiments

data

Deploying on CML

Installation

Starting the application as a normal python session

Starting the application within CML or CDSW

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 6

Languages

`activelearning`

`apps`

`experiments`

`data`

Packages