- MosaicML's ResNet-50 Recipes Docker Image
- Tag: `mosaicml/pytorch_vision:resnet50_recipes`
- The image comes pre-configured with the following dependencies:
  - Mosaic ResNet training recipes
  - Training entrypoint: `train.py`
  - Composer Version: 0.7.1
  - PyTorch Version: 1.11.0
  - CUDA Version: 11.3
  - Python Version: 3.9
  - Ubuntu Version: 20.04
Prerequisites:
- Docker or your container orchestration framework of choice
- ImageNet Dataset
- System with NVIDIA GPUs
As described in our blog post:
We actually cooked up three Mosaic ResNet recipes – which we call Mild, Medium, and Hot – to suit a range of requirements. The Mild recipe is for shorter training runs, the Medium recipe is for longer training runs, and the Hot recipe is for the very longest training runs that maximize accuracy.
To reproduce a specific run, two pieces of information are required:
- `recipe_yaml_path`: Path to the configuration file specifying the model and training parameters unique to each recipe.
- `scale_schedule_ratio`: Factor which scales the duration of a particular run.

Note: The `scale_schedule_ratio` is a scaling factor for `max_duration`; each recipe sets a default of `max_duration = 90ep` (epochs). Thus a run with `scale_schedule_ratio = 0.3` will run for 90 * 0.3 = 27 epochs.
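If you want to double-check that arithmetic for another ratio, a trivial one-liner (illustrative only, using the recipes' default 90-epoch `max_duration`) is:

```bash
# Effective training length = default max_duration (90 epochs) * scale_schedule_ratio
python -c "print(90 * 0.3)"   # prints 27.0
```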
First, choose a recipe you would like to work with: [`Mild`, `Medium`, `Hot`]. This will determine which configuration file, `recipe_yaml_path`, you will need to specify.
Next, determine the proper `scale_schedule_ratio` to specify to reproduce the desired run by using MosaicML's Explorer. Explorer enables users to identify the most cost-effective way to run training workloads across clouds and on different types of hardware backends for a variety of models and datasets. For this tutorial, we will focus on the Mosaic ResNet run data.
The table below provides the `recipe_yaml_path` for the selected recipe and a link to the corresponding Explorer page, which can be used to select a specific run and obtain the corresponding value for `scale_schedule_ratio`:
| Recipe | `recipe_yaml_path` | Explorer link |
|---|---|---|
| Mild | `recipes/resnet50_mild.yaml` | Mosaic ResNet Mild |
| Medium | `recipes/resnet50_medium.yaml` | Mosaic ResNet Medium |
| Hot | `recipes/resnet50_hot.yaml` | Mosaic ResNet Hot |
You can also compare all three recipes here.
In this tutorial we will be using the `Mild` recipe and reproduce this run, which results in a Top-1 accuracy of 76.19%. Thus, we see from the table above that `recipe_yaml_path = recipes/resnet50_mild.yaml` and from Explorer that `scale_schedule_ratio = 0.32` for the desired run.
Now that we've selected a recipe and determined the `recipe_yaml_path` and `scale_schedule_ratio` to specify, let's kick off a training run.
- Launch a Docker container using the `mosaicml/pytorch_vision:resnet50_recipes` image on your training system.

  ```bash
  docker pull mosaicml/pytorch_vision:resnet50_recipes
  docker run -it mosaicml/pytorch_vision:resnet50_recipes
  ```

  Note: The `mosaicml/pytorch_vision:resnet50_recipes` Docker image can also be used with your container orchestration framework of choice.
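  If your training system has NVIDIA GPUs and a local copy of ImageNet, you will likely also want to expose the GPUs and bind-mount the dataset directory when starting the container. A minimal sketch (assuming the NVIDIA Container Toolkit is installed and the dataset lives at `/tmp/ImageNet` on the host):

  ```bash
  # Illustrative only: expose all GPUs and bind-mount the host's ImageNet
  # directory so the paths used later in this tutorial resolve inside the
  # container. Adjust the host path to wherever your dataset is stored.
  docker run -it --gpus all \
      -v /tmp/ImageNet:/tmp/ImageNet \
      mosaicml/pytorch_vision:resnet50_recipes
  ```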
- Download the ImageNet dataset from http://www.image-net.org/.
- Create the dataset folder and extract the training and validation images to the appropriate subfolders. The following script can be used to facilitate this process. Be sure to note the directory path where you extracted the dataset.

  Note: This tutorial assumes that the dataset is installed to the `/tmp/ImageNet` path.
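  A rough sketch of what the extraction involves (illustrative only, not the referenced helper script; it assumes the standard `ILSVRC2012_img_train.tar` and `ILSVRC2012_img_val.tar` archives are in the current directory):

  ```bash
  # Create the expected directory layout.
  mkdir -p /tmp/ImageNet/train /tmp/ImageNet/val

  # The training archive contains one tar per class; extract each class
  # archive into its own subdirectory.
  tar -xf ILSVRC2012_img_train.tar -C /tmp/ImageNet/train
  for f in /tmp/ImageNet/train/*.tar; do
      d="${f%.tar}"
      mkdir -p "$d" && tar -xf "$f" -C "$d" && rm "$f"
  done

  # The validation archive extracts to a flat directory of JPEGs; these
  # still need to be sorted into per-class subfolders afterwards (a
  # commonly used helper such as valprep.sh does this).
  tar -xf ILSVRC2012_img_val.tar -C /tmp/ImageNet/val
  ```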
- The `Mild` and `Medium` recipes require converting the ImageNet dataset to FFCV format. This conversion only needs to be performed once; the converted files can be stashed away for reuse with subsequent runs. The `Hot` recipe uses the original ImageNet data.
  - Download the helper conversion script:

    ```bash
    wget -P /tmp https://raw.githubusercontent.com/mosaicml/composer/v0.7.1/scripts/ffcv/create_ffcv_datasets.py
    ```
  - Convert the training and validation datasets.

    ```bash
    python /tmp/create_ffcv_datasets.py --dataset imagenet --split train --datadir /tmp/ImageNet/
    python /tmp/create_ffcv_datasets.py --dataset imagenet --split val --datadir /tmp/ImageNet/
    ```

    Note: The helper script outputs the FFCV-formatted dataset files to `/tmp/imagenet_train.ffcv` and `/tmp/imagenet_val.ffcv` for the training and validation data, respectively.
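    A quick way to confirm the conversion produced both files (illustrative):

    ```bash
    # Both FFCV files should exist and be non-trivially sized.
    ls -lh /tmp/imagenet_train.ffcv /tmp/imagenet_val.ffcv
    ```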
- Launch the training run.
  ```bash
  composer -n {num_gpus} train.py -f {recipe_yaml_path} --scale_schedule_ratio {scale_schedule_ratio}
  ```
  Replace `num_gpus`, `recipe_yaml_path`, and `scale_schedule_ratio` with the total number of GPUs, the recipe configuration file, and the scale schedule ratio we determined in the previous section for the desired run, respectively.

  Note: The `Mild` and `Medium` recipes assume the training and validation data are stored at the `/tmp/imagenet_train.ffcv` and `/tmp/imagenet_val.ffcv` paths, while the `Hot` recipe assumes the original ImageNet dataset is stored at the `/tmp/ImageNet` path. The default dataset paths can be overridden; please run `composer -n {num_gpus} train.py -f {recipe_yaml_path} --help` for more detailed recipe-specific configuration information.

  Example:

  ```bash
  composer -n 8 train.py -f recipes/resnet50_mild.yaml --scale_schedule_ratio 0.32
  ```

  The example above will train on 8 GPUs using the `Mild` recipe with a scale schedule ratio of 0.32. You can compare your run's final Top-1 accuracy and time to train to our result.