In this project, I use AWS SageMaker to fine-tune a pretrained model that performs image classification, using SageMaker profiling, the debugger, hyperparameter tuning, and other good ML engineering practices. This was done on a dog breed dataset.
The provided dataset is the dog breed classification dataset, which can be found in the classroom. The project is designed to be dataset independent, so if there is a dataset that is more interesting or relevant to your work, you are welcome to use it to complete the project.
- The pre-trained model I chose for this project is Inception v3, an image recognition model that has been shown to attain greater than 78.1% accuracy on the ImageNet dataset. The model is the culmination of many ideas developed by multiple researchers over the years and is based on the original paper "Rethinking the Inception Architecture for Computer Vision" by Szegedy et al.
- The model itself is made up of symmetric and asymmetric building blocks, including convolutions, average pooling, max pooling, concatenations, dropout, and fully connected layers. Batch normalization is used extensively throughout the model and applied to activation inputs. Loss is computed using softmax. A sketch of how the pretrained model can be adapted is shown after this list.
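As a rough illustration of how the pretrained network is adapted for this task, the sketch below loads torchvision's Inception v3, freezes the backbone weights, and replaces the classifier head. The `num_classes=133` default and the choice to freeze all backbone parameters are assumptions for illustration, not taken verbatim from the training script.

```python
import torch.nn as nn
from torchvision import models


def net(num_classes=133):
    # Load Inception v3 pretrained on ImageNet and freeze its weights.
    model = models.inception_v3(pretrained=True, aux_logits=True)
    for param in model.parameters():
        param.requires_grad = False

    # Replace the final fully connected layer with a new head for the dog breeds.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    # The auxiliary classifier also needs a matching head if it is used during training.
    model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, num_classes)
    return model
```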
A high-level diagram of the model is shown in the following screenshot:
The following hyperparameters were selected for tuning (a tuner sketch follows the list):

- Learning rate - the learning rate defines how fast the model trains. A large learning rate lets the model learn faster; with a small learning rate the model takes longer to learn but can be more accurate. The range is from 0.001 to 0.1.
- Batch size - the batch size is the number of examples from the training dataset used in the estimate of the error gradient, and it controls the accuracy of that estimate when training neural networks. The batch size is chosen between two values: 64 and 128.
- Epochs - the number of epochs is the number of times that the learning algorithm works through the entire training dataset. The number of epochs is chosen between two values: 2 and 5.
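A minimal sketch of how these ranges might be passed to a SageMaker `HyperparameterTuner` is shown below. The `hpo.py` entry point, instance type, metric regex, S3 path, and job counts are assumptions for illustration only.

```python
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import (
    HyperparameterTuner,
    ContinuousParameter,
    CategoricalParameter,
)

role = sagemaker.get_execution_role()

# Training script name, instance type, and framework version are assumptions.
estimator = PyTorch(
    entry_point="hpo.py",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="1.8",
    py_version="py36",
)

# Ranges matching the list above: lr in [0.001, 0.1], batch size in {64, 128}, epochs in {2, 5}.
hyperparameter_ranges = {
    "lr": ContinuousParameter(0.001, 0.1),
    "batch-size": CategoricalParameter([64, 128]),
    "epochs": CategoricalParameter([2, 5]),
}

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="average test loss",
    objective_type="Minimize",
    metric_definitions=[
        {"Name": "average test loss", "Regex": "Test set: Average loss: ([0-9\\.]+)"}
    ],
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=4,
    max_parallel_jobs=2,
)

# "s3://<bucket>/dogImages" is a placeholder for the training data location.
tuner.fit({"training": "s3://<bucket>/dogImages"})
```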
The best hyperparameters selected were: {'batch-size': '128', 'lr': '0.003484069065132129', 'epochs': '2'}
Two plots show how the loss depends on the step: the first one shows `train_loss`/steps and the second one shows `test_loss`/steps.
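These loss curves can be pulled from the Debugger artifacts roughly as follows; this is a sketch, and the tensor name `CrossEntropyLoss_output_0` is an assumption that should be checked against `trial.tensor_names()`.

```python
import matplotlib.pyplot as plt
from smdebug import modes
from smdebug.trials import create_trial

# Load the tensors saved by the Debugger hook for the finished training job.
trial = create_trial(estimator.latest_job_debugger_artifacts_path())

# Assumed tensor name; list what was actually saved with trial.tensor_names(collection="losses").
loss_name = "CrossEntropyLoss_output_0"


def plot_loss(mode, label):
    tensor = trial.tensor(loss_name)
    steps = tensor.steps(mode=mode)
    values = [tensor.value(step, mode=mode) for step in steps]
    plt.plot(steps, values, label=label)


plot_loss(modes.TRAIN, "train_loss")
plot_loss(modes.EVAL, "test_loss")
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.show()
```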
As we can see, there is some anomalous behaviour in the debugging output:
- In the `train_loss`/steps plot, the loss decreases as the steps increase, and the graph is smooth.
- In the `test_loss`/steps plot, the loss does not clearly decrease as the steps increase, and the graph is not smooth.
I also noticed that:
- No rules were triggered during the process
- The average step duration was 13.1s
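For context, the sketch below shows how Debugger rules, the profiler, and the hook configuration might be attached to the estimator. The specific rules and intervals here are illustrative assumptions, not the exact configuration used in this project.

```python
from sagemaker.debugger import (
    Rule,
    ProfilerRule,
    rule_configs,
    DebuggerHookConfig,
    ProfilerConfig,
    FrameworkProfile,
)

# Assumed rule selection; none of these were triggered in the run described above.
rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.poor_weight_initialization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]

# Profiler settings (intervals are assumptions).
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(num_steps=10),
)

# How often the hook saves tensors during training and evaluation (assumed values).
hook_config = DebuggerHookConfig(
    hook_parameters={"train.save_interval": "100", "eval.save_interval": "10"}
)

# These objects are passed to the PyTorch estimator via the
# rules=, profiler_config=, and debugger_hook_config= arguments.
```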
Here are some ways that may help fix the anomalous behaviour:
- Adding more hyperparameters to tune.
- Increasing the hyperparameter ranges for HPO tuning.
- Increasing `max_jobs` for HPO tuning.
- Adding more fully connected layers to the pretrained model.
The model is deployed using the `inference.py` script.
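A minimal deployment sketch, assuming the trained estimator's model artifact and the execution role defined earlier, might look like this (the framework version and instance type are assumptions):

```python
from sagemaker.pytorch import PyTorchModel

# Wrap the trained model artifact with the inference.py entry point.
pytorch_model = PyTorchModel(
    model_data=estimator.model_data,  # S3 path of the trained model artifact
    role=role,                        # SageMaker execution role
    entry_point="inference.py",
    framework_version="1.8",
    py_version="py36",
)

# Create a real-time endpoint and get a predictor handle for it.
predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
)
```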
- The dog images I use must be downloaded from here.
- The test images I use are stored in the `dogImages/test/` folder.
- Steps to predict on the endpoint:
    - Store the image path in `image_path`.
    - Prepare the image and store it as `payload`:

      ```python
      # `object` is assumed to be a boto3 S3 Object pointing at image_path
      response = object.get()
      payload = response['Body'].read()  # raw JPEG bytes to send to the endpoint
      ```

    - Run the prediction:

      ```python
      import json
      import numpy as np

      response = predictor.predict(payload, initial_args={"ContentType": "image/jpeg"})
      response = json.loads(response.decode())             # model output as a list of logits
      predicted_dog_breed_idx = np.argmax(response, 1)[0]  # index of the most likely breed
      ```
In `train_and_deploy.ipynb` I run 4 test predictions, and the predictions are pretty accurate.