Image Caption Generator

A neural network to generate captions for an image using CNN and RNN with BEAM Search.

Table of Contents

  1. Requirements
  2. Training parameters and results
  3. Generated Captions on Test Images
  4. Procedure to Train Model
  5. Procedure to Test on new images
  6. Configurations (config.py)
  7. Frequently encountered problems
  8. References

1. Requirements

Recommended system requirements for training the model:

  • A good CPU and a GPU with at least 8GB memory
  • At least 8GB of RAM
  • Active internet connection so that Keras can download the InceptionV3/VGG16 model weights

Required Python libraries, along with the version numbers used while building and testing this project:

  • Python - 3.6.7
  • Numpy - 1.16.4
  • Tensorflow - 1.13.1
  • Keras - 2.2.4
  • nltk - 3.2.5
  • PIL - 4.3.0
  • Matplotlib - 3.0.3
  • tqdm - 4.28.1
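
To set up a matching environment, the versions above can be pinned with pip (e.g. `pip install tensorflow==1.13.1 keras==2.2.4`; note that PIL 4.3.0 is distributed as the Pillow package). A quick sanity check, a minimal sketch assuming the packages above are installed, that prints each installed version:

```python
# Print the installed versions of the libraries listed above.
import numpy, tensorflow, keras, nltk, PIL, matplotlib, tqdm

for mod in (numpy, tensorflow, keras, nltk, PIL, matplotlib, tqdm):
    print(mod.__name__, getattr(mod, "__version__", "unknown"))
```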

Flickr8k Dataset: Dataset Request Form

If the link above is unavailable, you can try these direct download links:

Link Credit: Jason Brownlee

Important: After downloading the dataset, put the required files in the train_val_data folder.

2. Training parameters and results

NOTE
  • batch_size=64 took ~14GB of GPU memory with InceptionV3 + AlternativeRNN and VGG16 + AlternativeRNN
  • batch_size=64 took ~8GB of GPU memory with InceptionV3 + RNN and VGG16 + RNN
  • If you're low on memory, use Google Colab or reduce the batch size
  • With BEAM Search, loss and val_loss are the same as with argmax, since the model itself is unchanged; only the decoding differs (see the sketch after the table below)
| Model & Config | Argmax | BEAM Search |
| :--- | :--- | :--- |
| **InceptionV3 + AlternativeRNN**<br>• Epochs = 20<br>• Batch Size = 64<br>• Optimizer = Adam | **Crossentropy loss** (lower is better)<br>• loss (train): 2.4050<br>• val_loss: 3.0527<br><br>**BLEU scores on validation data** (higher is better)<br>• BLEU-1: 0.596818<br>• BLEU-2: 0.356009<br>• BLEU-3: 0.252489<br>• BLEU-4: 0.129536 | **k = 3**<br><br>**BLEU scores on validation data** (higher is better)<br>• BLEU-1: 0.606086<br>• BLEU-2: 0.359171<br>• BLEU-3: 0.249124<br>• BLEU-4: 0.126599 |
| **InceptionV3 + RNN**<br>• Epochs = 11<br>• Batch Size = 64<br>• Optimizer = Adam | **Crossentropy loss** (lower is better)<br>• loss (train): 2.5254<br>• val_loss: 3.1769<br><br>**BLEU scores on validation data** (higher is better)<br>• BLEU-1: 0.601791<br>• BLEU-2: 0.344289<br>• BLEU-3: 0.230025<br>• BLEU-4: 0.108898 | **k = 3**<br><br>**BLEU scores on validation data** (higher is better)<br>• BLEU-1: 0.605097<br>• BLEU-2: 0.356094<br>• BLEU-3: 0.251132<br>• BLEU-4: 0.129900 |
| **VGG16 + AlternativeRNN**<br>• Epochs = 18<br>• Batch Size = 64<br>• Optimizer = Adam | **Crossentropy loss** (lower is better)<br>• loss (train): 2.2880<br>• val_loss: 3.1889<br><br>**BLEU scores on validation data** (higher is better)<br>• BLEU-1: 0.596655<br>• BLEU-2: 0.342127<br>• BLEU-3: 0.229676<br>• BLEU-4: 0.108707 | **k = 3**<br><br>**BLEU scores on validation data** (higher is better)<br>• BLEU-1: 0.593876<br>• BLEU-2: 0.348569<br>• BLEU-3: 0.242063<br>• BLEU-4: 0.123221 |
| **VGG16 + RNN**<br>• Epochs = 7<br>• Batch Size = 64<br>• Optimizer = Adam | **Crossentropy loss** (lower is better)<br>• loss (train): 2.6297<br>• val_loss: 3.3486<br><br>**BLEU scores on validation data** (higher is better)<br>• BLEU-1: 0.557626<br>• BLEU-2: 0.317652<br>• BLEU-3: 0.216636<br>• BLEU-4: 0.105288 | **k = 3**<br><br>**BLEU scores on validation data** (higher is better)<br>• BLEU-1: 0.568993<br>• BLEU-2: 0.326569<br>• BLEU-3: 0.226629<br>• BLEU-4: 0.113102 |
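
For context, BEAM Search keeps the k highest-scoring partial captions at each decoding step instead of committing to the single argmax word. A minimal, self-contained sketch of the idea (the `step_probs` interface and all names here are hypothetical, not this repo's actual API):

```python
import numpy as np

def beam_search_decode(step_probs, start_id, end_id, k=3, max_len=20):
    """Minimal beam search. `step_probs(seq)` must return a 1-D array of
    next-word probabilities for the partial caption `seq` (hypothetical
    interface for illustration only)."""
    beams = [([start_id], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:             # finished caption: keep as-is
                candidates.append((seq, score))
                continue
            probs = step_probs(seq)
            for w in np.argsort(probs)[-k:]:  # k most probable next words
                candidates.append((seq + [int(w)],
                                   score + np.log(probs[w] + 1e-12)))
        # Keep only the k best (partial or finished) captions overall.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0][0]  # best-scoring caption
```

Greedy argmax decoding is the k = 1 special case, which is why a larger k costs roughly k times more forward passes per decoding step.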

3. Generated Captions on Test Images

Model used: VGG16 + AlternativeRNN.

| Image | Caption |
| :--- | :--- |
| Image 1 | BEAM Search, k=3: A man in a red shirt is climbing a rock. |
| Image 2 | BEAM Search, k=3: A man in a wetsuit is riding a wave. |

Photo Credits: Brad Bradmore and Sincerely Media on Unsplash.

4. Procedure to Train Model

  1. Clone the repository to preserve directory structure: `git clone https://github.com/saharshy29/altML.git`
  2. Put the required dataset files in the train_val_data folder (the files are listed in the README there).
  3. Review config.py for paths and other configurations (explained below).
  4. Run train_val.py.

5. Procedure to Test on new images

  1. Clone the repository to preserve directory structure: `git clone https://github.com/saharshy29/altML.git`
  2. Train the model to generate the required files in the model_data folder (steps given above), OR use the previously trained model weights for VGG16 + AlternativeRNN with BEAM Search, k=3.
  3. Put the test images in test_data folder.
  4. Review config.py for paths and other configurations (explained below).
  5. Run test.py.

6. Configurations (config.py)

config

  1. images_path :- Folder path containing the Flickr dataset images
  2. train_data_path :- .txt file path containing image ids for training
  3. val_data_path :- .txt file path containing image ids for validation
  4. captions_path :- .txt file path containing the captions
  5. tokenizer_path :- Path for saving the tokenizer
  6. model_data_path :- Path for saving files related to the model
  7. model_load_path :- Path for loading a trained model
  8. num_of_epochs :- Number of epochs
  9. max_length :- Maximum caption length. This is set manually after training the model and is required by test.py
  10. batch_size :- Batch size for training (larger values consume more GPU and CPU memory)
  11. beam_search_k :- BEAM Search parameter: the number of candidate sequences kept at each decoding step
  12. test_data_path :- Folder path containing images for testing/inference
  13. model_type :- CNN model type to use -> inceptionv3 or vgg16
  14. random_seed :- Random seed for reproducibility of results

rnnConfig

  1. embedding_size :- Embedding size used in the decoder (RNN) model
  2. LSTM_units :- Number of LSTM units in the decoder (RNN) model
  3. dense_units :- Number of Dense units in the decoder (RNN) model
  4. dropout :- Dropout probability used in the Dropout layer of the decoder (RNN) model
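
For orientation, here is a purely illustrative sketch of how these two dictionaries might look in config.py; every path and value below is a placeholder, not the repository's actual defaults:

```python
# Illustrative config.py layout; keys mirror the lists above,
# but all paths and values here are placeholders.
config = {
    'images_path': 'train_val_data/Flicker8k_Dataset/',
    'train_data_path': 'train_val_data/Flickr_8k.trainImages.txt',
    'val_data_path': 'train_val_data/Flickr_8k.devImages.txt',
    'captions_path': 'train_val_data/Flickr8k.token.txt',
    'tokenizer_path': 'model_data/tokenizer.pkl',
    'model_data_path': 'model_data/',
    'model_load_path': 'model_data/model.h5',
    'num_of_epochs': 20,
    'max_length': 40,        # set manually after training; needed by test.py
    'batch_size': 64,
    'beam_search_k': 3,
    'test_data_path': 'test_data/',
    'model_type': 'vgg16',   # 'inceptionv3' or 'vgg16'
    'random_seed': 1,
}

rnnConfig = {
    'embedding_size': 300,   # decoder word-embedding dimension
    'LSTM_units': 256,
    'dense_units': 256,
    'dropout': 0.3,
}
```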

7. Frequently encountered problems

  • Out of memory issue:
    • Try reducing batch_size
  • Results differ every time I run the script:
    • Due to the stochastic nature of these algorithms, results may differ slightly between runs, even though a random seed is set for reproducibility (see the seeding sketch below)
  • Results aren't much better with BEAM Search than with argmax:
    • Try a higher k in BEAM Search via the beam_search_k parameter in config. Note that a higher k may improve results, but it will also increase inference time significantly.
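
As background for the reproducibility point above, seeding in the pinned TensorFlow 1.x stack typically involves three random-number generators. A sketch (the SEED value is a placeholder; even with all three set, GPU/cuDNN nondeterminism can still cause small run-to-run differences):

```python
import random

import numpy as np
import tensorflow as tf

SEED = 42  # placeholder; the repo exposes this via random_seed in config

random.seed(SEED)          # Python's built-in RNG
np.random.seed(SEED)       # NumPy (weight init, data shuffling)
tf.set_random_seed(SEED)   # TensorFlow 1.x graph-level seed
```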

8. References