This neural image captioning system is based on the paper "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" by Xu et al. (ICML 2015). The input is an image and the output is a sentence describing its content. A Faster R-CNN model extracts visual features from the image, and an LSTM recurrent neural network decodes those features into a sentence; a soft attention mechanism is incorporated to improve caption quality. The project is implemented with the TensorFlow library and currently supports training of the RNN part only.
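At every decoding step the soft attention mechanism scores the current LSTM state against each region feature, normalizes the scores with a softmax, and forms a context vector as the weighted sum of the features. The NumPy sketch below illustrates one such step; the projection sizes and variable names are illustrative only and are not the project's actual code, while the 100 x 2048 feature shape matches the region features extracted during dataset preparation.

```python
import numpy as np

def soft_attention(features, hidden, W_f, W_h, w_a):
    """One step of additive (Bahdanau-style) soft attention.

    features: (100, 2048) region features from Faster R-CNN
    hidden:   (H,)        current LSTM hidden state
    W_f:      (2048, A)   projection of the features (illustrative)
    W_h:      (H, A)      projection of the hidden state (illustrative)
    w_a:      (A,)        scoring vector (illustrative)
    """
    # Unnormalized attention scores, one per region proposal.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ w_a      # (100,)
    # Softmax turns the scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                    # (100,)
    # Context vector: attention-weighted sum of the region features.
    context = weights @ features                                # (2048,)
    return context, weights

# Toy example with random values.
rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 2048))
h = rng.standard_normal(512)
ctx, alpha = soft_attention(feats, h,
                            0.01 * rng.standard_normal((2048, 256)),
                            0.01 * rng.standard_normal((512, 256)),
                            0.01 * rng.standard_normal(256))
print(ctx.shape, alpha.shape)  # (2048,) (100,)
```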
- TensorFlow (instructions)
- NumPy (instructions)
- OpenCV (instructions)
- Natural Language Toolkit (NLTK) (instructions)
- Pandas (instructions)
- Matplotlib (instructions)
- Tips:
- Delete all __pycache__ folders under the current directory:
find . -name '__pycache__' -type d -exec rm -rf {} \;
- Dataset Preparation:
- Download the faster_rcnn_resnet50 checkpoint:
cd data
wget http://download.tensorflow.org/models/object_detection/faster_rcnn_resnet50_coco_2018_01_28.tar.gz
tar -xzf faster_rcnn_resnet50_coco_2018_01_28.tar.gz
- Export a frozen graph from the checkpoint:
export PYTHONPATH=$PYTHONPATH:./object_detection/
python ./object_detection/export_inference_graph.py \
--input_type image_tensor \
--pipeline_config_path ../data/faster_rcnn_resnet50_coco_2018_01_28/pipeline.config \
--trained_checkpoint_prefix ../data/faster_rcnn_resnet50_coco_2018_01_28/model.ckpt \
--output_directory ../data/faster_rcnn_resnet50_coco_2018_01_28/exported_graphs
cp ../data/faster_rcnn_resnet50_coco_2018_01_28/exported_graphs/frozen_inference_graph.pb ../data/frozen_faster_rcnn.pb
- Skip this step if you have already downloaded the COCO dataset; otherwise run the following commands to get it:
OUTPUT_DIR="/home/zisang/im2txt"
sh ./dataset/download_mscoco.sh ../data/coco
- Extract features for each region proposal (100 regions x 2048 features per image); a rough sketch of this step follows the commands below.
For COCO, run the following command:
DATASET_DIR="/home/zisang/Documents/code/data/mscoco/raw-data"
OUTPUT_DIR="/home/zisang/im2txt/data/coco"
python ./dataset/build_data.py \
--graph_path="../data/frozen_faster_rcnn.pb" \
--dataset "coco" \
--train_image_dir="${DATASET_DIR}/train2014" \
--val_image_dir="${DATASET_DIR}/val2014" \
--train_captions_file="${DATASET_DIR}/annotations/captions_train2014.json" \
--val_captions_file="${DATASET_DIR}/annotations/captions_val2014.json" \
--output_dir="${OUTPUT_DIR}" \
--word_counts_output_file="${OUTPUT_DIR}/word_counts.txt"
For Flickr8k:
DATASET_DIR="/home/zisang/Documents/code/data/Flicker8k"
OUTPUT_DIR="/home/zisang/im2txt/data/flickr8k"
python ./dataset/build_data.py \
--graph_path "../data/frozen_faster_rcnn.pb" \
--dataset "flickr8k" \
--min_word_count 2 \
--image_dir "${DATASET_DIR}/Flicker8k_Dataset/" \
--text_path "${DATASET_DIR}/" \
--output_dir "${OUTPUT_DIR}" \
--train_shards 32 \
--num_threads 8
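For reference, the sketch below shows roughly what build_data.py has to do per image: load the frozen Faster R-CNN graph and collect pooled features for the 100 region proposals. It uses the TensorFlow 1.x API; 'region_features:0' is a placeholder tensor name that is not guaranteed to exist in the exported graph, so check build_data.py for the real node names.

```python
import cv2
import tensorflow as tf  # TensorFlow 1.x style API

# Load the frozen Faster R-CNN graph exported earlier.
graph_def = tf.GraphDef()
with tf.gfile.GFile('../data/frozen_faster_rcnn.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='')

with tf.Session(graph=graph) as sess:
    image = cv2.imread('example.jpg')               # BGR, uint8
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # detection graphs expect RGB

    # 'image_tensor:0' is the standard input of an object-detection export;
    # 'region_features:0' is a PLACEHOLDER name for the node that holds the
    # pooled per-proposal features (100 x 2048) -- adjust to the real graph.
    feats = sess.run('region_features:0',
                     feed_dict={'image_tensor:0': image[None, ...]})
    print(feats.shape)  # expected: (1, 100, 2048)
```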
- Training: First make sure you are under the folder code, then set up the parameters in config.py, and then run a command like this:
python train.py --input_file_pattern='../data/flickr8k/train-?????-of-00016' \
--number_of_steps=100000 \
--attention='bias' \
--optimizer='Adam' \
--train_dir='../output/model'
To monitor the progress of training, run the following command:
tensorboard --logdir='../output/model'
- Evaluation: To evaluate a trained model using the Flickr8k data, run a command like this:
python eval.py --input_file_pattern='../data/flickr8k/val-?????-of-00008' \
--checkpoint_dir='../output/model' \
--attention='bias' \
--eval_dir='../output/eval' \
--min_global_step=10 \
--num_eval_examples=32 \
--vocab_file="../data/flickr8k/word_counts.txt" \
--beam_size=3 \
--save_eval_result_as_image \
--eval_result_dir='../val/results/' \
--val_raw_image_dir='/home/zisang/Documents/code/data/Flicker8k/Flicker8k_Dataset'
The results are printed to stdout and stored in eval_dir as TensorFlow summaries.
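The --beam_size flag controls beam search at decode time: instead of greedily taking the single most likely word at each step, the decoder keeps the beam_size best partial captions and expands each of them. The sketch below is a generic, framework-free illustration of that idea; the step_fn callback and the token ids are stand-ins, not the project's actual decoder API.

```python
import heapq
import numpy as np

def beam_search(step_fn, start_id, end_id, beam_size=3, max_len=20):
    """Generic beam search over a step function.

    step_fn(tokens) -> log-probabilities over the vocabulary for the next
    word given the partial caption `tokens` (a stand-in for the LSTM
    decoder + attention forward pass).
    """
    beams = [(0.0, [start_id])]          # (cumulative log-prob, tokens)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            log_probs = step_fn(tokens)
            # Expand each beam with its beam_size best next words.
            for wid in np.argsort(log_probs)[-beam_size:]:
                candidates.append((score + log_probs[wid], tokens + [int(wid)]))
        # Keep only the overall beam_size best candidates.
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
        # Move completed captions out of the active beam.
        finished += [b for b in beams if b[1][-1] == end_id]
        beams = [b for b in beams if b[1][-1] != end_id]
        if not beams:
            break
    best = max(finished or beams, key=lambda c: c[0])
    return best[1]
```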
- Inference: A web interface was built using Flask. You can use the trained model to generate captions for any JPEG image!
1 - Install Flask
pip install Flask
2 - Export the frozen graph:
python export.py --model_folder='../output/model' \
--output_path='../data/frozen_lstm.pb' \
--attention='bias'
Then run the Flask server:
python server.py --mode att-nic \
--vocab_path ../output/vocabulary.csv
or run the following to see our results
python server.py --mode ours \
--faster_rcnn_model_file="../data/frozen_faster_rcnn.pb" \
--lstm_model_file="../data/frozen_lstm.pb" \
--vocab_path="../data/flickr8k/word_counts.txt"
3 - Open the picture test interface at http://127.0.0.1:5000
4 - Log in as admin at http://127.0.0.1:5000/admin to see more information
Username: admin Password: 0000
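For orientation only, the snippet below is a hypothetical illustration of how a Flask caption endpoint for uploaded JPEGs could be wired up; the route, the form-field name, and the caption_image helper are made up and do not reflect the actual server.py.

```python
from flask import Flask, request

app = Flask(__name__)

def caption_image(jpeg_bytes):
    # Hypothetical helper: run Faster R-CNN feature extraction and the
    # LSTM decoder here, then return the generated sentence.
    return "a placeholder caption"

@app.route('/', methods=['GET', 'POST'])
def index():
    if request.method == 'POST':
        # 'image' is an assumed form-field name for the uploaded JPEG.
        uploaded = request.files['image']
        return caption_image(uploaded.read())
    # Minimal upload form for manual testing in a browser.
    return ('<form method="post" enctype="multipart/form-data">'
            '<input type="file" name="image"><input type="submit"></form>')

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=5000)
```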
This model was trained solely on the COCO train2014 data. It achieves the following scores on the COCO val2014 data (with beam size = 3):
- BLEU-1 = 0.702
- BLEU-2 = 0.534
- BLEU-3 = 0.394
- BLEU-4 = 0.291
- METEOR = 0.234
- ROUGE = 0.516
- CIDEr = 0.849
- Perplexity = 6.4
For comparison, Show, Attend and Tell reports the following performance:
- BLEU-1 = 70.3%
- BLEU-2 = 53.6%
- BLEU-3 = 39.8%
- BLEU-4 = 29.5%
Our model achieves performance similar to the fc2-attention variant while requiring much less computation to generate the attention vector.
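For a quick sanity check of caption quality outside the full evaluation pipeline, NLTK (one of the listed dependencies) can compute corpus-level BLEU-1 through BLEU-4 on tokenized captions. The official numbers above may have been computed with a different toolkit, so treat this only as an approximation.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each hypothesis is paired with a list of tokenized reference captions.
references = [[['a', 'dog', 'runs', 'on', 'the', 'grass'],
               ['a', 'dog', 'is', 'running', 'outside']]]
hypotheses = [['a', 'dog', 'runs', 'in', 'the', 'grass']]

smooth = SmoothingFunction().method1
for n in range(1, 5):
    # Uniform weights over 1..n-grams give BLEU-n.
    weights = tuple(1.0 / n for _ in range(n))
    score = corpus_bleu(references, hypotheses, weights=weights,
                        smoothing_function=smooth)
    print('BLEU-%d = %.3f' % (n, score))
```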
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio. ICML 2015.
- The original implementation in Theano
- An earlier implementation in TensorFlow
- TensorFlow models im2txt