zisang0210/im2txt

Introduction

This neural image captioning system is motivated by the paper "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" by Xu et al. (ICML 2015). The input is an image, and the output is a sentence describing its content. A Faster R-CNN model extracts visual features from the image, and an LSTM recurrent neural network decodes those features into a sentence. A soft attention mechanism is incorporated to improve caption quality. The project is implemented with the TensorFlow library and currently supports training of the RNN part only.
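
For intuition, here is a minimal NumPy sketch of the additive soft attention from Xu et al., applied to the 100×2048 region features used throughout this project; the layer sizes and variable names are illustrative assumptions, not the repository's actual code:

import numpy as np

def soft_attention(features, hidden, W_f, W_h, w_a):
    """One soft-attention step: weight 100 region features by relevance.

    features: (100, 2048) Faster R-CNN region features
    hidden:   (512,)      current LSTM hidden state (size assumed)
    W_f, W_h, w_a: learned projections, shapes (2048, 512), (512, 512), (512,)
    """
    scores = np.tanh(features @ W_f + hidden @ W_h) @ w_a  # (100,) relevance scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                               # softmax attention weights
    return weights @ features                              # (2048,) context vector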

Prerequisites

  • Python with TensorFlow 1.x (the export script below uses the TensorFlow Object Detection API)
  • Flask, for the web demo (see Inference below)

Usage

  • Tips:
  1. Delete all __pycache__ folders under the current directory:
find . -name '__pycache__' -type d -exec rm -rf {} \;
  • Dataset preparation:
  1. Download the faster_rcnn_resnet50 checkpoint:
cd data
wget http://download.tensorflow.org/models/object_detection/faster_rcnn_resnet50_coco_2018_01_28.tar.gz
tar -xzf faster_rcnn_resnet50_coco_2018_01_28.tar.gz
  2. Export a frozen inference graph from the checkpoint:
export PYTHONPATH=$PYTHONPATH:./object_detection/
python ./object_detection/export_inference_graph.py \
    --input_type image_tensor \
    --pipeline_config_path ../data/faster_rcnn_resnet50_coco_2018_01_28/pipeline.config --trained_checkpoint_prefix ../data/faster_rcnn_resnet50_coco_2018_01_28/model.ckpt  --output_directory ../data/faster_rcnn_resnet50_coco_2018_01_28/exported_graphs
cp ../data/faster_rcnn_resnet50_coco_2018_01_28/exported_graphs/frozen_inference_graph.pb  ../data/frozen_faster_rcnn.pb
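
The resulting .pb file can later be loaded for feature extraction. A sketch of the standard TF 1.x loading pattern (only 'image_tensor:0' is implied by the --input_type flag above; any other tensor name would need checking against the graph):

import tensorflow as tf

def load_frozen_graph(pb_path):
    """Load a frozen inference graph so its tensors can be fed and fetched."""
    graph = tf.Graph()
    with graph.as_default():
        graph_def = tf.GraphDef()
        with tf.gfile.GFile(pb_path, 'rb') as f:
            graph_def.ParseFromString(f.read())
        tf.import_graph_def(graph_def, name='')
    return graph

graph = load_frozen_graph('../data/frozen_faster_rcnn.pb')
image_tensor = graph.get_tensor_by_name('image_tensor:0')  # input created by the export
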
  3. Skip this step if you have already downloaded the COCO dataset; otherwise run the following command to get COCO:
OUTPUT_DIR="/home/zisang/im2txt"
sh ./dataset/download_mscoco.sh ../data/coco
  4. Extract features for each region proposal (100×2048).

For COCO, run the following command:

DATASET_DIR="/home/zisang/Documents/code/data/mscoco/raw-data"
OUTPUT_DIR="/home/zisang/im2txt/data/coco"
python ./dataset/build_data.py \
  --graph_path="../data/frozen_faster_rcnn.pb" \
  --dataset "coco" \
  --train_image_dir="${DATASET_DIR}/train2014" \
  --val_image_dir="${DATASET_DIR}/val2014" \
  --train_captions_file="${DATASET_DIR}/annotations/captions_train2014.json" \
  --val_captions_file="${DATASET_DIR}/annotations/captions_val2014.json" \
  --output_dir="${OUTPUT_DIR}" \
  --word_counts_output_file="${OUTPUT_DIR}/word_counts.txt" 
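
build_data.py defines the actual serialization; as a rough sketch of how 100×2048 region features and one tokenized caption are typically paired in a TFRecord (the key names here are assumptions, not the script's real ones):

import tensorflow as tf

def make_example(region_features, caption_ids):
    """Pair one image's region features with one tokenized caption."""
    context = tf.train.Features(feature={
        'image/features': tf.train.Feature(float_list=tf.train.FloatList(
            value=region_features.reshape(-1))),  # flatten (100, 2048)
    })
    feature_lists = tf.train.FeatureLists(feature_list={
        'image/caption_ids': tf.train.FeatureList(feature=[
            tf.train.Feature(int64_list=tf.train.Int64List(value=[w]))
            for w in caption_ids]),
    })
    return tf.train.SequenceExample(context=context, feature_lists=feature_lists)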

For Flickr8k, run:

DATASET_DIR="/home/zisang/Documents/code/data/Flicker8k"
OUTPUT_DIR="/home/zisang/im2txt/data/flickr8k"
python ./dataset/build_data.py \
  --graph_path "../data/frozen_faster_rcnn.pb" \
  --dataset "flickr8k" \
  --min_word_count 2 \
  --image_dir "${DATASET_DIR}/Flicker8k_Dataset/" \
  --text_path "${DATASET_DIR}/" \
  --output_dir "${OUTPUT_DIR}" \
  --train_shards 32 \
  --num_threads 8
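
The --min_word_count 2 flag controls which words enter the vocabulary; in essence the filtering amounts to something like this sketch (the real logic and file format live in build_data.py):

from collections import Counter

def build_vocab(captions, min_word_count=2):
    """Count words across all captions and drop the rare ones."""
    counts = Counter(word for caption in captions for word in caption.split())
    vocab = [(w, c) for w, c in counts.most_common() if c >= min_word_count]
    with open('word_counts.txt', 'w') as f:   # one "word count" pair per line
        for word, count in vocab:
            f.write('%s %d\n' % (word, count))
    return dict(vocab)
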
  • Training: First make sure you are in the code folder, then set up the parameters in config.py, and run a command like this:
python train.py --input_file_pattern='../data/flickr8k/train-?????-of-00016' \
    --number_of_steps=100000 \
    --attention='bias' \
    --optimizer='Adam' \
    --train_dir='../output/model'

To monitor the progress of training, run the following command:

tensorboard --logdir='../output/model'
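
train.py reads the sharded TFRecords matched by --input_file_pattern. A hedged sketch of such an input pipeline with the TF 1.x tf.data API (the parse keys are assumptions and must match what build_data.py actually wrote):

import tensorflow as tf

def parse(serialized):
    context, sequence = tf.parse_single_sequence_example(
        serialized,
        context_features={
            'image/features': tf.FixedLenFeature([100 * 2048], tf.float32)},
        sequence_features={
            'image/caption_ids': tf.FixedLenSequenceFeature([], tf.int64)})
    features = tf.reshape(context['image/features'], [100, 2048])
    return features, sequence['image/caption_ids']

files = tf.data.Dataset.list_files('../data/flickr8k/train-?????-of-00016')
dataset = (files.interleave(tf.data.TFRecordDataset, cycle_length=4)
           .map(parse)
           .shuffle(1000)
           .padded_batch(32, padded_shapes=([100, 2048], [None])))
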
  • Evaluation: To evaluate a trained model on the Flickr8k data, run a command like this:
python eval.py --input_file_pattern='../data/flickr8k/val-?????-of-00008' \
    --checkpoint_dir='../output/model' \
    --attention='bias' \
    --eval_dir='../output/eval' \
    --min_global_step=10 \
    --num_eval_examples=32 \
    --vocab_file="../data/flickr8k/word_counts.txt" \
    --beam_size=3 \
    --save_eval_result_as_image \
    --eval_result_dir='../val/results/' \
    --val_raw_image_dir='/home/zisang/Documents/code/data/Flicker8k/Flicker8k_Dataset'

The results are printed to stdout and stored in eval_dir as TensorFlow summaries.
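
--beam_size=3 selects beam search decoding: at each step the decoder keeps only the 3 highest-scoring partial captions. A minimal, framework-free sketch of the idea (the repository's decoder is more involved):

import heapq

def beam_search(step_fn, start_id, end_id, beam_size=3, max_len=20):
    """step_fn(tokens) -> list of (next_word_id, log_prob) candidates."""
    beams = [(0.0, [start_id])]                # (cumulative log prob, tokens)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == end_id:              # finished captions carry over
                candidates.append((score, seq))
                continue
            for word_id, logp in step_fn(seq):
                candidates.append((score + logp, seq + [word_id]))
        beams = heapq.nlargest(beam_size, candidates, key=lambda b: b[0])
        if all(seq[-1] == end_id for _, seq in beams):
            break
    return beams[0][1]                         # best-scoring caption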

  • Inference: A web interface was built using Flask. You can use the trained model to generate captions for any JPEG image!

1 - Install Flask

pip install Flask

2 - Export the frozen graph:

python export.py --model_folder='../output/model' \
    --output_path='../data/frozen_lstm.pb' \
    --attention='bias'
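
export.py presumably follows the standard TF 1.x freezing recipe; a sketch of that technique (the output node name below is a placeholder, not the repository's actual one):

import tensorflow as tf

def freeze(checkpoint_dir, output_path, output_node_names):
    """Bake the latest checkpoint's variables into graph constants."""
    checkpoint = tf.train.latest_checkpoint(checkpoint_dir)
    saver = tf.train.import_meta_graph(checkpoint + '.meta')
    with tf.Session() as sess:
        saver.restore(sess, checkpoint)
        frozen = tf.graph_util.convert_variables_to_constants(
            sess, sess.graph_def, output_node_names)
    with tf.gfile.GFile(output_path, 'wb') as f:
        f.write(frozen.SerializeToString())

# e.g. freeze('../output/model', '../data/frozen_lstm.pb', ['softmax'])
# where 'softmax' stands in for the real output node name.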

Then run the Flask server:

python server.py --mode att-nic \
    --vocab_path ../output/vocabulary.csv

or run the following to see our model's results:

python server.py --mode ours \
    --faster_rcnn_model_file="../data/frozen_faster_rcnn.pb" \
    --lstm_model_file="../data/frozen_lstm.pb" \
    --vocab_path="../data/flickr8k/word_counts.txt"
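
For reference, the rough shape of a minimal Flask captioning endpoint like the one server.py provides (caption_image is a hypothetical stand-in for the actual Faster R-CNN + LSTM inference):

from flask import Flask, jsonify, request

app = Flask(__name__)

def caption_image(jpeg_bytes):
    # Hypothetical stand-in: run Faster R-CNN features through the frozen LSTM.
    return 'a generated caption'

@app.route('/', methods=['POST'])
def caption():
    jpeg_bytes = request.files['image'].read()  # uploaded JPEG
    return jsonify(caption=caption_image(jpeg_bytes))

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=5000)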

3 - Open the picture test interface at http://127.0.0.1:5000

4 - Log in to the admin page at http://127.0.0.1:5000/admin to see more information

Username: admin Password: 0000

Results

This model was trained solely on the COCO train2014 data. It achieves the following scores on the COCO val2014 data (with beam size = 3):

  • BLEU-1 = 0.702
  • BLEU-2 = 0.534
  • BLEU-3 = 0.394
  • BLEU-4 = 0.291
  • METEOR = 0.234
  • ROUGE = 0.516
  • CIDEr = 0.849
  • Perplexity = 6.4

For comparison, Show, Attend and Tell reports the following performance:

  • BLEU-1 = 70.3%
  • BLEU-2 = 53.6%
  • BLEU-3 = 39.8%
  • BLEU-4 = 29.5%

Our method achieves performance similar to fc2 attention while requiring much less computation when generating the attention vectors.
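
Scores like the BLEU numbers above can be computed with standard tooling; for instance, corpus-level BLEU with NLTK (a sketch; the repository may rely on the COCO caption evaluation toolkit instead):

from nltk.translate.bleu_score import corpus_bleu

# One list of tokenized reference captions per image, one hypothesis per image.
references = [[['a', 'dog', 'runs'], ['a', 'dog', 'is', 'running']]]
hypotheses = [['a', 'dog', 'runs']]

for n in range(1, 5):
    weights = tuple([1.0 / n] * n)  # uniform n-gram weights gives BLEU-n
    print('BLEU-%d = %.3f' % (n, corpus_bleu(references, hypotheses, weights=weights)))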

References

  • Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015.
