A TensorFlow implementation of ConditionalLearnToPayAttention.
We design a conditional attention mechanism for sequential visual tasks such as multiple object recognition and image captioning.
SVHN is obtained from house numbers in Google Street View images. The dataset can be downloaded here (format 1).
MS COCO is a dataset built by Microsoft that covers detection, segmentation, keypoints, and other tasks. The dataset can be downloaded here.
python 3.6
tensorflow 1.4.0
numpy 1.15.0
matplotlib 2.0.0
skimage 0.15.0
We generate images with bounding boxes and resize them to 64×64.
We then apply the data augmentation of Goodfellow et al. (2013), cropping a 54×54 pixel image from a random location within the 64×64 pixel image.
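A minimal sketch of this augmentation, assuming a standard `tf.random_crop` pipeline (the repository's actual preprocessing code may differ in details):

```python
import tensorflow as tf

def augment(image, label):
    """Randomly crop a 54x54 patch from the 64x64 training image.

    Illustrative sketch of the Goodfellow et al. (2013)-style augmentation.
    """
    image = tf.random_crop(image, size=[54, 54, 3])
    return image, label

def center_crop(image):
    """At eval time, take the central 54x54 crop instead of a random one."""
    return tf.image.resize_image_with_crop_or_pad(image, 54, 54)
```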
To verify the generality of the model, we also directly resize the original SVHN images without bounding boxes, and the results outperform the method in Goodfellow et al. (2013).
We also use multi-scale attention features to improve performance; the training procedure is the same for every attention scale.
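For orientation, a rough sketch of soft attention over one conv feature map and how two scales could be combined; the names and the dot-product compatibility function are illustrative assumptions, not the repository's actual code:

```python
import tensorflow as tf

def soft_attention(features, query):
    """Compute a soft-attention context vector over one feature map.

    features: [B, H, W, C] conv features; query: [B, C] global descriptor.
    Both scales are assumed projected to the same channel dim C.
    """
    b = tf.shape(features)[0]
    c = features.get_shape().as_list()[-1]
    flat = tf.reshape(features, [b, -1, c])               # [B, H*W, C]
    scores = tf.matmul(flat, tf.expand_dims(query, -1))   # [B, H*W, 1]
    alpha = tf.nn.softmax(scores, dim=1)                  # attention weights
    return tf.reduce_sum(alpha * flat, axis=1)            # [B, C] context

# Hypothetical multi-scale use: attend at two conv scales, then concatenate
# the per-scale context vectors before the classifier:
# context = tf.concat([soft_attention(conv4, g), soft_attention(conv5, g)], axis=1)
```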
Run `python convert_to_tfrecords.py` to generate three TFRecords files (train, val, test) with bounding boxes.
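For reference, a sketch of how such records are typically written with `tf.python_io.TFRecordWriter`; the feature keys below are illustrative and may not match `convert_to_tfrecords.py` exactly:

```python
import tensorflow as tf

def _int64_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def write_example(writer, image_bytes, digits, bbox):
    """Serialize one image with its digit labels and bounding box."""
    example = tf.train.Example(features=tf.train.Features(feature={
        'image': _bytes_feature(image_bytes),  # encoded 64x64 image
        'label': _int64_feature(digits),       # digit sequence, e.g. [1, 9]
        'bbox':  _int64_feature(bbox),         # [left, top, width, height]
    }))
    writer.write(example.SerializeToString())

# One writer per split: train.tfrecords, val.tfrecords, test.tfrecords.
writer = tf.python_io.TFRecordWriter('train.tfrecords')
```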
Run `python main.py`.
Only the data preprocessing changes; the structure and training of the model stay the same, so we just need to run `python convert_to_tfrecords.py` again to generate the new weakly labeled data.
Run `python convert_to_tfrecords.py` to generate three TFRecords files (train, val, test) without bounding boxes.
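A minimal sketch of parsing these weakly labeled records in an input pipeline, assuming illustrative key names and up to five padded digits per image (see `convert_to_tfrecords.py` for the actual layout):

```python
import tensorflow as tf

def parse_example(serialized):
    """Decode one weakly labeled record (note: no 'bbox' feature)."""
    features = tf.parse_single_example(serialized, features={
        'image': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([5], tf.int64),  # up to 5 digits, padded
    })
    image = tf.image.decode_png(features['image'], channels=3)
    image = tf.image.resize_images(image, [64, 64])  # resize only, no bbox crop
    return image, features['label']
```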
The image caption code mainly follows this author's implementation of the paper Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.
Download the COCO train2014 and val2014 data. Put the COCO train2014 images in the folder `train/images`, and put the file `captions_train2014.json` in the folder `train`. Similarly, put the COCO val2014 images in the folder `val/images`, and put the file `captions_val2014.json` in the folder `val`.
Furthermore, download the pretrained VGG16 net here if you want to use it to initialize the CNN part.
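A minimal sketch of how such `.npy` weights are commonly loaded into TensorFlow variables; the `{layer_name: [weights, biases]}` file layout and the variable names are assumptions, not necessarily what `main.py` does:

```python
import numpy as np
import tensorflow as tf

def load_cnn(session, cnn_model_file):
    """Assign pretrained VGG16 weights to already-built model variables."""
    # encoding='latin1' lets Python 3 read the Python 2 pickle inside the .npy
    data = np.load(cnn_model_file, encoding='latin1').item()
    for layer_name, params in data.items():
        weights, biases = params  # assumed [kernel, bias] pair per layer
        with tf.variable_scope(layer_name, reuse=True):
            session.run(tf.get_variable('weights').assign(weights))
            session.run(tf.get_variable('biases').assign(biases))
```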
Run `python main.py --phase=train --load=False --load_cnn=True --cnn_model_file='./vgg16_no_fc.npy' --train_cnn=True --beam_size=3` to train the caption model.
Run `python main.py --phase=eval --load=True --model_file='./models/xxxx.npy' --load_cnn=False --train_cnn=False --beam_size=3` to evaluate it.
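For orientation, plausible `tf.app.flags` definitions matching the commands above; these are illustrative only, and the actual definitions live in `main.py`:

```python
import tensorflow as tf

FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_string('phase', 'train', 'train or eval')
tf.app.flags.DEFINE_boolean('load', False, 'resume from a saved model file')
tf.app.flags.DEFINE_string('model_file', None, 'saved model to load for eval')
tf.app.flags.DEFINE_boolean('load_cnn', False, 'initialize CNN from .npy weights')
tf.app.flags.DEFINE_string('cnn_model_file', './vgg16_no_fc.npy', 'pretrained CNN weights')
tf.app.flags.DEFINE_boolean('train_cnn', False, 'whether to fine-tune the CNN part')
tf.app.flags.DEFINE_integer('beam_size', 3, 'beam width for caption decoding')
```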
On cropped SVHN, the soft attention model reaches a recognition accuracy of 97.15%, compared with 96.04% for the baseline CNN model (here).
On weakly labeled SVHN (without bounding boxes), the soft attention model reaches 80.45% recognition accuracy, compared with 70.58% for the baseline CNN model.
All qualitative and quantitative results are exported to `svhn.log`; you can print other results to the logs if you are interested.
You can also view results in TensorBoard: run `tensorboard --logdir=logs`.
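As a self-contained illustration of how scalar curves end up in the `logs` directory (the actual summary tags in this repository may differ):

```python
import tensorflow as tf

# Stand-in tensor for a training metric; in the real model this would be
# the loss or accuracy op.
loss = tf.placeholder(tf.float32, name='loss_value')
tf.summary.scalar('loss', loss)
merged = tf.summary.merge_all()

with tf.Session() as sess:
    writer = tf.summary.FileWriter('logs', sess.graph)
    for step in range(3):  # stand-in for the training loop
        summary = sess.run(merged, feed_dict={loss: 1.0 / (step + 1)})
        writer.add_summary(summary, global_step=step)
    writer.close()
```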
The image caption model was trained on the COCO train2014 data. It achieves the following scores on the COCO val2014 data (beam size = 3):
BLEU-1 = 70.9
BLEU-2 = 54.1
BLEU-3 = 40.5
BLEU-4 = 30.3
METEOR = 23.9
CIDEr = 89.5
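If you want to reproduce metrics like these, they are typically computed with the coco-caption toolkit; a minimal sketch, assuming `pycocoevalcap` is installed and the captions are already tokenized:

```python
from pycocoevalcap.bleu.bleu import Bleu

# Toy example: map each image id to a list of caption strings.
gts = {0: ['a man riding a horse on a beach']}     # reference captions
res = {0: ['a man rides a horse near the ocean']}  # generated captions

bleu_scores, _ = Bleu(4).compute_score(gts, res)   # BLEU-1 through BLEU-4
for n, score in enumerate(bleu_scores, start=1):
    print('BLEU-%d = %.3f' % (n, score))
```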
You can also view results in TensorBoard: run `tensorboard --logdir=summary`.
Attention maps from the conditional attention model trained on SVHN with/without bounding boxes, as well as visualizations of image captions, can be found in our paper.