This is the GitHub repository for the paper Latent Alignment of Procedural Concepts in Multimodal Recipes, published at ALVR 2020 (an ACL 2020 workshop).
To start, download the images-qa data from the RecipeQA website and unpack it in the main folder. Then download the following image representations and move them to the main folder as well.
To run the program, execute the following command:
python main.py
You can use the following options:
-i for the number of iterations
-n for the number of samples to use
-m for the mode: "train" or "test"
-s for the data split: "train", "test" or "valid"
-l for loading the stored models (e.g., -l True)
-c to specify the GPU number
-p to specify the main folder for the experiment (for saving and loading)
-f to specify the path and name of the txt file for saving the log
-a for the architecture number (7, 8, or 9; see below)
-e for the embedding type (1 for BERT, 2 for Flair, 3 for XLNet)
-o for the loss mode ("one" for objective 1 and "all" for objective 2)
-r for the learning rate
-x for enabling or disabling the modified max pooling
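For example, a training run on the train split with the simple multimodal architecture and BERT embeddings could look like the following (the option values here are purely illustrative):
python main.py -m train -s train -a 7 -e 1 -o all -i 10 -r 0.001 -c 0 -p ./exp1 -f ./exp1/log.txt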
You have to run a Stanford CoreNLP server on port 9000.
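If you have the CoreNLP distribution downloaded, the server can typically be started from its directory with a command like the following (the memory and timeout settings are only suggestions):
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000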
Please also install the following Python 3 packages: flair, torch, torchvision, Pillow (PIL), tqdm, pycorenlp, and numpy; pickle and math are part of the Python standard library.
The image representations are the outputs of the last layer before the classification head of a ResNet-50 network, pretrained and taken from the torchvision model zoo. The output of the network for each picture is a 2048-dimensional vector.
The word embeddings come from a pretrained BERT model; we use Flair to obtain them.
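A minimal sketch of getting BERT word embeddings through Flair (the embedding class and model name here are assumptions; the repository may use a different Flair embedding class or BERT checkpoint):
from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# BERT embeddings served through Flair (model name is an assumption).
bert = TransformerWordEmbeddings("bert-base-uncased")

sentence = Sentence("Preheat the oven to 350 degrees.")
bert.embed(sentence)

for token in sentence:
    vec = token.embedding  # torch tensor holding this token's embedding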
Some images are in mode L (grayscale), which produces a different representation after the PyTorch/torchvision transforms. As a result, we convert all pictures to RGB before applying ResNet to them.
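A minimal sketch of this feature extraction, including the RGB conversion (the preprocessing values below are the standard ImageNet ones and may differ from what the repository uses):
import torch
from torchvision import models, transforms
from PIL import Image

# Standard ImageNet preprocessing (assumed; the repo's exact transforms may differ).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Pretrained ResNet-50 with the classification layer removed,
# so the output is the 2048-dimensional pooled feature.
resnet = models.resnet50(pretrained=True)
resnet.fc = torch.nn.Identity()
resnet.eval()

def image_to_vector(path):
    # Convert to RGB first, so grayscale (mode "L") images get three channels.
    img = Image.open(path).convert("RGB")
    batch = preprocess(img).unsqueeze(0)  # shape: (1, 3, 224, 224)
    with torch.no_grad():
        features = resnet(batch)          # shape: (1, 2048)
    return features.squeeze(0)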
We use Stanford CoreNLP to detect sentence boundaries in the instruction text body.
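A minimal sketch of this sentence-splitting step with pycorenlp, assuming the CoreNLP server is running on port 9000 as described above (the annotator settings and the token joining are illustrative):
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP("http://localhost:9000")

def split_sentences(text):
    # Ask the CoreNLP server for tokenization and sentence boundaries only.
    output = nlp.annotate(text, properties={
        "annotators": "tokenize,ssplit",
        "outputFormat": "json",
    })
    sentences = []
    for sent in output["sentences"]:
        tokens = [tok["word"] for tok in sent["tokens"]]
        sentences.append(" ".join(tokens))
    return sentences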
In some cases the answer set contains blank entries (' '), which we have to remove.
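A minimal sketch of that filtering step (the variable name is hypothetical):
# Drop empty or whitespace-only entries from a candidate answer list.
answer_set = [choice for choice in answer_set if choice.strip()]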
Use -a 7 to run the simple multimodal experiment, -a 8 for the LXMERT experiment, and -a 9 for the unimodal experiment.