This repository contains the code and datasets for the publication Image-conditioned human language comprehension and psychometric benchmarking of visual language models (in Proceedings of the 28th Conference on Computational Natural Language Learning, CoNLL 2024) and the thesis Expectation-based comprehension of linguistic input: facilitation from visual context.
- RT_datasets: This folder contains the processed and raw versions of the reading-time datasets (the raw files come straight from IBEX). The processed versions add features such as surprisal estimates from different models, part-of-speech labels, the correctness status of every word read by every subject, frequency, length, and item condition_IDs. To see what the RT-data-collection Maze experiment looked like, check out this video. A short pandas sketch for inspecting these files appears after this list.
- all_analysis: This folder contains the R Markdown notebooks with the mixed-effects (lmer) and Bayesian regression analyses and results reported in the publication. An illustrative Python analogue of this kind of model is sketched after this list.
- final_grounding_data: This folder contains the raw data collected in the groundedness-rating experiment.
- generating_features: This folder contains the code for generating all of the features later used to predict reading time, error occurrence, surprisal differences, etc. Notable features include surprisal estimates from 11 different models (9 VLMs and 2 LLMs), frequency, open/closed part-of-speech class, and word length. This folder also has the code for adding condition_IDs to the raw data. A hedged sketch of word-level surprisal extraction appears after this list.
- groundedness_experiment_code: This folder includes all JavaScript and HTML files, along with the stimulus images, needed to build the groundedness-rating experiment. A notable addition to the stock jsPsych components is a vertical slider for collecting groundedness ratings in a natural way. To see what this experiment looked like, check out this video.
- img: A folder containing most of the generated plots for the paper/thesis.
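For orientation, here is a minimal sketch of how a processed RT file might be inspected with pandas. The file name and the column names used below are illustrative assumptions, not guaranteed to match the actual files in RT_datasets; check the headers before reusing them.

```python
# Hypothetical sketch: inspect a processed reading-time file.
# The path and column names (rt, correct, condition_ID) are assumptions.
import pandas as pd

rt = pd.read_csv("RT_datasets/processed_rt_data.csv")  # hypothetical file name
print(rt.columns.tolist())
print(rt.head())

# Example summary: mean RT per condition, restricted to correctly read words.
summary = (rt[rt["correct"] == 1]
           .groupby("condition_ID")["rt"]
           .agg(["mean", "sem", "count"]))
print(summary)
```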
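The published analyses live in the R notebooks under all_analysis (lmer and Bayesian mixed-effects models). Purely as an illustrative analogue, and not the analysis code itself, a comparable frequentist model can be sketched in Python with statsmodels; the column names are again assumptions.

```python
# Illustrative analogue of a mixed-effects RT regression (NOT the published R analysis).
# Column names (rt, surprisal, freq, length, correct, subject) are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

rt = pd.read_csv("RT_datasets/processed_rt_data.csv")  # hypothetical file name
rt = rt[rt["correct"] == 1]  # keep correctly read words only

# Reading time predicted by surprisal plus frequency and length covariates,
# with by-subject random intercepts. (Crossed subject/item random effects,
# as in lme4/brms, are easier to express in the R notebooks.)
model = smf.mixedlm("rt ~ surprisal + freq + length", data=rt, groups=rt["subject"])
fit = model.fit()
print(fit.summary())
```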
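generating_features holds the actual surprisal scripts for the 11 VLMs/LLMs. As a hedged, self-contained illustration of the general recipe for a text-only causal LM (summing subword surprisals within each whitespace-delimited word), the sketch below uses GPT-2 via Hugging Face transformers. The model choice and the word-segmentation details are assumptions, not the repository's pipeline; the VLM case additionally conditions on the image.

```python
# Illustrative word-level surprisal from a causal LM (gpt2 chosen only as an example;
# this is not the repository's actual feature-generation script).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def word_surprisals(sentence):
    """Return (word, surprisal in bits) pairs; subword surprisals are summed per word."""
    words = sentence.split()
    # GPT-2's BPE marks word boundaries with a leading space, so tokenizing each
    # non-initial word with a prepended space mirrors its in-sentence segmentation.
    counts = [len(tokenizer(w if i == 0 else " " + w)["input_ids"])
              for i, w in enumerate(words)]
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"]

    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab)

    # Logits at position t-1 predict token t, so align them with tokens 1..end.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    tok_surprisal = -log_probs[torch.arange(len(targets)), targets] / math.log(2)

    # Sum subword surprisals within each word; the sentence-initial token has no
    # left context under a causal LM, so the first word omits its first subword.
    out, start = [], 0
    for word, n in zip(words, counts):
        lo, hi = max(start - 1, 0), start + n - 1
        out.append((word, tok_surprisal[lo:hi].sum().item()))
        start += n
    return out

print(word_surprisals("A man is riding a horse on the beach"))
```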
If you find our work and datasets useful, please consider citing the paper!
@inproceedings{pushpita-levy-2024-image,
title = "Image-conditioned human language comprehension and psychometric benchmarking of visual language models",
author = "Pushpita, Subha Nawer and
Levy, Roger P.",
editor = "Barak, Libby and
Alikhani, Malihe",
booktitle = "Proceedings of the 28th Conference on Computational Natural Language Learning",
month = nov,
year = "2024",
address = "Miami, FL, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.conll-1.34",
pages = "447--457",
abstract = "Large language model (LLM)s{'} next-word predictions have shown impressive performance in capturing human expectations during real-time language comprehension. This finding has enabled a line of research on psychometric benchmarking of LLMs against human language-comprehension data in order to reverse-engineer humans{'} linguistic subjective probability distributions and representations. However, to date, this work has exclusively involved unimodal (language-only) comprehension data, whereas much human language use takes place in rich multimodal contexts. Here we extend psychometric benchmarking to visual language models (VLMs). We develop a novel experimental paradigm, $\textit{Image-Conditioned Maze Reading}$, in which participants first view an image and then read a text describing an image within the Maze paradigm, yielding word-by-word reaction-time measures with high signal-to-noise ratio and good localization of expectation-driven language processing effects. We find a large facilitatory effect of correct image context on language comprehension, not only for words such as concrete nouns that are directly grounded in the image but even for ungrounded words in the image descriptions. Furthermore, we find that VLM surprisal captures most to all of this effect. We use these findings to benchmark a range of VLMs, showing that models with lower perplexity generally have better psychometric performance, but that among the best VLMs tested perplexity and psychometric performance dissociate. Overall, our work offers new possibilities for connecting psycholinguistics with multimodal LLMs for both scientific and engineering goals.",
}