Whodunnit? Inferring what happened from multimodal evidence

Materials for the paper "Whodunnit? Inferring what happened from multimodal evidence".

Sarah A. Wu*, Erik Brockbank*, Hannah Cha, Jan-Philipp Fränken, Emily Jin, Zhuoyi Huang, Weiyu Liu, Ruohan Zhang, Jiajun Wu, Tobias Gerstenberg.

To be presented at the 46th Annual Conference of the Cognitive Science Society (2024; Rotterdam, Netherlands).

@inproceedings{wu2024whodunnit,
  title = {Whodunnit? Inferring what happened from multimodal evidence},
  booktitle = {Proceedings of the 46th {Annual} {Conference} of the {Cognitive} {Science} {Society}},
  author = {Wu*, Sarah A. and Brockbank*, Erik and Cha, Hannah and Fr\"anken, Jan-Philipp and Jin, Emily and Huang, Zhuoyi and Liu, Weiyu and Zhang, Ruohan and Wu, Jiajun and Gerstenberg, Tobias},
  year = {2024},
}

Contents:

Overview
Experiment
Repository structure
Code
CRediT author statement

Overview

Humans are remarkably adept at inferring the causes of events in their environment; doing so often requires incorporating information from multiple sensory modalities. For instance, if a car slows down in front of us, inferences about why they did so are rapidly revised if we also hear sirens in the distance. Here, we investigate the ability to reconstruct others' actions and events from the past by integrating multimodal information. Participants were asked to infer which of two agents performed an action in a household setting given either visual evidence, auditory evidence, or both. We develop a computational model that makes inferences by generating multimodal simulations, and also evaluate our task on a large language model (GPT-4) and a large multimodal model (GPT-4V). We find that humans are relatively accurate overall and perform best when given multimodal evidence. GPT-4 and GPT-4V performance comes close overall, but is very weakly correlated with participants across individual trials. Meanwhile, the simulation model captures the pattern of human responses well. Multimodal event reconstruction represents a challenge for current AI systems, and frameworks that draw on the cognitive processes underlying people's ability to reconstruct events offer a promising avenue forward.

Experiment

The experiment reported in the paper was pre-registered on the Open Science Framework. A preview of the experiment is available online.

Repository structure

├── code
│   ├── analysis
│   ├── generate_audio
│   ├── generate_visual
│   ├── gpt4
│   ├── model_data
│   └── simulation_model
├── data
├── docs
│   └── experiment
├── figures
└── writeup

/code: contains the code for the experiment and analyses.
- /analysis: contains all the code for analyzing data and generating figures (a rendered version of the analysis file is also available).
- /generate_visual: contains code to generate the images for each trial from JSON specifications.
- /generate_audio: contains code to generate the audio files for each trial.
- /gpt4: contains code to run the GPT-4 and GPT-4V evaluations.
- /model_data: contains trial data in the format used by the simulation model and by the GPT-4 and GPT-4V evaluations. The models use a combination of evidence images, scene graph JSON files, and a CSV with transcribed audio evidence for each trial.
- /simulation_model: contains code and output for the simulation model.
/data: contains anonymized participant data from the experiment as well as GPT-4 and GPT-4V evaluation results.
/docs/experiment: contains all the behavioral experiment code, which can also be run as a demo.
/figures: contains all the figures from the paper, generated using the script in code/analysis.
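
As a rough illustration of how the per-trial inputs described under /model_data might be combined, the sketch below joins a CSV of transcribed audio evidence with scene graph JSON for each trial. It is not the repository's loading code: the column names (trial, audio_transcript), the JSON keys (rooms, agents), and the helper load_trials are all assumptions made for this example, and the inline strings stand in for the real files.

```python
import csv
import io
import json

# Hypothetical stand-ins for files in code/model_data. The real file names,
# CSV columns, and JSON keys may differ from what is assumed here.
AUDIO_CSV = """trial,audio_transcript
1,"footsteps, then a door closing"
2,"a faucet running"
"""

SCENE_GRAPHS = {
    "1": '{"rooms": ["kitchen", "hallway"], "agents": ["A", "B"]}',
    "2": '{"rooms": ["bathroom"], "agents": ["A", "B"]}',
}


def load_trials(audio_csv_text, scene_graph_texts):
    """Return one dict per trial combining audio text and scene graph."""
    trials = []
    for row in csv.DictReader(io.StringIO(audio_csv_text)):
        trial_id = row["trial"]
        trials.append({
            "trial": trial_id,
            "audio": row["audio_transcript"],
            # Parse the scene graph JSON that belongs to the same trial.
            "scene_graph": json.loads(scene_graph_texts[trial_id]),
        })
    return trials


trials = load_trials(AUDIO_CSV, SCENE_GRAPHS)
```

Keying everything by a trial identifier keeps the visual and auditory evidence for a trial together, so the same record can be passed either to a simulation-style model or formatted into a prompt. Consult the documentation in the code directory for the actual formats.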

Code

Refer to the documentation in the code directory for more details about the simulation model and running various parts of the code, including generating trial images, running evaluations on GPT-4(V), and running simulation model predictions.

CRediT author statement

What is a CRediT author statement?

Sarah A. Wu*: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data Curation, Writing - Original Draft, Writing - Review & Editing, Visualization, Project administration
Erik Brockbank*: Conceptualization, Methodology, Software, Validation, Formal analysis, Resources, Writing - Original Draft, Writing - Review & Editing, Project administration
Hannah Cha: Conceptualization, Methodology, Software, Writing - Review & Editing, Visualization
Jan-Philipp Fränken: Conceptualization, Methodology, Software, Validation, Resources
Emily Jin: Conceptualization, Methodology
Weiyu Liu: Conceptualization, Methodology
Ruohan Zhang: Conceptualization, Methodology, Validation, Supervision
Jiajun Wu: Conceptualization, Supervision, Funding acquisition
Tobias Gerstenberg: Conceptualization, Methodology, Writing - Review & Editing, Supervision, Project administration, Funding acquisition
