ManagerTower

This repo is the official PyTorch implementation of the paper:

ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning

Xiao Xu, Bei Li, Chenfei Wu, Shao-Yen Tseng, Anahita Bhiwandiwalla, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan.

ACL 2023 (Oral) | Association for Computational Linguistics

Paper | arXiv | Model | Slides | Video(EN) | Video(CN) | Blog(CN) | Tweet(EN)

Abstract

Two-Tower Vision-Language (VL) models have shown promising improvements on various downstream VL tasks. Although the most advanced work improves performance by building bridges between encoders, it suffers from ineffective layer-by-layer utilization of uni-modal representations and cannot flexibly exploit different levels of uni-modal semantic knowledge. In this work, we propose ManagerTower, a novel VL model architecture that gathers and combines the insights of pre-trained uni-modal experts at different levels. The managers introduced in each cross-modal layer can adaptively aggregate uni-modal semantic knowledge to facilitate more comprehensive cross-modal alignment and fusion. ManagerTower outperforms previous strong baselines both with and without Vision-Language Pre-training (VLP). With only 4M VLP data, ManagerTower achieves superior performance on various downstream VL tasks, especially 79.15% accuracy on VQAv2 Test-Std, 86.56% IR@1 and 95.64% TR@1 on Flickr30K. Code and checkpoints are available at https://github.com/LooperXX/ManagerTower.

Architecture

[Figure: ManagerTower architecture]
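
The abstract above describes the core mechanism: managers in each cross-modal layer adaptively aggregate the outputs of the pre-trained uni-modal encoders. As an informal sketch (our own notation, not necessarily the paper's exact formulation), assume one uni-modal encoder has $N$ layers with outputs $U_1, \dots, U_N$; a manager in the $\ell$-th cross-modal layer could then aggregate them with softmax-normalized weights:

$$\tilde{U}_\ell = \sum_{i=1}^{N} \alpha_{\ell,i}\, U_i, \qquad \alpha_\ell = \mathrm{softmax}(w_\ell),$$

where $w_\ell$ are learned aggregation scores (in the adaptive variant they are additionally conditioned on the cross-modal states from the previous layer), and $\tilde{U}_\ell$ is fed into the $\ell$-th cross-modal layer together with the aggregate from the other modality. See the paper for the exact definition of the managers.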

BridgeTower vs. ManagerTower

[Figure: comparison between BridgeTower and ManagerTower]

Main Results

[Figure: main results]

Visualization

[Figure: visualization]

Deployment

  • Run setup.sh to set up the environment.
  • [Optional] We use wandb to track experiments. Please remember to run wandb login and paste your API token before running the scripts, as shown in the sketch below.
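
A minimal sketch of these two steps, assuming setup.sh installs all required dependencies (including wandb) for the scripts below:

# set up the environment used by the training scripts
bash setup.sh

# [optional] authenticate with wandb; paste your API token when prompted
wandb login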

Dataset Preparation

Checkpoints

We provide checkpoints for reproducing our results. You can download them from here.
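
For example, a download step might look like the following sketch; the checkpoints/ directory and <checkpoint-url> are illustrative placeholders, not paths required by the scripts:

# hypothetical example: fetch a released checkpoint into a local folder
# replace <checkpoint-url> with the actual link provided above
mkdir -p checkpoints
wget -P checkpoints <checkpoint-url>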

Pre-training on Image-Text Datasets

# Pre-train ManagerTower Base Model
bash scripts/pre_train.sh

Fine-tuning on Downstream VL Tasks

  • For VQAv2 evaluation, submit the JSON prediction file in the logs/ directory to the EvalAI evaluation server to get the test-dev and/or test-std scores (see the note after the commands below).
# Base Model on VQAv2 without VLP
bash scripts/ftfs_base_vqa.sh

# Base Model on VQAv2 with VLP
bash scripts/ftfpt_base_vqa.sh

# Base Model on SNLI-VE with VLP
bash scripts/ftfpt_base_snlive.sh

# Base Model on NLVR^2 with VLP
bash scripts/ftfpt_base_nlvr2.sh

# Base Model on IRTR-Flickr30K with VLP (follow ALBEF to use ITC to sample hard negatives for ITM)
bash scripts/ftfpt_base_flickr.sh
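
As noted above, the VQAv2 prediction file is written under the logs/ directory; a simple, non-official way to locate it before uploading to the EvalAI server (the exact filename depends on the run) is:

# list candidate prediction files produced by the VQAv2 run
find logs -name "*.json"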

Citation

@article{xu2023managertower,
  title={ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning},
  author={Xu, Xiao and Li, Bei and Wu, Chenfei and Tseng, Shao-Yen and Bhiwandiwalla, Anahita and Rosenman, Shachar and Lal, Vasudev and Che, Wanxiang and Duan, Nan},
  journal={arXiv preprint arXiv:2306.00103},
  year={2023}
}

Acknowledgement

We are highly grateful for the public code of related papers; our code is partly based on them (e.g., ViLT and METER, whose licenses are included in this repository).
