Grounded Image Text Matching with Mismatched Relation Reasoning

This repository contains the official Python implementation for the ICCV 2023 paper Grounded Image Text Matching with Mismatched Relation Reasoning.

[project page] [paper] [supp] [preprint] [video]

Abstract

This paper introduces Grounded Image Text Matching with Mismatched Relation (GITM-MR), a novel visual-linguistic joint task that evaluates the relation understanding capabilities of transformer-based pre-trained models. GITM-MR requires a model to first determine if an expression describes an image, then localize referred objects or ground the mismatched parts of the text. We provide a benchmark for evaluating vision-language (VL) models on this task, with a focus on the challenging settings of limited training data and out-of-distribution sentence lengths. Our evaluation demonstrates that pre-trained VL models often lack data efficiency and length generalization ability. To address this, we propose the Relation-sensitive Correspondence Reasoning Network (RCRN), which incorporates relation-aware reasoning via bi-directional message propagation guided by language structure. Our RCRN can be interpreted as a modular program and delivers strong performance in terms of both length generalization and data efficiency.

GITM-MR Benchmark

Our benchmark is built upon the Ref-Reasoning [1] dataset, whose contribution we gratefully acknowledge. The benchmark can be downloaded from the link. The structure and contents of the data directory are as follows:

└─data
    ├─counter        # The correspondence from the original expressions to mismatch ones.
    ├─expression     # Referring expression annotation files.
    ├─parse          # Parsed language scene graphs.
    ├─small          # Training subset annotations.
    ├─uniter         # UNITER checkpoints and BERT tokenizer.
    ├─vinvl_objects  # Detected boxes and features in h5 format.
    ├─word2token     # Word to UNITER token indices used in representation extraction.

The annotated images come from GQA [2] and can be downloaded from the official website; however, our model does not require the original images as input, so download them only if you need them.
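
If you want a quick look at the released data, the detected boxes and features can be inspected with h5py. The sketch below is only illustrative: the exact file name inside data/vinvl_objects (features.h5 here) and its dataset keys are assumptions and may differ from the actual release.

    import h5py

    # Minimal sketch for inspecting the detected boxes/features.
    # NOTE: the file name below is an illustrative assumption and may
    # differ from the released GITM-MR files.
    h5_path = "data/vinvl_objects/features.h5"  # hypothetical file name

    with h5py.File(h5_path, "r") as f:
        # List the datasets stored in the file to see what is available.
        print(list(f.keys()))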

Prerequisites and Installation

Our implementation is based on the Detectron2 framework. You need to install the required packages and build the local copy of Detectron2 included in this repository. If you run into problems, the Common Installation Issues section of the Detectron2 installation manual may be helpful for debugging. A quick import check to verify the build is shown after the steps below.

  1. Prerequisites

    conda create -n gitm python=3.7
    conda activate gitm
    pip install -r requirements.txt
  2. Installation

    python setup.py build develop
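
After the build finishes, a minimal import check such as the following should run without errors if the local Detectron2 copy was built correctly:

    # Sanity check: the locally built Detectron2 should be importable.
    import detectron2
    print(detectron2.__version__)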

Reproduce the RCRN Result

  1. Download the model checkpoints from the link and put them into the ckpt directory.

  2. Download the full dataset into the data directory. The expected directory structure should be similar to:

    └─GITM-MR
        ├─data
        ├─ckpt
        ├─configs
        ├─detectron2
        ├─scripts
        ├─tools
  3. Run the evaluation process by:

    python tools/train_refdet.py --num-gpus $num_gpu --config-file configs/{RCRN_len16.yaml, RCRN_len11.yaml} --config configs/train-ng-base-1gpu.json --eval-only --resume OUTPUT_DIR $output_dir

    Choose one of the two config files (RCRN_len16.yaml or RCRN_len11.yaml), set the number of GPUs to use in $num_gpu, and set the output directory in $output_dir. Refer to the scripts directory for example commands; a concrete invocation is also sketched after this list.

  4. If necessary, refer to the detectron2/modeling/refdet_heads/RCRN.py file to explore our model implementation.
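
For instance, a single-GPU evaluation of the length-16 configuration might look like the following (the output directory name here is only illustrative):

    python tools/train_refdet.py --num-gpus 1 --config-file configs/RCRN_len16.yaml --config configs/train-ng-base-1gpu.json --eval-only --resume OUTPUT_DIR output/rcrn_len16_eval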

References

[1] Sibei Yang, Guanbin Li, and Yizhou Yu. Graph-structured referring expression reasoning in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9952–9961, 2020.

[2] Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.

Citing GITM-MR

If you find our work useful for your research, please consider citing us:

@InProceedings{Wu_2023_ICCV,
    author    = {Wu, Yu and Wei, Yana and Wang, Haozhe and Liu, Yongfei and Yang, Sibei and He, Xuming},
    title     = {Grounded Image Text Matching with Mismatched Relation Reasoning},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {2976-2987}
}

Contact

Please feel free to contact us at [email protected] or [email protected] if you have further questions or comments.