[📚paper] [project page] [🤗Dataset] [🤗model]
We introduce Instruction-guided Visual Masking (IVM), a new versatile visual grounding model that is compatible with diverse multimodal models, such as LMMs and robot models. By constructing visual masks for instruction-irrelevant regions, IVM-enhanced multimodal models can effectively focus on task-relevant image regions to better align with complex instructions. Specifically, we design a visual masking data generation pipeline and create the IVM-Mix-1M dataset with 1 million image-instruction pairs. We further introduce a new learning technique, Discriminator Weighted Supervised Learning (DWSL), for preferential IVM training that prioritizes high-quality data samples. Experimental results on generic multimodal tasks such as VQA, as well as on embodied robotic control, demonstrate the versatility of IVM, which, as a plug-and-play tool, significantly boosts the performance of diverse multimodal models.
Demo examples: "Duck on green plate" | "Red cup on red plate" | "Red cup on silver pan"
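To make the DWSL idea more concrete, here is a minimal sketch of discriminator-weighted supervised learning under simplifying assumptions: a hypothetical `mask_model` that predicts mask logits and a hypothetical `discriminator` that scores label quality, with per-sample losses re-weighted by that score so higher-quality annotations contribute more. The actual IVM objective and architecture are defined in the paper and code, not by this snippet.

```python
# Rough sketch of discriminator-weighted supervised learning (not the official
# implementation): per-sample losses are re-weighted by a quality score.
import torch
import torch.nn.functional as F

def dwsl_step(mask_model, discriminator, images, instructions, mask_labels, optimizer):
    pred_masks = mask_model(images, instructions)            # (B, H, W) mask logits
    per_sample = F.binary_cross_entropy_with_logits(
        pred_masks, mask_labels, reduction="none"            # float mask labels in [0, 1]
    ).mean(dim=(1, 2))                                        # (B,) per-sample loss
    with torch.no_grad():
        # hypothetical discriminator returning a quality logit per sample
        quality = torch.sigmoid(discriminator(images, mask_labels, instructions))
    # weighted average: high-quality samples dominate the update
    loss = (quality * per_sample).sum() / quality.sum().clamp(min=1e-6)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```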
- Clone this repository and navigate to the IVM folder
git clone https://github.com/2toinf/IVM.git
cd IVM
- Install Package
conda create -n IVM python=3.10 -y
conda activate IVM
pip install -e .
from IVM import load, forward_batch
from PIL import Image  # needed to open the demo image

ckpt_path = "IVM-V1.0.bin"  # your model path here
model = load(ckpt_path, low_gpu_memory=False)  # set `low_gpu_memory=True` if you don't have enough GPU memory

image = Image.open("image/demo/robot.jpg")  # your image path
instruction = "pick up the red cup and place it on the green pan"
result = forward_batch(model, [image], [instruction], threshold=0.99)

# visualize the IVM-processed image
from matplotlib import pyplot as plt
import numpy as np

plt.imshow(result[0].astype(np.uint8))
plt.show()
For more interesting cases, please refer to demo.ipynb.
| Models | Base model | Params (M) | Iters | ckpt |
|---|---|---|---|---|
| IVM-V1.0 | LLaVA-1.5-7B + SAM-H | 64 | 1M | HF-link |
We welcome everyone to explore more IVM training methods and to scale IVM up further!
Please first preprocess the test images using our IVM model, then follow the official instructions for evaluation; a minimal preprocessing sketch is given after the links below.
V* Bench: https://github.com/penghao-wu/vstar?tab=readme-ov-file#evaluation
Traditional VQA benchmark: https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#evaluation
Policy Learning: https://github.com/Facebear-ljx/BearRobot
Robot Infrastructure: https://github.com/rail-berkeley/bridge_data_robot
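As a rough illustration of this preprocessing step, the sketch below reuses the quick-start API from above; the input/output folders, the per-image question, and saving `result[0]` as an RGB image are assumptions for this example, not part of the official evaluation scripts.

```python
import os

import numpy as np
from PIL import Image

from IVM import load, forward_batch

model = load("IVM-V1.0.bin", low_gpu_memory=False)

src_dir, dst_dir = "eval_images", "eval_images_ivm"  # hypothetical folders
os.makedirs(dst_dir, exist_ok=True)

for name in sorted(os.listdir(src_dir)):
    image = Image.open(os.path.join(src_dir, name)).convert("RGB")
    question = "..."  # pair each image with its benchmark question/instruction
    result = forward_batch(model, [image], [question], threshold=0.99)
    # save the IVM-processed image under the same file name for later evaluation
    Image.fromarray(result[0].astype(np.uint8)).save(os.path.join(dst_dir, name))
```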
Please download the annotations of our IVM-Mix-1M dataset. We provide over 1M image-instruction pairs with corresponding mask labels. IVM-Mix-1M consists of three parts: HumanLabelData, RobotMachineData, and VQAMachineData. For HumanLabelData and RobotMachineData, we provide well-organized images, mask labels, and language instructions. For VQAMachineData, we only provide mask labels and language instructions; please download the images from the constituent datasets:
- COCO: train2017, train2014
- GQA: images
- TextVQA: train_val_images
- VisualGenome: part1, part2
- Flickr30k: homepage
- Open Images: download script (we only use splits 0-5)
- VSR: images
After downloading all of them, organize the data as follows:
├── coco
│   ├── train2017
│   └── train2014
├── gqa
│   └── images
├── textvqa
│   └── train_images
├── vg
│   ├── VG_100K
│   └── VG_100K_2
├── flickr30k
│   └── images
├── vsr
└── openimages
We provide sample code for reading the data as a reference.
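For orientation only, a sketch of what such a reading routine might look like is below; the annotation file name and field names (`annotations.json`, `image_path`, `mask_path`, `instruction`) are assumptions for illustration, so defer to the released sample code and the actual dataset schema.

```python
import json

import numpy as np
from PIL import Image

# Hypothetical annotation layout: a JSON list of samples, each pointing to an
# image, its instruction-relevance mask, and the language instruction.
with open("HumanLabelData/annotations.json") as f:
    annotations = json.load(f)

sample = annotations[0]
image = Image.open(sample["image_path"]).convert("RGB")
mask = np.array(Image.open(sample["mask_path"]))  # mask over instruction-irrelevant regions
instruction = sample["instruction"]
print(image.size, mask.shape, instruction)
```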
This work is built upon LLaVA, SAM, and LISA.
@article{zheng2024instruction,
title={Instruction-Guided Visual Masking},
author={Zheng, Jinliang and Li, Jianxiong and Cheng, Sijie and Zheng, Yinan and Li, Jiaming and Liu, Jihao and Liu, Yu and Liu, Jingjing and Zhan, Xianyuan},
journal={arXiv preprint arXiv:2405.19783},
year={2024}
}