[📚paper] [project page] [🤗Dataset] [🤗model]
We introduce Instruction-guided Visual Masking (IVM), a new versatile visual grounding model that is compatible with diverse multimodal models, such as LMMs and robot models. By constructing visual masks for instruction-irrelevant regions, IVM-enhanced multimodal models can effectively focus on task-relevant image regions to better align with complex instructions. Specifically, we design a visual masking data generation pipeline and create the IVM-Mix-1M dataset with 1 million image-instruction pairs. We further introduce a new learning technique, Discriminator Weighted Supervised Learning (DWSL), for preferential IVM training that prioritizes high-quality data samples. Experimental results on generic multimodal tasks such as VQA, as well as on embodied robotic control, demonstrate the versatility of IVM, which, as a plug-and-play tool, significantly boosts the performance of diverse multimodal models.
Demo examples: "Duck on green plate" | "Red cup on red plate" | "Red cup on silver pan"
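To make the DWSL idea more concrete, here is a minimal sketch of discriminator-weighted supervised learning under simplifying assumptions: a hypothetical `mask_model` that predicts mask logits and a hypothetical `discriminator` that scores label quality, with per-sample losses re-weighted by that score so higher-quality annotations contribute more. The actual IVM objective and architecture are defined in the paper and code, not by this snippet.

```python
# Rough sketch of discriminator-weighted supervised learning (not the official
# implementation): per-sample losses are re-weighted by a quality score.
import torch
import torch.nn.functional as F

def dwsl_step(mask_model, discriminator, images, instructions, mask_labels, optimizer):
    pred_masks = mask_model(images, instructions)            # (B, H, W) mask logits
    per_sample = F.binary_cross_entropy_with_logits(
        pred_masks, mask_labels, reduction="none"            # float mask labels in [0, 1]
    ).mean(dim=(1, 2))                                        # (B,) per-sample loss
    with torch.no_grad():
        # hypothetical discriminator returning a quality logit per sample
        quality = torch.sigmoid(discriminator(images, mask_labels, instructions))
    # weighted average: high-quality samples dominate the update
    loss = (quality * per_sample).sum() / quality.sum().clamp(min=1e-6)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```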
- Clone this repository and navigate to the IVM folder
git clone https://github.com/2toinf/IVM.git
cd IVM
- Install Package
conda create -n IVM python=3.10 -y
conda activate IVM
pip install -e .
from IVM import load, forward_batch
from PIL import Image  # needed to open the demo image

ckpt_path = "IVM-V1.0.bin"  # your model path here
model = load(ckpt_path, low_gpu_memory=False)  # set `low_gpu_memory=True` if you don't have enough GPU memory

image = Image.open("image/demo/robot.jpg")  # your image path
instruction = "pick up the red cup and place it on the green pan"
result = forward_batch(model, [image], [instruction], threshold=0.99)

# visualize the IVM-processed image
from matplotlib import pyplot as plt
import numpy as np

plt.imshow(result[0].astype(np.uint8))
plt.show()
For more interesting cases, please refer to demo.ipynb.
| Models | Base model | Params (M) | Iters | ckpt |
|---|---|---|---|---|
| IVM-V1.0 | LLaVA-1.5-7B + SAM-H | 64 | 1M | HF-link |
We welcome everyone to explore more IVM training methods and to scale IVM up further!
Please first preprocess the test images using our IVM model, then follow the official instructions for evaluation; a minimal preprocessing sketch is given after the links below.
V* Bench: https://github.com/penghao-wu/vstar?tab=readme-ov-file#evaluation
Traditional VQA benchmark: https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#evaluation
Policy Learning: https://github.com/Facebear-ljx/BearRobot
Robot Infrastructure: https://github.com/rail-berkeley/bridge_data_robot
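As a rough illustration of this preprocessing step, the sketch below reuses the quick-start API from above; the input/output folders, the per-image question, and saving `result[0]` as an RGB image are assumptions for this example, not part of the official evaluation scripts.

```python
import os

import numpy as np
from PIL import Image

from IVM import load, forward_batch

model = load("IVM-V1.0.bin", low_gpu_memory=False)

src_dir, dst_dir = "eval_images", "eval_images_ivm"  # hypothetical folders
os.makedirs(dst_dir, exist_ok=True)

for name in sorted(os.listdir(src_dir)):
    image = Image.open(os.path.join(src_dir, name)).convert("RGB")
    question = "..."  # pair each image with its benchmark question/instruction
    result = forward_batch(model, [image], [question], threshold=0.99)
    # save the IVM-processed image under the same file name for later evaluation
    Image.fromarray(result[0].astype(np.uint8)).save(os.path.join(dst_dir, name))
```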
Please download the annotations of our IVM-Mix-1M dataset. We provide over 1M image-instruction pairs with corresponding mask labels. IVM-Mix-1M consists of three parts: HumanLabelData, RobotMachineData, and VQAMachineData. For HumanLabelData and RobotMachineData, we provide well-organized images, mask labels, and language instructions. For VQAMachineData, we only provide mask labels and language instructions; please download the images from the constituent datasets:
- COCO: train2017, train2014
- GQA: images
- TextVQA: train_val_images
- VisualGenome: part1, part2
- Flickr30k: homepage
- Open Images: download script (we only use splits 0-5)
- VSR: images
After downloading all of them, organize the data as follows:
├── coco
│   ├── train2017
│   └── train2014
├── gqa
│   └── images
├── textvqa
│   └── train_images
├── vg
│   ├── VG_100K
│   └── VG_100K_2
├── flickr30k
│   └── images
├── vsr
└── openimages
We provide sample code for reading the data as a reference.
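For orientation only, a sketch of what such a reading routine might look like is below; the annotation file name and field names (`annotations.json`, `image_path`, `mask_path`, `instruction`) are assumptions for illustration, so defer to the released sample code and the actual dataset schema.

```python
import json

import numpy as np
from PIL import Image

# Hypothetical annotation layout: a JSON list of samples, each pointing to an
# image, its instruction-relevance mask, and the language instruction.
with open("HumanLabelData/annotations.json") as f:
    annotations = json.load(f)

sample = annotations[0]
image = Image.open(sample["image_path"]).convert("RGB")
mask = np.array(Image.open(sample["mask_path"]))  # mask over instruction-irrelevant regions
instruction = sample["instruction"]
print(image.size, mask.shape, instruction)
```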
This work is built upon LLaVA, SAM, and LISA.
@article{zheng2024instruction,
title={Instruction-Guided Visual Masking},
author={Zheng, Jinliang and Li, Jianxiong and Cheng, Sijie and Zheng, Yinan and Li, Jiaming and Liu, Jihao and Liu, Yu and Liu, Jingjing and Zhan, Xianyuan},
journal={arXiv preprint arXiv:2405.19783},
year={2024}
}