One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

Zechen Bai ¹ Tong He ² Haiyang Mei ¹ Pichao Wang ²

Ziteng Gao ¹ Joya Chen ¹ Lei Liu ² Zheng Zhang ² Mike Zheng Shou ¹

NeurIPS 2024

¹ Show Lab, National University of Singapore ² Amazon

News

[2024-12-08] We updated the inference example and evaluation instructions on all datasets.
[2024-11-27] We released the ReasonVOS benchmark!
[2024-11-26] We released the pre-trained VideoLISA-3.8B on HuggingFace!
[2024-11-20] We released the training and inference code.
[2024-09-29] We released our paper on arXiv.

TODO

Release the inference code.
Release the training code.
Instructions on supporting more datasets.

Setup Environment

git clone https://github.com/showlab/VideoLISA.git

conda create -n videolisa python=3.10 -y
conda activate videolisa
pip install --upgrade pip  # enable PEP 660 support


# for cuda 11.8
pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu118
# for cuda 12.1
pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu121

pip install -e .

pip install git+https://github.com/huggingface/transformers@a98c41798cf6ed99e1ff17e3792d6e06a2ff2ff3
pip install flash-attn --no-build-isolation

Inference Example

CUDA_VISIBLE_DEVICES=0 python chat.py \
  --version="ZechenBai/VideoLISA-3.8B" \
  --vision_tower="openai/clip-vit-large-patch14-336" \
  --num_frames_dense=4 \
  --num_frames_sparse=32 \
  --save_overlay

> Please input your prompt: In this video, there is something that shocks the cat and makes it jump. Can you find the object?
> Please input the video path: examples/RBrZsgy4-SQ.mp4
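
The flag names --num_frames_dense and --num_frames_sparse suggest a two-level frame sampling scheme. The sketch below is an illustration only, not code from chat.py: it assumes that the sparse frames are spread evenly over the video and that the dense frames are an evenly spread subset of the sparse ones. The helper uniform_indices and the 64-frame video length are hypothetical.

```python
def uniform_indices(total: int, count: int) -> list[int]:
    """Pick `count` indices spread evenly over range(total)."""
    if count >= total:
        return list(range(total))
    step = total / count
    # Take the midpoint of each of the `count` equal-width bins.
    return [int(step * i + step / 2) for i in range(count)]


num_video_frames = 64   # hypothetical video length
num_frames_sparse = 32  # value used in the command above
num_frames_dense = 4    # value used in the command above

# Assumption: sparse frames cover the whole video evenly.
sparse = uniform_indices(num_video_frames, num_frames_sparse)
# Assumption: dense frames are an evenly spread subset of the sparse frames.
dense = [sparse[i] for i in uniform_indices(len(sparse), num_frames_dense)]

print(len(sparse))  # 32
print(dense)        # [9, 25, 41, 57]
```

Under these assumptions, raising --num_frames_sparse increases temporal coverage, while raising --num_frames_dense increases the number of frames picked from that set; check chat.py for the actual behavior.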

Prepare Data for Training

First, please prepare the image data following the instructions in LISA.

Below we introduce the video datasets used in this project. Note that the data paths for the video datasets are currently hard-coded in each dataset file in the utils folder. You may need to adjust them accordingly.

ReasonVOS

Please refer to BENCHMARK.md

MeViS

Download the dataset from the official release. Then, extract and organize the files. We expect the directory structure to be the following:

mevis
├── train                       // Split Train
│   ├── JPEGImages
│   │   ├── <video #1  >
│   │   ├── <video #2  >
│   │   └── <video #...>
│   │
│   ├── mask_dict.json
│   └── meta_expressions.json
│
├── valid_u                     // Split Val^u
│   ├── JPEGImages
│   │   └── <video ...>
│   │
│   ├── mask_dict.json
│   └── meta_expressions.json
│
└── valid                       // Split Val
    ├── JPEGImages
    │   └── <video ...>
    │
    └── meta_expressions.json
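
Before training, it can help to confirm that the extracted files match the tree above. The following is a minimal sketch, not part of the repository: the function name missing_entries and the default root folder "mevis" are assumptions, and the expected entries are copied from the tree.

```python
from pathlib import Path

# Expected entries per split, copied from the directory tree above.
EXPECTED = {
    "train": ["JPEGImages", "mask_dict.json", "meta_expressions.json"],
    "valid_u": ["JPEGImages", "mask_dict.json", "meta_expressions.json"],
    "valid": ["JPEGImages", "meta_expressions.json"],
}


def missing_entries(root) -> list[str]:
    """Return the expected paths (relative to `root`) that do not exist."""
    root = Path(root)
    missing = []
    for split, names in EXPECTED.items():
        for name in names:
            path = root / split / name
            if not path.exists():
                missing.append(path.relative_to(root).as_posix())
    return missing


# Hypothetical dataset location; replace with your own path.
problems = missing_entries("mevis")
if problems:
    print("Missing:", ", ".join(problems))
else:
    print("MeViS layout looks complete.")
```

The check only looks at names, not at file contents, so it catches a misplaced or partially extracted archive but not a corrupted one.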

Ref-YouTube-VOS and Ref-DAVIS-17

Prepare Ref-YouTube-VOS and Ref-DAVIS-17 datasets following the instructions of ReferFormer.

YouTube-VOS

Download the dataset from the website and organize it as follows:

YTVOS
├── train
│   ├── JPEGImages
│   ├── Annotations
│   ├── meta.json

Training

We provide a sample training script in run_train.sh. In our own experiments, we trained the model on 8 nodes (64 A10 24G GPUs in total). Under this setting, we set batch_size=2 and grad_accumulation_steps=1, so the global effective batch size is batch_size*grad_accumulation_steps*num_gpus=128. You can modify these settings to fit your hardware; note, however, that we did not explore other training hyper-parameters. If you don't have enough GPUs, don't give up: you can still try training the model with a smaller batch size. One tip: if you use a smaller batch size, reducing the learning rate as well might help.

After training finishes, run the following to get the full model weights:
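
The batch-size arithmetic above can be sketched as follows. The learning-rate adjustment uses the linear scaling rule, a common heuristic that is an assumption here and was not validated in our experiments; the 8-GPU setup and the base learning rate of 3e-4 are hypothetical placeholders, so check run_train.sh for the real value.

```python
def global_batch_size(batch_size: int, grad_accumulation_steps: int, num_gpus: int) -> int:
    """Effective number of samples contributing to one optimizer step."""
    return batch_size * grad_accumulation_steps * num_gpus


def linearly_scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling rule (a heuristic): learning rate proportional to batch size."""
    return base_lr * new_batch / base_batch


# Reference setting from the text: 64 GPUs, batch_size=2, no accumulation.
reference = global_batch_size(2, 1, 64)
print(reference)  # 128

# Hypothetical smaller setup with 8 GPUs.
small = global_batch_size(2, 1, 8)
print(small)  # 16

# Option 1: recover the reference batch size with gradient accumulation.
print(global_batch_size(2, 8, 8))  # 128

# Option 2: keep the small batch and lower the learning rate instead.
base_lr = 3e-4  # hypothetical placeholder, not the repository's value
print(linearly_scaled_lr(base_lr, reference, small))
```

Gradient accumulation keeps the optimization setting closest to the reference run at the cost of longer steps, while scaling the learning rate is cheaper but moves further from the tested configuration.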

cd ./runs/video-lisa-3.8b-3k-iter/ckpt_model && python zero_to_fp32.py . ../pytorch_model.bin

Weight merging

Since the script performs LoRA training with DeepSpeed by default, you need to merge the LoRA weights back into the model after training.

CUDA_VISIBLE_DEVICES="" python merge_lora_weights_and_save_hf_model.py \
  --version="MBZUAI/LLaVA-Phi-3-mini-4k-instruct" \
  --weight="runs/video-lisa-3.8b-3k-iter/pytorch_model.bin" \
  --save_path="runs/video-lisa-3.8b-3k-iter/merged"
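
For readers unfamiliar with what merging means, the toy sketch below shows the standard LoRA update W' = W + (alpha / r) * B A on small matrices in plain Python. It is an illustration of the general technique, not the logic of merge_lora_weights_and_save_hf_model.py; the helper names and all numbers are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Fold a LoRA adapter into the base weight: W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: a 2x2 base weight with a rank-1 adapter.
W = [[1, 0], [0, 1]]  # base weight, shape (2, 2)
A = [[1, 2]]          # LoRA down-projection, shape (rank, 2)
B = [[1], [3]]        # LoRA up-projection, shape (2, rank)
alpha, rank = 2, 1

merged = merge_lora(W, A, B, alpha, rank)
print(merged)  # [[3.0, 4.0], [6.0, 13.0]]

# The merged weight gives the same output as base weight plus adapter.
x = [[1], [1]]
adapter_out = [
    [base[0] + (alpha / rank) * lora[0]]
    for base, lora in zip(matmul(W, x), matmul(B, matmul(A, x)))
]
print(adapter_out == matmul(merged, x))  # True
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged checkpoint can be loaded like an ordinary model.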

Evaluation

MeViS

Before running the following commands, check the scripts involved and configure the data paths.

# Step 1
bash evaluation/mevis_val_u/run_inference_mevis.sh

# Step 2
bash evaluation/mevis_val_u/run_eval_mevis.sh

ReasonVOS

# Step 1
bash evaluation/reason_vos/run_inference_reason_vos.sh

# Step 2
bash evaluation/reason_vos/run_eval.sh

Ref-YouTube-VOS

bash evaluation/refytvos/run_inference_refytvos.sh

Submit your result to the online evaluation server.

Ref-DAVIS-17

# Step 1
bash evaluation/refdavis/run_inference_refdavis.sh

# Step 2
bash evaluation/refdavis/run_post_process.sh

Citation

To cite the paper and model, please use the BibTeX entry below:

@article{bai2024one,
  title={One token to seg them all: Language instructed reasoning segmentation in videos},
  author={Bai, Zechen and He, Tong and Mei, Haiyang and Wang, Pichao and Gao, Ziteng and Chen, Joya and Liu, Lei and Zhang, Zheng and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2409.19603},
  year={2024}
}

Acknowledgments

This work is heavily based on LISA, LLaVA, LLaVA-pp, Segment-Anything and Phi-3. Thanks to all the authors for their great work!
