Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang and Yu Qiao
- 2024/06/29: The instruction data for VideoChat2_HD is updated in VideoChat2-IT, which helps produce more detailed and accurate responses.
- 2024/06/25: We release a branch of VideoChat2 using `vllm`, which speeds up VideoChat2 inference.
- 2024/06/19: 🎉🎉 Our VideoChat2 achieves the best performance among open-sourced VideoLLMs on MLVU, a multi-task long video understanding benchmark.
- 2024/06/13: Fix some bugs and add testing scripts:
  - ⚠️ We replace some repeated (~30) QAs in MVBench, which may affect the results by only ~0.5%.
  - 📢 We provide scripts for testing EgoSchema and Video-MME; please check demo_mistral.ipynb and demo_mistral_hd.ipynb.
- 2024/06/07: 🔥🔥🔥 We release VideoChat2_HD, which is fine-tuned with high-resolution data and is capable of handling more diverse tasks. It showcases better performance on different benchmarks, especially for detailed captioning. Furthermore, it achieves 54.8% on Video-MME, the best score among 7B MLLMs. Have a try! 🏃🏻♀️🏃🏻
- 2024/06/06: We release VideoChat2_phi3, a faster model with robust performance.
- 2024/05/22: We release VideoChat2_mistral, which shows better capacity on diverse tasks (60.4% on MVBench, 78.6% on NExT-QA, 63.8% on STAR, 46.4% on TVQA, 54.4% on EgoSchema-full and 80.5% on IntentQA). More details have been updated in the paper.
- 2024/04/05: MVBench is selected as Poster (Highlight)! 🎉🎉
- 2024/02/27: MVBench is accepted by CVPR2024! 🎉🎉
- 2023/12/17: Online Leaderboard:
- We maintain an online leaderboard on HuggingFace.
- Evaluation results of GPT-4V and Gemini Pro are added.
- 2023/12/04: Brief introduction:
- 2023/11/29: Release VideoChat2 and MVBench:
- VideoChat2 is a robust baseline built on UMT and Vicuna-v0.
- 2M diverse instruction data are released for effective tuning.
- MVBench is a comprehensive benchmark for video understanding.
Stage1 aligns UMT-L, the visual encoder, with QFormer to efficiently compress extensive visual inputs. Stage2 extends this connection to incorporate the LLM, while Stage3 focuses on effective instruction tuning to enhance model performance.
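For intuition, here is a minimal sketch, assuming PyTorch, of how a QFormer-style module compresses a long sequence of visual tokens into a small, fixed set of query embeddings before they are handed to the LLM. It is not the repository's implementation, and all dimensions and module choices below are illustrative.

```python
# Illustrative sketch of QFormer-style compression (not the repository code).
import torch
import torch.nn as nn

class QueryCompressor(nn.Module):
    def __init__(self, num_queries=32, vis_dim=1024, llm_dim=4096):
        super().__init__()
        # Learnable query tokens that summarize the visual token sequence.
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)  # map to the LLM embedding size

    def forward(self, visual_tokens):
        # visual_tokens: (batch, num_frames * num_patches, vis_dim) from the visual encoder
        q = self.queries.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        compressed, _ = self.cross_attn(q, visual_tokens, visual_tokens)
        return self.proj(compressed)  # (batch, num_queries, llm_dim)

# Example: 8 frames x 256 patches -> 32 query tokens for the LLM.
tokens = torch.randn(2, 8 * 256, 1024)
print(QueryCompressor()(tokens).shape)  # torch.Size([2, 32, 4096])
```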
We build diverse instruction data with 2M samples from 34 distinct sources. Check DATA for more details.
| Stage | ViT | QFormer | LLM | LoRA | Shell (Vicuna) | Model (Vicuna) | Shell (Mistral) | Model (Mistral) | Shell (Phi3) | Model (Phi3) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Stage1 | ❄️ | 🔥 | 🚫 | 🚫 | config & run | 🤗ckpt | SAME | SAME | SAME | SAME |
| Stage2 | 🔥 | 🔥 | ❄️ | 🚫 | config & run | 🤗ckpt | config & run | 🤗ckpt | config & run | 🤗ckpt |
| Stage3 | 🔥 | 🔥 | ❄️ | 🔥 | config & run | 🤗ckpt | config & run | 🤗ckpt | config & run | 🤗ckpt |
| Stage4_HD | 🔥 | 🔥 | ❄️ | 🔥 | - | - | config & run | 🤗ckpt | - | - |
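The ❄️/🔥/🚫 marks above indicate which components are frozen, trained, or unused at each stage. As a rough illustration of the Stage3 row, here is a minimal sketch, assuming a Hugging Face LLM and the `peft` library, of freezing the LLM while attaching trainable LoRA adapters; the target modules and hyperparameters are assumptions, not the repository's actual settings.

```python
# Illustrative only: freeze the base LLM (❄️) and train LoRA adapters (🔥).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

llm = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Freeze all base weights; only the LoRA adapters will receive gradients.
for p in llm.parameters():
    p.requires_grad = False

lora_cfg = LoraConfig(
    r=16,                                 # LoRA rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (assumption)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_cfg)
llm.print_trainable_parameters()  # only the LoRA weights are trainable
```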
You can refer to demo.ipynb for inference.
| Model | MVBench | Video-MME | Video-MME w/ subtitles | Video ChatGPT | NExT-QA (in-domain) | STAR (zero-shot) | TVQA (zero-shot) | EgoSchema (full) | EgoSchema (subset) | IntentQA (in-domain Val) | IntentQA (in-domain Test) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VideoChat2 (Vicuna) | 51.1 | - | - | 2.98 | 68.6 | 59.0 | 40.6 | - | - | - | - |
| VideoChat2 (Phi3) | 55.1 | - | - | 2.91 | 73.1 | 63.3 | 40.1 | 56.7 | 59.8 | 69.0 | 71.6 |
| VideoChat2 (Mistral) | 60.4 | 42.3 | 54.6 | 2.95 | 78.6 | 63.8 | 46.4 | 54.4 | 63.6 | 80.5 | 81.9 |
| VideoChat2_HD (Mistral) | 62.3 | 45.3 | 55.7 | 3.10 | 79.5 | 63.9 | 50.6 | 55.8 | 65.6 | 81.1 | 83.4 |
- (2024/06/07) For Video-MME, our current version has some missing videos and subtitles, see issue
- Missing videos: Short (2), Medium (3), Long (11)
- Missing subtitles: Short (93), Medium (52), Long (10)
- For VideoChatGPT, VideoChat2_mistral and VideoChat2_phi3 are evaluated with `gpt-3.5-turbo-0125`, while VideoChat2_vicuna used `gpt-3.5-turbo-1106`.
- For NExT-QA, we report in-domain results since its training set is used as instruction data.
- For STAR, we input 32 frames; for the other datasets, we input 16 frames.
- For IntentQA, we report results on the validation and testing splits.
- For testing EgoSchema and Video-MME, please check the demo_mistral.ipynb and demo_mistral_hd.ipynb.
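For reference, here is a minimal sketch of uniform frame sampling as described in the notes above (16 frames for most datasets, 32 for STAR). It assumes `decord` is installed and is not the repository's actual data loader.

```python
# Illustrative sketch: uniformly sample a fixed number of frames from a video.
import numpy as np
from decord import VideoReader, cpu

def sample_frames(video_path, num_frames=16):
    vr = VideoReader(video_path, ctx=cpu(0))
    total = len(vr)
    # Take the middle frame of each of `num_frames` equal-length segments.
    indices = np.linspace(0, total - 1, num=num_frames * 2 + 1)[1::2].astype(int)
    return vr.get_batch(indices).asnumpy()  # (num_frames, H, W, 3), uint8

# frames = sample_frames("example.mp4", num_frames=32)  # e.g. 32 frames for STAR
```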
- Prepare the environment:

  ```shell
  conda create -n videochat2 python=3.9
  conda activate videochat2
  pip install -r requirements.txt
  ```
- Stage1 training:
  - Download the UMT-L/16 model and set `pretrained` in stage1_config

  ```shell
  bash scripts/videochat_vicuna/run_7b_stage1.sh
  ```
- Stage2 training:
  - Set `vit_blip_model_path` and `llama_model_path` in vicuna_stage2_config, or `mistral_model_path` in mistral_stage2_config
  - For VideoBLIP, you can download the Stage1 model
  - For the LLM, please follow here to prepare vicuna-7b-v0, or directly download Mistral-7B-Instruct-v0.2

  ```shell
  # Vicuna
  bash scripts/videochat_vicuna/run_7b_stage2.sh
  # Mistral
  bash scripts/videochat_mistral/run_7b_stage2.sh
  ```
- Stage3 training:
  - Download the instruction data and set `data_dir` in instruction_data.py
  - Set `vit_blip_model_path`, `llama_model_path` and `videochat2_model_path` in vicuna_stage3_config or mistral_stage3_config
  - You can download the Stage2 model and create instruction data for your own tuning

  ```shell
  # Vicuna
  bash scripts/videochat_vicuna/run_7b_stage3.sh
  # Mistral
  bash scripts/videochat_mistral/run_7b_stage3.sh
  ```
- Running the demo:
  - Jupyter Notebook: demo.ipynb
  - Gradio:

  ```shell
  # Set the related model paths in configs/config.json and demo/demo.py
  python demo/demo.py
  ```
- Evaluation:
  - MVBench: mvbench.ipynb. The script is written for Vicuna; for Mistral, please follow demo_mistral.ipynb to adapt it.
  - For the VideoChatGPT Benchmark, we follow the original repo and use ChatGPT-3.5 to evaluate the performance.
  - For NExT-QA, STAR and TVQA, we follow SeViLA to prepare the data, then slightly modify mvbench.ipynb to directly output the options and compute accuracy (see the sketch after this list).
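Since the multiple-choice benchmarks are scored by letting the model output an option directly, the accuracy computation reduces to matching the predicted option letter against the ground truth. Here is a minimal sketch; the answer-parsing rule is an assumption, not the exact logic in mvbench.ipynb.

```python
# Illustrative sketch of option-matching accuracy for multiple-choice QA.
import re

def parse_option(response: str) -> str:
    """Extract the first standalone option letter (A-E) from a model response."""
    m = re.search(r"\b([A-E])\b", response.strip().upper())
    return m.group(1) if m else ""

def accuracy(responses, ground_truth):
    correct = sum(parse_option(r) == gt for r, gt in zip(responses, ground_truth))
    return correct / max(len(ground_truth), 1)

print(accuracy(["Answer: (C) a dog", "B."], ["C", "A"]))  # 0.5
```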
📊 MVBench
We propose a comprehensive video understanding benchmark with 20 challenging video tasks, where our VideoChat2 secures the top ranking on 15 tasks. More details can be found here.
The online leaderboard is hosted on 🤗 Hugging Face.
If you find this project useful in your research, please consider citing:
@article{2023videochat,
title={VideoChat: Chat-Centric Video Understanding},
  author={KunChang Li and Yinan He and Yi Wang and Yizhuo Li and Wenhai Wang and Ping Luo and Yali Wang and Limin Wang and Yu Qiao},
journal={arXiv preprint arXiv:2305.06355},
year={2023}
}
@misc{li2023mvbench,
title={MVBench: A Comprehensive Multi-modal Video Understanding Benchmark},
author={Kunchang Li and Yali Wang and Yinan He and Yizhuo Li and Yi Wang and Yi Liu and Zun Wang and Jilan Xu and Guo Chen and Ping Luo and Limin Wang and Yu Qiao},
year={2023},
eprint={2311.17005},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Thanks to the following open-source projects:
InternVid, UMT, MiniGPT-4, LLaVA, BLIP2, VideoChatGPT, Vicuna, M3-IT.