In this study, we initiate an exploration into video understanding by introducing VideoChat, an end-to-end chat-centric video understanding system. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instruction-tune this system, we propose a video-centric instruction dataset composed of thousands of videos matched with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and causal relationships, providing a valuable asset for training chat-centric video understanding systems. Preliminary qualitative experiments reveal our system's potential across a broad spectrum of video applications and set the standard for future research.
- 2023/05/12: Release the 7B version:
- 🎊 Model-7B: 7B requires ~20GB GPU memory, while 13B requires ~32GB GPU memory.
- 2023/05/11: Release the 🦜VideoChat V1, which can handle both image and video understanding!
- 🎊 Model-13B and Data.
- 🤗 Online Demo
- 🧑‍🔧 Tuning scripts are being cleaned up.
- Small-scale video instruction data and tuning
- Instruction tuning on BLIP+UniFormerV2+Vicuna
- Large-scale and complex video instruction data
- Instruction tuning on strong video foundation model
- User-friendly interactions with longer videos
- ...
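The memory figures in the release notes above (~20GB for 7B, ~32GB for 13B) suggest a simple rule for choosing which model to run. The following is a minimal sketch, not part of the repository: the helper name `pick_config` is hypothetical, and it assumes that config.json belongs to the 13B model and config_7b.json to the 7B model.

```python
from typing import Optional

# Approximate GPU memory needs quoted in the release notes (in GB).
REQUIRED_GB = {"13B": 32, "7B": 20}

# Assumption: config.json is the 13B config and config_7b.json the 7B one.
CONFIG_FOR = {"13B": "config.json", "7B": "config_7b.json"}


def pick_config(available_gb: float) -> Optional[str]:
    """Return the config file for the largest model that fits, or None."""
    for size in ("13B", "7B"):  # try the larger model first
        if available_gb >= REQUIRED_GB[size]:
            return CONFIG_FOR[size]
    return None
```

For example, `pick_config(24)` selects the 7B config, while a 16GB card returns `None` because neither model is expected to fit.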
💬 Example Online🦜
Our VideoChat can handle both image and video understanding well!
- Prepare the environment:
pip install -r requirements.txt
- Download BLIP2 model:
- ViT:
wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth
- QFormer:
wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth
- Change the vit_model_path and q_former_model_path in config.json or config_7b.json.
- Download StableVicuna model:
- LLaMA: Download it from the original repo or Hugging Face.
- If you download LLAMA from the original repo, please process it via the following command:
# convert_llama_weights_to_hf is copied from transformers
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/downloaded/llama/weights \
    --model_size 13B --output_dir /output/path
- For 13B: Download stable-vicuna-13b-delta and process it:
# fastchat v0.1.10
python3 apply_delta.py \
    --base /path/to/model_weights/llama-13b \
    --target stable-vicuna-13b \
    --delta CarperAI/stable-vicuna-13b-delta
- For 7B: Download vicuna-7b-delta-v0 and process it:
# fastchat v0.1.10
python3 apply_delta.py \
    --base /path/to/model_weights/llama-7b \
    --target vicuna-7b-v0 \
    --delta CarperAI/vicuna-7b-delta-v0
- Change the llama_model_path in config.json or config_7b.json.
- Download VideoChat-13B or VideoChat-7B:
- Change the videochat_model_path in config.json or config_7b.json.
- Run the demo with Gradio:
python demo.py
- Another demo, as a Jupyter Notebook, can be found in demo.ipynb.
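The setup steps above ask for four path entries to be edited by hand: vit_model_path, q_former_model_path, llama_model_path, and videochat_model_path. As a minimal sketch, the script below fills them in programmatically. The helper names `set_key` and `update_config` are hypothetical and not part of the repository. Because the exact layout of config.json is not shown here, the sketch assumes only that each key already exists somewhere in the JSON, and it searches nested objects for it.

```python
import json
from pathlib import Path


def set_key(node, key, value):
    """Set `key` to `value` wherever it already appears in nested JSON data.

    Returns the number of occurrences that were updated.
    """
    updated = 0
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                node[k] = value  # overwrite in place; dict size is unchanged
                updated += 1
            else:
                updated += set_key(v, key, value)
    elif isinstance(node, list):
        for item in node:
            updated += set_key(item, key, value)
    return updated


def update_config(config_file, paths):
    """Write the given model paths into an existing config file."""
    config_file = Path(config_file)
    config = json.loads(config_file.read_text())

    missing = []
    for key, value in paths.items():
        if set_key(config, key, value) == 0:
            missing.append(key)
    if missing:
        # Refuse to write a config whose expected keys were not found.
        raise KeyError(f"keys not found in {config_file}: {missing}")

    config_file.write_text(json.dumps(config, indent=4))
    return config


# Example call (placeholder paths; adjust to where the files were saved):
# update_config("config_7b.json", {
#     "vit_model_path": "/path/to/eva_vit_g.pth",
#     "q_former_model_path": "/path/to/blip2_pretrained_flant5xxl.pth",
#     "llama_model_path": "/path/to/vicuna-7b-v0",
#     "videochat_model_path": "/path/to/videochat_7b.pth",
# })
```

The file name in the last placeholder is illustrative only. The helper raises `KeyError` rather than adding unknown keys, so a misspelled key fails immediately instead of silently leaving the original value in place.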
If you find this project useful in your research, please consider citing:
@article{2023videochat,
title={VideoChat: Chat-Centric Video Understanding},
author={Li, KunChang and He, Yinan and Wang, Yi and Li, Yizhuo and Wang, Wenhai and Luo, Ping and Wang, Yali and Wang, Limin and Qiao, Yu},
journal={arXiv preprint arXiv:2305.06355},
year={2023}
}
Thanks to the following open-source projects:
InternVideo, UniFormerV2, MiniGPT-4, LLaVA, BLIP2, StableLM.