Currently, this is a simple extension of MiniGPT-4 without extra training. We try to unlock its ability for video understanding with a simple prompt design.
- 2023/04/19: Simple extension release
We simply encode 4 frames and use the following time-sensitive prompt:

```
First, <Img><ImageHere></Img>. Then, <Img><ImageHere></Img>. After that, <Img><ImageHere></Img>. Finally, <Img><ImageHere></Img>.
```

However, without video-text instruction finetuning, it is still difficult for the model to answer questions about time.
Please follow the instructions in MiniGPT-4 to prepare the environment.
- Prepare the environment.
```
conda env create -f environment.yml
conda activate minigpt4
```
- Download BLIP2 model:
- ViT:
```
wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth
```
- QFormer:
```
wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth
```
- Change the `vit_model_path` and `q_former_model_path` in minigpt4.yaml (a sanity-check sketch is given below).
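Since the config keys are easy to mistype, here is a hypothetical sanity check: it searches the loaded minigpt4.yaml for the key names used in this README and verifies that each configured file exists. Only the key names come from this README; the nesting inside the config is not assumed. The same check applies to `llama_model` and `ckpt` in the later steps.

```python
# Hypothetical sanity check for the edited config files.
import os
import yaml  # pip install pyyaml

def find_key(node, key):
    """Yield every value stored under `key` anywhere in a nested dict/list."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)

with open("minigpt4.yaml") as f:
    cfg = yaml.safe_load(f)

for key in ("vit_model_path", "q_former_model_path"):
    for path in find_key(cfg, key):
        print(f"{key} -> {path} (exists: {os.path.exists(path)})")
```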
- Download Vicuna model:
- LLAMA: Download it from the original repo or Hugging Face.
- If you download LLAMA from the original repo, please process it via the following command:
```
# convert_llama_weights_to_hf is copied from transformers
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/downloaded/llama/weights \
    --model_size 7B --output_dir /output/path
```
- Download Vicuna-13b-delta-v0 and process it (a smoke test for the merged weights follows the command):
```
# fastchat v0.1.10
python3 -m fastchat.model.apply_delta \
    --base /path/to/llama-13b \
    --target /output/path/to/vicuna-13b \
    --delta lmsys/vicuna-13b-delta-v0
```
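As an optional smoke test, assuming the output path from the command above, you can load just the config and tokenizer of the merged checkpoint; this is cheap and confirms the directory is a valid LLaMA-family model without loading the full 13B weights.

```python
# Optional smoke test: cheap check that the merged Vicuna directory is valid.
from transformers import AutoConfig, AutoTokenizer

model_dir = "/output/path/to/vicuna-13b"  # target path from the command above
config = AutoConfig.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False)
print(config.model_type, "| vocab size:", tokenizer.vocab_size)
```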
- Change the `llama_model` in minigpt4.yaml.
- Download MiniGPT-4 model:
- The pretrained linear layer can be downloaded here.
- Change the `ckpt` in minigpt4_eval.yaml.
- Run the demo:
```
python demo_video.py --cfg-path eval_configs/minigpt4_eval.yaml
```
This project is mainly based on MiniGPT-4, which is supported by Lavis, Vicuna, and BLIP2. Thanks to these amazing projects!