Kaizhi Zheng*, Xuehai He*, Xin Eric Wang
University of California, Santa Cruz
Large Language Models (LLMs) have garnered significant attention for their advancements in natural language processing, demonstrating unparalleled prowess in text comprehension and generation. Yet, the simultaneous generation of images with coherent textual narratives remains an evolving frontier. In response, we introduce an innovative interleaved vision-and-language generation technique anchored by the concept of "generative vokens", acting as the bridge for harmonized image-text outputs. Our approach is characterized by a distinctive two-staged training strategy focusing on description-free multimodal generation, where the training requires no comprehensive descriptions of images. To bolster model integrity, classifier-free guidance is incorporated, enhancing the effectiveness of vokens on image generation. Our model, MiniGPT-5, exhibits substantial improvement over the baseline Divter model on the MMDialog dataset and consistently delivers superior or comparable multimodal outputs in human evaluations on the VIST dataset, highlighting its efficacy across diverse benchmarks.
1. Download repo and create environment
Clone our repo and create a new Python environment.
git clone https://github.com/eric-ai-lab/MiniGPT-5.git
cd MiniGPT-5
conda create -n minigpt5 python=3.10
conda activate minigpt5
pip install -r requirements.txt
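After activating the environment, it can help to confirm that the active interpreter is the Python 3.10 requested in the `conda create` command. A minimal sketch; `check_python` is a hypothetical helper and not part of the repo:

```python
import sys

EXPECTED = (3, 10)  # matches `conda create -n minigpt5 python=3.10`


def check_python(version_info=sys.version_info, expected=EXPECTED):
    """Return True when the major.minor version equals the expected one."""
    return tuple(version_info[:2]) == expected


if not check_python():
    # Only a warning: other versions may still work, this was not tested here.
    print(f"Warning: running Python {sys.version_info[0]}.{sys.version_info[1]}, "
          f"expected {EXPECTED[0]}.{EXPECTED[1]}")
```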
2. Prepare the pretrained weights
Our model is based on the pretrained MiniGPT-4 (including Vicuna and BLIP-2). Please download the Vicuna V0 7B weights, then set the path to the Vicuna weights in the model config file at Line 16.
Since the pretrained MiniGPT-4 aligned checkpoint is small, it is already included in the config folder, and its path is set in the config file at Line 10.
3. Download MiniGPT-5 Checkpoint
Since our model is trained in two stages (Stage 1: Unimodal Alignment Stage, Stage 2: Multimodal Learning Stage), we provide checkpoints for both stages here:
Stage 1: CC3M | Stage 2: VIST | Stage 2: MMDialog |
---|---|---|
Download | Download | Download |
Stage 2 needs the pretrained weights in Stage 1, so always download Stage 1 weights first.
Please download these weights into a single folder. We refer to this folder as WEIGHT_FOLDER in the following sections.
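Because Stage 2 depends on the Stage 1 weights, a quick check that the needed files are present in WEIGHT_FOLDER can save a failed run. A minimal sketch, assuming the checkpoints keep the file names used in the commands of this README (`stage1_cc3m.ckpt`, `stage2_vist.ckpt`, `stage2_mmdialog.ckpt`); `missing_checkpoints` is a hypothetical helper, not part of the repo:

```python
from pathlib import Path

STAGE1 = "stage1_cc3m.ckpt"  # assumed file name, taken from the README commands


def missing_checkpoints(weight_folder, stage2_name):
    """List the checkpoint files that a Stage 2 run needs but cannot find."""
    folder = Path(weight_folder)
    needed = [STAGE1, stage2_name]  # Stage 1 is always required
    return [name for name in needed if not (folder / name).is_file()]
```

An empty list means both files were found; otherwise the list names what still has to be downloaded.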
We provide a Python file to try our model. This file generates multimodal outputs under the examples folder from a two-turn multimodal input.
cd examples
export IS_STAGE2=True
python3 playground.py --stage1_weight WEIGHT_FOLDER/stage1_cc3m.ckpt \
                      --test_weight WEIGHT_FOLDER/stage2_vist.ckpt
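The same run can also be launched from Python, which keeps `IS_STAGE2` scoped to the child process instead of the whole shell session. A minimal sketch using only the standard library; `playground_invocation` and `run_playground` are hypothetical helpers, and the sketch assumes it is run from the examples folder with the checkpoint names shown above:

```python
import os
import subprocess
from pathlib import Path


def playground_invocation(weight_folder):
    """Build the argument list and environment for the playground command."""
    folder = Path(weight_folder)
    argv = [
        "python3", "playground.py",
        "--stage1_weight", str(folder / "stage1_cc3m.ckpt"),
        "--test_weight", str(folder / "stage2_vist.ckpt"),
    ]
    # Copy of the current environment; the parent process is left untouched.
    env = dict(os.environ, IS_STAGE2="True")
    return argv, env


def run_playground(weight_folder):
    """Run playground.py and raise CalledProcessError if it exits with an error."""
    argv, env = playground_invocation(weight_folder)
    subprocess.run(argv, env=env, check=True)
```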
Our model is evaluated on three datasets: CC3M, VIST, and MMDialog. Due to licensing restrictions, we only share some dataset examples under the datasets folder. If you want to fully test the performance, please download the full datasets and format them into the same data structure as the examples under the datasets folder.
1. Stage 1: Unimodal Alignment Stage (CC3M) evaluation
During this stage, the goal is to generate correct images from the given image descriptions.
Generation (if you have more than one GPU, you can set --gpus to 0,1,2...):
export IS_STAGE2=False
export WEIGHTFOLDER=WEIGHT_FOLDER
export DATAFOLDER=datasets/CC3M
export OUTPUT_FOLDER=outputs
python3 train_eval.py --test_data_path cc3m_val.tsv \
                      --test_weight stage1_cc3m.ckpt \
                      --gpus 0
Calculate Metric:
export CC3M_FOLDER=datasets/CC3M
python3 metric.py --test_weight stage1_cc3m.ckpt
2. Stage 2: Multimodal Learning Stage (VIST) evaluation
The model takes the previous multimodal story sequences and generates either unimodal or multimodal outputs. By default, the code evaluates multimodal input with image generation. To test other settings, please remove the not test condition in Line 280.
Generation:
export IS_STAGE2=True
export WEIGHTFOLDER=WEIGHT_FOLDER
export DATAFOLDER=datasets/VIST
export OUTPUT_FOLDER=outputs
python3 train_eval.py --test_data_path val_cleaned.json \
                      --test_weight stage2_vist.ckpt \
                      --stage1_weight stage1_cc3m.ckpt \
                      --gpus 0
Calculate Metric:
python3 metric.py --test_weight stage2_vist.ckpt
3. Stage 2: Multimodal Learning Stage (MMDialog) evaluation
The model takes the multimodal inputs from previous turns and generates a multimodal response for multimodal conversations.
Generation:
export IS_STAGE2=True
export WEIGHTFOLDER=WEIGHT_FOLDER
export DATAFOLDER=datasets/MMDialog
export OUTPUT_FOLDER=outputs
python3 train_eval.py --test_data_path test/test_conversations.txt \
                      --test_weight stage2_mmdialog.ckpt \
                      --stage1_weight stage1_cc3m.ckpt \
                      --gpus 0
Calculate Metric:
python3 metric.py --test_weight stage2_mmdialog.ckpt
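The three evaluation runs above differ only in the stage flag, the data folder, the test file, and the checkpoints. A sketch that collects those values in one table and renders the matching shell lines; the `EvalRun` class, the `EVAL_RUNS` mapping, and `eval_script` are hypothetical names introduced for illustration, and all values are copied from the commands above:

```python
import shlex
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalRun:
    is_stage2: bool
    data_folder: str
    test_data_path: str
    test_weight: str
    stage1_weight: Optional[str] = None  # only Stage 2 runs pass this


EVAL_RUNS = {
    "cc3m": EvalRun(False, "datasets/CC3M", "cc3m_val.tsv", "stage1_cc3m.ckpt"),
    "vist": EvalRun(True, "datasets/VIST", "val_cleaned.json",
                    "stage2_vist.ckpt", "stage1_cc3m.ckpt"),
    "mmdialog": EvalRun(True, "datasets/MMDialog", "test/test_conversations.txt",
                        "stage2_mmdialog.ckpt", "stage1_cc3m.ckpt"),
}


def eval_script(name, weight_folder="WEIGHT_FOLDER", gpus="0"):
    """Render the export lines and the train_eval.py command for one run."""
    run = EVAL_RUNS[name]
    exports = {
        "IS_STAGE2": str(run.is_stage2),
        "WEIGHTFOLDER": weight_folder,
        "DATAFOLDER": run.data_folder,
        "OUTPUT_FOLDER": "outputs",
    }
    argv = ["python3", "train_eval.py",
            "--test_data_path", run.test_data_path,
            "--test_weight", run.test_weight]
    if run.stage1_weight is not None:
        argv += ["--stage1_weight", run.stage1_weight]
    argv += ["--gpus", gpus]
    lines = [f"export {key}={shlex.quote(value)}" for key, value in exports.items()]
    lines.append(shlex.join(argv))  # quotes any argument the shell could misread
    return "\n".join(lines)
```

Calling `print(eval_script("vist"))` prints a block that can be pasted into a shell; replace `WEIGHT_FOLDER` with the real folder first.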
1. Stage 1 training
Download the CC3M dataset and format it into the same data structure as in the datasets folder.
Here we use the test data as an example:
export IS_STAGE2=False
export WEIGHTFOLDER=WEIGHT_FOLDER
export DATAFOLDER=datasets/CC3M
python3 train_eval.py --is_training True \
                      --train_data_path cc3m_val.tsv \
                      --val_data_path cc3m_val.tsv \
                      --model_save_name stage1_cc3m_{epoch}-{step} \
                      --gpus 0
2. Stage 2 training
Download the VIST or MMDialog dataset and format it into the same data structure as in the datasets folder.
Here we use the VIST test data as an example:
export IS_STAGE2=True
export WEIGHTFOLDER=WEIGHT_FOLDER
export DATAFOLDER=datasets/VIST
python3 train_eval.py --is_training True \
                      --train_data_path val_cleaned.json \
                      --val_data_path val_cleaned.json \
                      --stage1_weight stage1_cc3m.ckpt \
                      --model_save_name stage2_vist_{epoch}-{step} \
                      --gpus 0
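Both training commands share the same shape; Stage 2 additionally needs the Stage 1 checkpoint. A sketch of an argument builder that enforces this requirement, where `training_argv` is a hypothetical helper and the flags are copied from the two commands above. Passing the resulting list to `subprocess.run` bypasses the shell, so the `{epoch}-{step}` placeholders in `--model_save_name` reach the script unchanged:

```python
def training_argv(train_data, val_data, save_name, is_stage2,
                  stage1_weight=None, gpus="0"):
    """Build the train_eval.py argument list for one training stage."""
    if is_stage2 and stage1_weight is None:
        raise ValueError("Stage 2 training needs the Stage 1 checkpoint")
    argv = ["python3", "train_eval.py", "--is_training", "True",
            "--train_data_path", train_data,
            "--val_data_path", val_data]
    if is_stage2:
        argv += ["--stage1_weight", stage1_weight]
    argv += ["--model_save_name", save_name, "--gpus", gpus]
    return argv
```

`IS_STAGE2`, `WEIGHTFOLDER`, and `DATAFOLDER` still have to be set in the environment, as in the export lines of each command.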
@misc{zheng2023minigpt5,
title={MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens},
author={Kaizhi Zheng and Xuehai He and Xin Eric Wang},
year={2023},
journal={arXiv preprint arXiv:2310.02239}
}