Groma: Grounded Multimodal Assistant

Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, Xiaojuan Qi
Project page (https://groma-mllm.github.io)

Installation

Clone the repository

git clone https://github.com/FoundationVision/Groma.git
cd Groma

Create the conda environment and install dependencies

conda create -n groma python=3.9 -y
conda activate groma
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

cd mmcv
MMCV_WITH_OPS=1 pip install -e .
cd ..

Install flash-attention for training

pip install ninja
pip install flash-attn --no-build-isolation
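Once the steps above finish, a quick check can show which of the expected packages are importable in the `groma` environment. The following is a minimal sketch, not part of the repository. The module names (`torch`, `torchvision`, `mmcv`, `flash_attn`) are assumed to match the packages installed above, and `check_modules` is a hypothetical helper.

```python
import importlib.util

# Assumption: these import names correspond to the packages installed above.
EXPECTED_MODULES = ["torch", "torchvision", "mmcv", "flash_attn"]


def check_modules(names):
    """Return {module_name: bool}, True when the top-level module can be located."""
    status = {}
    for name in names:
        # find_spec returns None for a top-level module that is not installed.
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    for name, found in check_modules(EXPECTED_MODULES).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

The check only locates modules without importing them, so it stays fast and does not require a GPU. `flash_attn` is needed only for training.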

Model Weights

To try Groma, download the model weights from Hugging Face.

We additionally provide pretrained checkpoints from intermediate training stages. You can start from any point to customize training.

Training stage	Required checkpoints
Detection pretraining	DINOv2-L
Alignment pretraining	Vicuna-7b-v1.5, Groma-det-pretrain
Instruction finetuning	Groma-7b-pretrain
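The table above can also be written as a small lookup that reports which checkpoints are still missing before a stage is launched. This is a sketch under one assumption: the checkpoints are stored as folders named after the table entries under a single root directory. That layout, the stage keys, and the `missing_checkpoints` helper are all hypothetical and not prescribed by the repository.

```python
from pathlib import Path

# Required checkpoints per training stage, copied from the table above.
# The dictionary keys are illustrative names, not identifiers used by Groma.
REQUIRED_CHECKPOINTS = {
    "detection_pretraining": ["DINOv2-L"],
    "alignment_pretraining": ["Vicuna-7b-v1.5", "Groma-det-pretrain"],
    "instruction_finetuning": ["Groma-7b-pretrain"],
}


def missing_checkpoints(stage, ckpt_root):
    """List the required checkpoints for `stage` that are absent under `ckpt_root`.

    Assumption: each checkpoint lives at <ckpt_root>/<name>.
    """
    root = Path(ckpt_root)
    return [name for name in REQUIRED_CHECKPOINTS[stage] if not (root / name).exists()]


if __name__ == "__main__":
    for stage in REQUIRED_CHECKPOINTS:
        print(stage, "missing:", missing_checkpoints(stage, "checkpoints"))
```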

Prepare Data

We provide instructions for downloading the datasets used at each training stage of Groma, including Groma Instruct, a 30k visually grounded conversation dataset constructed with GPT-4V. You only need to download all of them if you want to train Groma from scratch. Please follow the instructions in DATA.md to prepare the datasets.

Training stage	Data types	Datasets
Detection pretraining	Detection	COCO, Objects365, OpenImages, V3Det, SA1B
Alignment pretraining	Image caption	ShareGPT-4V-PT
	Grounded caption	Flickr30k Entities
	Region caption	Visual Genome, RefCOCOg
	REC	COCO, RefCOCO/g/+, Grit-20m
Instruction finetuning	Grounded caption	Flickr30k Entities
	Region caption	Visual Genome, RefCOCOg
	REC	COCO, RefCOCO/g/+
	Instruction following	Groma Instruct, LLaVA Instruct, ShareGPT-4V

Training

For detection pretraining, please run

bash scripts/det_pretrain.sh {path_to_dinov2_ckpt} {output_dir}

For alignment pretraining, please run

bash scripts/vl_pretrain.sh {path_to_vicuna_ckpt} {path_to_groma_det_pretrain_ckpt} {output_dir}

For instruction finetuning, please run

bash scripts/vl_finetune.sh {path_to_groma_7b_pretrain_ckpt} {output_dir}
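The three stages form a chain, with each stage consuming a checkpoint produced by the previous one. The sketch below assembles the three commands in order. It assumes that the `{output_dir}` of one stage can be passed directly as the checkpoint path of the next, which the scripts may or may not support. The sub-directory names and the `training_commands` helper are hypothetical.

```python
import shlex


def training_commands(dinov2_ckpt, vicuna_ckpt, work_dir):
    """Return the three stage commands as argv lists, in execution order.

    Assumption: each stage's output directory is usable as the next
    stage's input checkpoint. The directory names are illustrative.
    """
    det_out = f"{work_dir}/det_pretrain"
    vl_out = f"{work_dir}/vl_pretrain"
    ft_out = f"{work_dir}/vl_finetune"
    return [
        ["bash", "scripts/det_pretrain.sh", dinov2_ckpt, det_out],
        ["bash", "scripts/vl_pretrain.sh", vicuna_ckpt, det_out, vl_out],
        ["bash", "scripts/vl_finetune.sh", vl_out, ft_out],
    ]


if __name__ == "__main__":
    # Print the commands for review instead of running them.
    for cmd in training_commands("ckpt/dinov2", "ckpt/vicuna", "outputs"):
        print(shlex.join(cmd))
```

To start from one of the provided intermediate checkpoints, drop the earlier commands and substitute the downloaded checkpoint path for the corresponding output directory.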

Inference

To test on a single image, run

python -m groma.eval.run_groma \
    --model-name {path_to_groma_7b_finetune} \
    --image-file {path_to_img} \
    --query {user_query}
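To run the same query over several images, the command above can be assembled programmatically. This is a minimal sketch that uses only the flags shown above. `build_inference_command` and `batch_commands` are hypothetical helpers, and the module path is passed in by the caller rather than hard-coded.

```python
import subprocess
import sys


def build_inference_command(module, model_path, image_file, query):
    """Build the argv list for one single-image run.

    `module` is the dotted module path given to `python -m` in the command above.
    """
    return [
        sys.executable, "-m", module,
        "--model-name", model_path,
        "--image-file", image_file,
        "--query", query,
    ]


def batch_commands(module, model_path, image_files, query):
    """One command per image, all sharing the same model and query."""
    return [build_inference_command(module, model_path, img, query) for img in image_files]


def run_all(commands):
    """Run the commands sequentially and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than as one shell string means a query containing spaces or quotes needs no extra escaping. Each call reloads the model, so this is convenient for a handful of images rather than efficient for large batches.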

Evaluation

For evaluation, please refer to EVAL.md.

Acknowledgement

Groma is built upon the awesome works LLaVA and GPT4ROI.

LICENSE

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Citation

If our work assists your research, feel free to give us a star ⭐ or cite us using:

@misc{Groma,
      title={Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models}, 
      author={Chuofan Ma and Yi Jiang and Jiannan Wu and Zehuan Yuan and Xiaojuan Qi},
      year={2024},
      eprint={2404.13013},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
