a tiny vision language model that kicks ass and runs anywhere
moondream1 is a 1.6B parameter model built using SigLIP, Phi-1.5, and the LLaVA training dataset. Weights are licensed under CC-BY-SA because the LLaVA dataset was used in training. Try it out on Hugging Face Spaces!
## Benchmarks
| Model | Parameters | VQAv2 | GQA | VizWiz | TextVQA |
| --- | --- | --- | --- | --- | --- |
| LLaVA-1.5 | 13.3B | 80.0 | 63.3 | 53.6 | 61.3 |
| LLaVA-1.5 | 7.3B | 78.5 | 62.0 | 50.0 | 58.2 |
| MC-LLaVA-3B | 3B | 64.2 | 49.6 | 24.9 | 38.6 |
| LLaVA-Phi | 3B | 71.4 | - | 35.9 | 48.6 |
| **moondream1** | 1.6B | 74.3 | 56.3 | 30.3 | 39.8 |
## Usage
Clone this repository and install the dependencies:

```bash
pip install -r requirements.txt
```
Use the `sample.py` script to run the model on CPU:

```bash
python sample.py --image [IMAGE_PATH] --prompt [PROMPT]
```
When the `--prompt` argument is not provided, the script will allow you to ask questions interactively.
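
If you would rather call the model from Python than through the CLI, the weights can in principle be loaded through Hugging Face `transformers`. The sketch below is a minimal example, assuming the checkpoint is published as `vikhyatk/moondream1` with custom remote code; the `encode_image` and `answer_question` helpers are assumptions about that remote code, not an API documented in this README.

```python
# Minimal sketch of programmatic use via Hugging Face transformers.
# Assumptions: the checkpoint id "vikhyatk/moondream1", and the
# encode_image/answer_question helpers exposed by its remote code.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vikhyatk/moondream1"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("example.jpg")      # hypothetical image path
enc_image = model.encode_image(image)  # assumed helper from remote code
print(model.answer_question(enc_image, "What is in this image?", tokenizer))
```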
## Gradio demo
Use the `gradio_demo.py` script to run the Gradio app:

```bash
python gradio_demo.py
```
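
This launches a local web UI (by default Gradio serves at http://127.0.0.1:7860) where you can upload an image and ask questions about it.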
## Limitations
- The model may generate inaccurate statements.
- It may struggle to adhere to intricate or nuanced instructions.
- It is primarily designed to understand English. Informal English, slang, and non-English languages may not work well.
- The model may not be free from societal biases. Users should be aware of this and exercise caution and critical thinking when using the model.
- The model may generate offensive, inappropriate, or hurtful content if it is prompted to do so.