- We evaluate various models, including LLMs and LMMs.
- For each type, we consider both closed- and open-source models.
- Our evaluation is conducted in a zero-shot setting to assess the ability of models to generate accurate answers without fine-tuning or few-shot demonstrations on our benchmark.
- For all models, we use the default prompt provided by each model for multiple-choice or open QA, if available.
- If a model does not provide prompts for the task types in MMMU, we perform prompt engineering on the validation set and use the most effective prompt for the subsequent zero-shot experiments (a prompt-formatting sketch follows this list).
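To make the zero-shot multiple-choice setup concrete, the sketch below shows one way to format an MMMU-style question and its options into a single prompt. The function name, template wording, and example question are illustrative assumptions, not the exact prompts used for any particular model; in practice each model is queried with its own default or engineered prompt as described above.

```python
# Minimal sketch of zero-shot multiple-choice prompt formatting.
# The template wording and the example question are illustrative assumptions.

from typing import List


def build_multiple_choice_prompt(question: str, options: List[str]) -> str:
    """Format an MMMU-style multiple-choice question as a single zero-shot prompt."""
    letters = "ABCDEFGH"
    lines = [question]
    # Enumerate the candidate answers as lettered options.
    for letter, option in zip(letters, options):
        lines.append(f"({letter}) {option}")
    # Ask for a bare option letter so the answer can be parsed automatically.
    lines.append("Answer with the option's letter from the given choices directly.")
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = build_multiple_choice_prompt(
        "Which structure is indicated by the arrow in <image 1>?",  # hypothetical example
        ["Left atrium", "Right ventricle", "Aorta", "Pulmonary artery"],
    )
    print(prompt)
```

Requesting only the option letter keeps the model output easy to parse, which matters when scoring many models automatically across thousands of questions.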
| Model | Overall | Art & Design | Business | Science | Health & Medicine | Human. & Social Sci. | Tech & Eng. |
|---|---|---|---|---|---|---|---|
| GPT-4V(ision) (Playground) | **55.7** | **65.3** | **64.3** | **48.4** | **63.5** | **76.3** | **41.7** |
| Qwen-VL-PLUS* | <u>40.8</u> | <u>59.9</u> | <u>34.5</u> | <u>32.8</u> | <u>43.7</u> | <u>65.5</u> | <u>32.9</u> |
| BLIP-2 FLAN-T5-XXL | 34.0 | 49.2 | 28.6 | 27.3 | 33.7 | 51.5 | 30.4 |
| InstructBLIP-T5-XXL | 33.8 | 48.5 | 30.6 | 27.6 | 33.6 | 49.8 | 29.4 |
| LLaVA-1.5-13B | 33.6 | 49.8 | 28.2 | 25.9 | 34.9 | 54.7 | 28.3 |
| Qwen-VL-7B | 32.9 | 47.7 | 29.8 | 25.6 | 33.6 | 45.3 | 30.2 |
| mPLUG-OWL2* | 32.1 | 48.5 | 25.6 | 24.9 | 32.8 | 46.7 | 29.6 |
| BLIP-2 FLAN-T5-XL | 31.0 | 43.0 | 25.6 | 25.1 | 31.8 | 48.0 | 27.8 |
| InstructBLIP-T5-XL | 30.6 | 43.3 | 25.2 | 25.2 | 29.3 | 45.8 | 28.6 |
| CogVLM | 30.1 | 38.0 | 25.6 | 25.1 | 31.2 | 41.5 | 28.9 |
| Otter | 29.1 | 37.4 | 24.0 | 24.1 | 29.6 | 35.9 | 30.2 |
| LLaMA-Adapter2-7B | 27.7 | 35.2 | 25.4 | 25.6 | 30.0 | 29.1 | 25.7 |
| MiniGPT4-Vicuna-13B | 27.6 | 30.2 | 27.0 | 26.2 | 26.9 | 30.9 | 27.2 |
| Fuyu-8B | 27.4 | 29.9 | 27.0 | 25.6 | 27.0 | 32.5 | 26.4 |
| Kosmos2 | 26.6 | 28.8 | 23.7 | 26.6 | 27.2 | 26.3 | 26.8 |
| OpenFlamingo2-9B | 26.3 | 31.7 | 23.5 | 26.3 | 26.3 | 27.9 | 25.1 |
| Frequent Choice | 25.8 | 26.7 | 28.4 | 24.0 | 24.4 | 25.2 | 26.5 |
| Random Choice | 23.9 | 24.1 | 24.9 | 21.6 | 25.3 | 22.8 | 24.8 |
Overall results of different models on the MMMU test set. The best-performing model in each category is in bold, and the second best is underlined. *: results provided by the authors.