
[WIP] Support VLM evaluation (#82)
* add vlm eval kit

* remove sys path

* [wip] use function call as default

* add swift example

* add eval local model

* add log, update config

* update summarizer

* update summarizer

* add CustomAPIModel support

* add install options

* rename config

* add unittest for vlmeval

* update README.md and configs

* update README.md, add more log
Yunnglin authored Jul 26, 2024
1 parent 639198a commit d15017a
Showing 15 changed files with 669 additions and 64 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -1,4 +1,8 @@
# Byte-compiled / optimized / DLL files
output/
test*.sh
test.ipynb
*.ttf
__pycache__/
*.py[cod]
*$py.class
140 changes: 116 additions & 24 deletions README.md
@@ -1,6 +1,22 @@
English | [简体中文](README_zh.md)

## Introduction
<p align="center">
<a href="https://pypi.org/project/llmuses"><img alt="PyPI - Downloads" src="https://img.shields.io/pypi/dm/llmuses">
</a>
<a href="https://github.com/modelscope/eval-scope/pulls"><img src="https://img.shields.io/badge/PR-welcome-55EB99.svg"></a>
</p>

## 📖 Table of Contents
- [Introduction](#introduction)
- [News](#news)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Dataset List](#datasets-list)
- [Leaderboard](#leaderboard)
- [Experiments and Results](#experiments-and-results)
- [Model Serving Performance Evaluation](#model-serving-performance-evaluation)

## 📝 Introduction

Large Language Model (LLM) evaluation has become a critical process for assessing and improving LLMs. To better support the evaluation of large models, we propose the Eval-Scope framework, which includes the following components and features:

@@ -15,14 +31,15 @@ Large Language Model (LLMs) evaluation has become a critical process for assessi
- Visualization tools
- Model Inference Performance Evaluation [Tutorial](llmuses/perf/README.md)
- Support for OpenCompass as an Evaluation Backend, featuring advanced encapsulation and task simplification to easily submit tasks to OpenCompass for evaluation.
- Support for VLMEvalKit as an evaluation backend, allowing VLMEvalKit's multimodal evaluation tasks to be initiated through Eval-Scope, with support for a variety of multimodal models and datasets.
- Full pipeline support: Seamlessly integrate with SWIFT to easily train and deploy model services, initiate evaluation tasks, view evaluation reports, and achieve an end-to-end large model development process.


Features
**Features**
- Lightweight, minimizing unnecessary abstractions and configurations
- Easy to customize
- New datasets can be integrated by simply implementing a single class
- Models can be hosted on ModelScope, and evaluations can be initiated with just a model id
- Models can be hosted on [ModelScope](https://modelscope.cn), and evaluations can be initiated with just a model id
- Supports deployment of locally hosted models
- Visualization of evaluation reports
- Rich evaluation metrics
@@ -31,14 +48,15 @@ Features
- Pairwise-baseline mode: Comparison with baseline models
- Pairwise (all) mode: Pairwise comparison of all models

## 🎉 News
- **[2024.07.26]:** Supports **VLMEvalKit** as a third-party evaluation framework, initiating multimodal model evaluation tasks. [User Guide](#vlmevalkit-evaluation-backend) 🔥🔥🔥
- **[2024.06.29]:** Supports **OpenCompass** as a third-party evaluation framework. We have provided a high-level wrapper, supporting installation via pip and simplifying the evaluation task configuration. [User Guide](#opencompass-evaluation-backend) 🔥🔥🔥
- **[2024.06.13]** Eval-Scope has been updated to version 0.3.x, which supports the ModelScope SWIFT framework for LLMs evaluation. 🚀🚀🚀
- **[2024.06.13]** We have supported the ToolBench as a third-party evaluation backend for Agents evaluation. 🚀🚀🚀



## 🛠️ Installation
### Install with pip
1. create conda environment
```shell
@@ -64,7 +82,7 @@ pip install -e .
```


## 🚀 Quick Start

### Simple Evaluation
command line with pip installation:
@@ -82,7 +100,7 @@ Parameters:
```shell
python llmuses/run.py --model ZhipuAI/chatglm3-6b --template-type chatglm3 --model-args revision=v1.0.2,precision=torch.float16,device_map=auto --datasets mmlu ceval --use-cache true --limit 10
```
```shell
python llmuses/run.py --model qwen/Qwen-1_8B --generation-config do_sample=false,temperature=0.0 --datasets ceval --dataset-args '{"ceval": {"few_shot_num": 0, "few_shot_random": false}}' --limit 10
```
Parameters:
@@ -104,27 +122,25 @@ print(TemplateType.get_template_name_list())

### Evaluation Backend
Eval-Scope supports using third-party evaluation frameworks to initiate evaluation tasks, which we call an Evaluation Backend. The currently supported Evaluation Backends include:
- **Native**: Eval-Scope's own **default evaluation framework**, supporting various evaluation modes including single model evaluation, arena mode, and baseline model comparison mode.
- [OpenCompass](https://github.com/open-compass/opencompass): Initiate OpenCompass evaluation tasks through Eval-Scope. Lightweight, easy to customize, supports seamless integration with the LLM fine-tuning framework [ModelScope Swift](https://github.com/modelscope/swift).
- [VLMEvalKit](https://github.com/open-compass/VLMEvalKit): Initiate VLMEvalKit multimodal evaluation tasks through Eval-Scope. Supports various multimodal models and datasets, and offers seamless integration with the LLM fine-tuning framework [ModelScope Swift](https://github.com/modelscope/swift).
- **ThirdParty**: The third-party task, e.g. [ToolBench](llmuses/thirdparty/toolbench/README.md), you can contribute your own evaluation task to Eval-Scope as third-party backend.

#### OpenCompass Evaluation Backend

To facilitate the use of the OpenCompass evaluation backend, we have customized the OpenCompass source code and named it `ms-opencompass`. This version optimizes the configuration and execution of evaluation tasks relative to the original and supports installation via PyPI, allowing users to initiate lightweight OpenCompass evaluation tasks through Eval-Scope. Additionally, we provide initial support for API-based evaluation tasks in the OpenAI API format. You can deploy model services using [ModelScope Swift](https://github.com/modelscope/swift), whose [swift deploy](https://swift.readthedocs.io/en/latest/LLM/VLLM-inference-acceleration-and-deployment.html) command supports using vLLM to launch model inference services.


##### Installation
```shell
# Install with extra option
pip install llmuses[opencompass]
```

##### Data Preparation
Available datasets from OpenCompass backend:
```text
'obqa', 'AX_b', 'siqa', 'nq', 'mbpp', 'winogrande', 'mmlu', 'BoolQ', 'cluewsc', 'ocnli', 'lambada', 'CMRC', 'ceval', 'csl', 'cmnli', 'bbh', 'ReCoRD', 'math', 'humaneval', 'eprstmt', 'WSC', 'storycloze', 'MultiRC', 'RTE', 'chid', 'gsm8k', 'AX_g', 'bustm', 'afqmc', 'piqa', 'lcsts', 'strategyqa', 'Xsum', 'agieval', 'ocnli_fc', 'C3', 'tnews', 'race', 'triviaqa', 'CB', 'WiC', 'hellaswag', 'summedits', 'GaokaoBench', 'ARC_e', 'COPA', 'ARC_c', 'DRCD'
```
Refer to [OpenCompass datasets](https://hub.opencompass.org.cn/home)
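
If you prefer to query this list programmatically, the backend manager can be asked for it. A minimal sketch, assuming the OpenCompass backend manager exposes a `list_datasets()` helper (mirroring the VLMEvalKit snippet further below):
```python
# Minimal sketch, assuming OpenCompassBackendManager provides list_datasets();
# check llmuses/backend/opencompass for the actual interface.
from llmuses.backend.opencompass import OpenCompassBackendManager

print(f'** All datasets from OpenCompass backend: {OpenCompassBackendManager.list_datasets()}')
```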
@@ -149,8 +165,8 @@ Dataset download:
Unzip the file and set the path to the `data` directory in current work directory.


##### Model Serving
We use ModelScope Swift to deploy model services, see: [ModelScope Swift](https://swift.readthedocs.io/en/latest/LLM/VLLM-inference-acceleration-and-deployment.html)
```shell
# Install ms-swift
pip install ms-swift
@@ -160,13 +176,89 @@ CUDA_VISIBLE_DEVICES=0 swift deploy --model_type llama3-8b-instruct --port 8000
```


##### Model Evaluation

Refer to example: [example_eval_swift_openai_api](examples/example_eval_swift_openai_api.py) to configure and execute the evaluation task:
```shell
python examples/example_eval_swift_openai_api.py
```
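
The example script is the authoritative reference; as a rough orientation, the general shape of such a task is a configuration dict handed to Eval-Scope's task runner. The sketch below is hedged: the `run_task` entry point and the key names (`eval_backend`, `eval_config`, the model fields) are assumptions and should be checked against [example_eval_swift_openai_api](examples/example_eval_swift_openai_api.py).
```python
# Hedged sketch only; key names and the run_task entry point are assumptions,
# see examples/example_eval_swift_openai_api.py for the authoritative version.
from llmuses.run import run_task

task_cfg = {
    'eval_backend': 'OpenCompass',
    'eval_config': {
        'datasets': ['mmlu', 'ceval'],
        'models': [
            {
                # Model served via `swift deploy` (OpenAI-API-compatible endpoint)
                'path': 'llama3-8b-instruct',
                'openai_api_base': 'http://127.0.0.1:8000/v1/chat/completions',
                'is_chat': True,
                'batch_size': 16,
            },
        ],
        'limit': 10,  # evaluate only a few samples as a smoke test
    },
}

run_task(task_cfg=task_cfg)
```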

#### VLMEvalKit Evaluation Backend

To facilitate the use of the VLMEvalKit evaluation backend, we have customized the VLMEvalKit source code and named it `ms-vlmeval`. This version encapsulates the configuration and execution of evaluation tasks based on the original version and supports installation via PyPI, allowing users to initiate lightweight VLMEvalKit evaluation tasks through Eval-Scope. Additionally, we support API-based evaluation tasks in the OpenAI API format. You can deploy multimodal model services using [ModelScope Swift](https://github.com/modelscope/swift).

##### Installation
```shell
# Install with additional options
pip install llmuses[vlmeval]
```

##### Data Preparation
Currently supported datasets include:
```text
'COCO_VAL', 'MME', 'HallusionBench', 'POPE', 'MMBench_DEV_EN', 'MMBench_TEST_EN', 'MMBench_DEV_CN', 'MMBench_TEST_CN', 'MMBench', 'MMBench_CN', 'MMBench_DEV_EN_V11', 'MMBench_TEST_EN_V11', 'MMBench_DEV_CN_V11', 'MMBench_TEST_CN_V11', 'MMBench_V11', 'MMBench_CN_V11', 'SEEDBench_IMG', 'SEEDBench2', 'SEEDBench2_Plus', 'ScienceQA_VAL', 'ScienceQA_TEST', 'MMT-Bench_ALL_MI', 'MMT-Bench_ALL', 'MMT-Bench_VAL_MI', 'MMT-Bench_VAL', 'AesBench_VAL', 'AesBench_TEST', 'CCBench', 'AI2D_TEST', 'MMStar', 'RealWorldQA', 'MLLMGuard_DS', 'BLINK', 'OCRVQA_TEST', 'OCRVQA_TESTCORE', 'TextVQA_VAL', 'DocVQA_VAL', 'DocVQA_TEST', 'InfoVQA_VAL', 'InfoVQA_TEST', 'ChartQA_VAL', 'ChartQA_TEST', 'MathVision', 'MathVision_MINI', 'MMMU_DEV_VAL', 'MMMU_TEST', 'OCRBench', 'MathVista_MINI', 'LLaVABench', 'MMVet', 'MTVQA_TEST', 'MMLongBench_DOC', 'VCR_EN_EASY_500', 'VCR_EN_EASY_100', 'VCR_EN_EASY_ALL', 'VCR_EN_HARD_500', 'VCR_EN_HARD_100', 'VCR_EN_HARD_ALL', 'VCR_ZH_EASY_500', 'VCR_ZH_EASY_100', 'VCR_ZH_EASY_ALL', 'VCR_ZH_HARD_500', 'VCR_ZH_HARD_100', 'VCR_ZH_HARD_ALL', 'MMBench-Video', 'Video-MME', 'MMBench_DEV_EN', 'MMBench_TEST_EN', 'MMBench_DEV_CN', 'MMBench_TEST_CN', 'MMBench', 'MMBench_CN', 'MMBench_DEV_EN_V11', 'MMBench_TEST_EN_V11', 'MMBench_DEV_CN_V11', 'MMBench_TEST_CN_V11', 'MMBench_V11', 'MMBench_CN_V11', 'SEEDBench_IMG', 'SEEDBench2', 'SEEDBench2_Plus', 'ScienceQA_VAL', 'ScienceQA_TEST', 'MMT-Bench_ALL_MI', 'MMT-Bench_ALL', 'MMT-Bench_VAL_MI', 'MMT-Bench_VAL', 'AesBench_VAL', 'AesBench_TEST', 'CCBench', 'AI2D_TEST', 'MMStar', 'RealWorldQA', 'MLLMGuard_DS', 'BLINK'
```
For detailed information about the datasets, please refer to [VLMEvalKit Supported Multimodal Evaluation Sets](https://github.com/open-compass/VLMEvalKit/tree/main#-datasets-models-and-evaluation-results).

You can use the following to view the list of supported models:
```python
from llmuses.backend.vlm_eval_kit import VLMEvalKitBackendManager
print(f'** All models from VLMEvalKit backend: {VLMEvalKitBackendManager.list_supported_VLMs().keys()}')
```
If the dataset file does not exist locally when loading the dataset, it will be automatically downloaded to the `~/LMUData/` directory.


##### Model Evaluation
There are two ways to evaluate the model:

###### 1. ModelScope Swift Deployment for Model Evaluation
**Model Deployment**
Deploy the model service using ModelScope Swift. For detailed instructions, refer to: [ModelScope Swift MLLM Deployment Guide](https://swift.readthedocs.io/en/latest/Multi-Modal/mutlimodal-deployment.html)
```shell
# Install ms-swift
pip install ms-swift
# Deploy the qwen-vl-chat multi-modal model service
CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen-vl-chat --model_id_or_path models/Qwen-VL-Chat
```
**Model Evaluation**
Refer to the example file: [example_eval_vlm_swift](examples/example_eval_vlm_swift.py) to configure the evaluation task.
Execute the evaluation task:
```shell
python examples/example_eval_vlm_swift.py
```
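
For orientation, here is a hedged sketch of what such a task configuration might look like when the multimodal model is served through the OpenAI-API-compatible endpoint; the field names (`CustomAPIModel`, `type`, `api_base`, and so on) are assumptions, and the example file remains the authoritative reference.
```python
# Hedged sketch only; field names are assumptions,
# see examples/example_eval_vlm_swift.py for the authoritative version.
from llmuses.run import run_task

task_cfg = {
    'eval_backend': 'VLMEvalKit',
    'eval_config': {
        'data': ['SEEDBench_IMG', 'ChartQA_TEST'],  # dataset names from the list above
        'model': [
            {
                'name': 'CustomAPIModel',   # OpenAI-API-format model wrapper
                'type': 'qwen-vl-chat',     # the served model type
                'api_base': 'http://localhost:8000/v1/chat/completions',
                'key': 'EMPTY',
                'temperature': 0.0,
            },
        ],
        'limit': 20,          # small sample count for a quick check
        'work_dir': 'output',
    },
}

run_task(task_cfg=task_cfg)
```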

###### 2. Local Model Inference Evaluation
**Model Inference Evaluation**
Skip the model service deployment and perform inference directly on the local machine. Refer to the example file: [example_eval_vlm_local](examples/example_eval_vlm_local.py) to configure the evaluation task.
Execute the evaluation task:
```shell
python examples/example_eval_vlm_local.py
```
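
The local variant differs mainly in the model entry, which points at a locally loadable checkpoint instead of an API endpoint. A hedged sketch (field names are assumptions; see the example file):
```python
# Hedged sketch only; field names are assumptions,
# see examples/example_eval_vlm_local.py for the authoritative version.
from llmuses.run import run_task

task_cfg = {
    'eval_backend': 'VLMEvalKit',
    'eval_config': {
        'data': ['SEEDBench_IMG'],
        'model': [
            {
                'name': 'qwen_chat',                  # assumed VLMEvalKit model name
                'model_path': 'models/Qwen-VL-Chat',  # local checkpoint directory
            },
        ],
        'limit': 20,
        'work_dir': 'output',
    },
}

run_task(task_cfg=task_cfg)
```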


##### (Optional) Deploy Judge Model
Deploy the local language model as a judge/extractor using ModelScope Swift. For details, refer to: [ModelScope Swift LLM Deployment Guide](https://swift.readthedocs.io/en/latest/LLM/VLLM-inference-acceleration-and-deployment.html). If no judge model is deployed, exact matching will be used.

```shell
# Deploy qwen2-7b as a judge
CUDA_VISIBLE_DEVICES=1 swift deploy --model_type qwen2-7b-instruct --model_id_or_path models/Qwen2-7B-Instruct --port 8866
```

You **must** configure the following environment variables for the judge model to be invoked correctly:
```shell
OPENAI_API_KEY=EMPTY
OPENAI_API_BASE=http://127.0.0.1:8866/v1/chat/completions # api_base for the judge model
LOCAL_LLM=qwen2-7b-instruct # model_id for the judge model
```
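
If you launch the evaluation from a Python script rather than a shell, the same variables can be set on the process environment before the task runs; a minimal sketch using only the standard library:
```python
import os

# Point the judge/extractor calls at the locally deployed model (values from above).
os.environ['OPENAI_API_KEY'] = 'EMPTY'
os.environ['OPENAI_API_BASE'] = 'http://127.0.0.1:8866/v1/chat/completions'
os.environ['LOCAL_LLM'] = 'qwen2-7b-instruct'
```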
##### Model Evaluation
Refer to the example file: [example_eval_vlm_swift](examples/example_eval_vlm_swift.py) to configure the evaluation task.
Execute the evaluation task:
```shell
python examples/example_eval_vlm_swift.py
```


### Local Dataset
