
[WIP] Support VLM evaluation (#82)
* add vlm eval kit

* remove sys path

* [wip] use function call as default

* add swift example

* add eval local model

* add log, update config

* update summarizer

* update summarizer

* add CustomAPIModel support

* add install options

* rename config

* add unittest for vlmeval

* update README.md and configs

* update README.md, add more log
Yunnglin authored Jul 26, 2024
1 parent 639198a commit d15017a
Showing 15 changed files with 669 additions and 64 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -1,4 +1,8 @@
# Byte-compiled / optimized / DLL files
output/
test*.sh
test.ipynb
*.ttf
__pycache__/
*.py[cod]
*$py.class
140 changes: 116 additions & 24 deletions README.md
@@ -1,6 +1,22 @@
English | [简体中文](README_zh.md)

## Introduction
<p align="center">
<a href="https://pypi.org/project/llmuses"><img alt="PyPI - Downloads" src="https://img.shields.io/pypi/dm/llmuses">
</a>
<a href="https://github.com/modelscope/eval-scope/pulls"><img src="https://img.shields.io/badge/PR-welcome-55EB99.svg"></a>
</p>

## 📖 Table of Contents
- [Introduction](#introduction)
- [News](#news)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Dataset List](#datasets-list)
- [Leaderboard](#leaderboard)
- [Experiments and Results](#experiments-and-results)
- [Model Serving Performance Evaluation](#model-serving-performance-evaluation)

## 📝 Introduction

Large Language Model (LLM) evaluation has become a critical process for assessing and improving LLMs. To better support the evaluation of large models, we propose the Eval-Scope framework, which includes the following components and features:

@@ -15,14 +31,15 @@ Large Language Model (LLMs) evaluation has become a critical process for assessi
- Visualization tools
- Model Inference Performance Evaluation [Tutorial](llmuses/perf/README.md)
- Support for OpenCompass as an Evaluation Backend, featuring advanced encapsulation and task simplification to easily submit tasks to OpenCompass for evaluation.
- Support for VLMEvalKit as an evaluation backend, allowing VLMEvalKit's multimodal evaluation tasks to be initiated through Eval-Scope, with support for a variety of multimodal models and datasets.
- Full pipeline support: Seamlessly integrate with SWIFT to easily train and deploy model services, initiate evaluation tasks, view evaluation reports, and achieve an end-to-end large model development process.


Features
**Features**
- Lightweight, minimizing unnecessary abstractions and configurations
- Easy to customize
- New datasets can be integrated by simply implementing a single class
- Models can be hosted on ModelScope, and evaluations can be initiated with just a model id
- Models can be hosted on [ModelScope](https://modelscope.cn), and evaluations can be initiated with just a model id
- Supports deployment of locally hosted models
- Visualization of evaluation reports
- Rich evaluation metrics
@@ -31,14 +48,15 @@ Features
- Pairwise-baseline mode: Comparison with baseline models
- Pairwise (all) mode: Pairwise comparison of all models

## 🎉 News
- **[2024.07.26]:** Supports **VLMEvalKit** as a third-party evaluation framework, initiating multimodal model evaluation tasks. [User Guide](#vlmevalkit-evaluation-backend) 🔥🔥🔥
- **[2024.06.29]:** Supports **OpenCompass** as a third-party evaluation framework. We have provided a high-level wrapper, supporting installation via pip and simplifying the evaluation task configuration. [User Guide](#opencompass-evaluation-backend) 🔥🔥🔥
- **[2024.06.13]** Eval-Scope has been updated to version 0.3.x, which supports the ModelScope SWIFT framework for LLMs evaluation. 🚀🚀🚀
- **[2024.06.13]** We have supported the ToolBench as a third-party evaluation backend for Agents evaluation. 🚀🚀🚀



## 🛠️ Installation
### Install with pip
1. create conda environment
```shell
@@ -64,7 +82,7 @@ pip install -e .
```


## 🚀 Quick Start

### Simple Evaluation
command line with pip installation:
@@ -82,7 +100,7 @@ Parameters:
```shell
python llmuses/run.py --model ZhipuAI/chatglm3-6b --template-type chatglm3 --model-args revision=v1.0.2,precision=torch.float16,device_map=auto --datasets mmlu ceval --use-cache true --limit 10
```
```shell
python llmuses/run.py --model qwen/Qwen-1_8B --generation-config do_sample=false,temperature=0.0 --datasets ceval --dataset-args '{"ceval": {"few_shot_num": 0, "few_shot_random": false}}' --limit 10
```
Parameters:
@@ -104,27 +122,25 @@ print(TemplateType.get_template_name_list())

### Evaluation Backend
Eval-Scope supports using third-party evaluation frameworks to initiate evaluation tasks, which we call an Evaluation Backend. The currently supported Evaluation Backends include:
- **Native**: Eval-Scope's own **default evaluation framework**, supporting various evaluation modes including single model evaluation, arena mode, and baseline model comparison mode.
- [OpenCompass](https://github.com/open-compass/opencompass): Initiate OpenCompass evaluation tasks through Eval-Scope. Lightweight, easy to customize, supports seamless integration with the LLM fine-tuning framework [ModelScope Swift](https://github.com/modelscope/swift).
- [VLMEvalKit](https://github.com/open-compass/VLMEvalKit): Initiate VLMEvalKit multimodal evaluation tasks through Eval-Scope. Supports various multimodal models and datasets, and offers seamless integration with the LLM fine-tuning framework [ModelScope Swift](https://github.com/modelscope/swift).
- **ThirdParty**: The third-party task, e.g. [ToolBench](llmuses/thirdparty/toolbench/README.md), you can contribute your own evaluation task to Eval-Scope as third-party backend.

#### OpenCompass Evaluation Backend

To facilitate the use of the OpenCompass evaluation backend, we have customized the OpenCompass source code and named it `ms-opencompass`. This version optimizes the configuration and execution of evaluation tasks relative to the original and supports installation via PyPI, allowing users to initiate lightweight OpenCompass evaluation tasks through Eval-Scope. Additionally, we provide initial support for API-based evaluation tasks in the OpenAI API format. You can deploy model services using [ModelScope Swift](https://github.com/modelscope/swift), whose [swift deploy](https://swift.readthedocs.io/en/latest/LLM/VLLM-inference-acceleration-and-deployment.html) command supports using vLLM to launch model inference services.


##### Installation
```shell
# Install with extra option
pip install llmuses[opencompass]
```

##### Data Preparation
Available datasets from OpenCompass backend:
```text
'obqa', 'AX_b', 'siqa', 'nq', 'mbpp', 'winogrande', 'mmlu', 'BoolQ', 'cluewsc', 'ocnli', 'lambada', 'CMRC', 'ceval', 'csl', 'cmnli', 'bbh', 'ReCoRD', 'math', 'humaneval', 'eprstmt', 'WSC', 'storycloze', 'MultiRC', 'RTE', 'chid', 'gsm8k', 'AX_g', 'bustm', 'afqmc', 'piqa', 'lcsts', 'strategyqa', 'Xsum', 'agieval', 'ocnli_fc', 'C3', 'tnews', 'race', 'triviaqa', 'CB', 'WiC', 'hellaswag', 'summedits', 'GaokaoBench', 'ARC_e', 'COPA', 'ARC_c', 'DRCD'
```
Refer to [OpenCompass datasets](https://hub.opencompass.org.cn/home)
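
If you prefer to query this list programmatically, the backend manager can be asked for it. A minimal sketch, assuming the OpenCompass backend manager exposes a `list_datasets()` helper (mirroring the VLMEvalKit snippet further below):
```python
# Minimal sketch, assuming OpenCompassBackendManager provides list_datasets();
# check llmuses/backend/opencompass for the actual interface.
from llmuses.backend.opencompass import OpenCompassBackendManager

print(f'** All datasets from OpenCompass backend: {OpenCompassBackendManager.list_datasets()}')
```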
@@ -149,8 +165,8 @@ Dataset download:
Unzip the file and set the path to the `data` directory in current work directory.


##### Model Serving
We use ModelScope Swift to deploy model services, see: [ModelScope Swift](https://swift.readthedocs.io/en/latest/LLM/VLLM-inference-acceleration-and-deployment.html)
```shell
# Install ms-swift
pip install ms-swift
@@ -160,13 +176,89 @@ CUDA_VISIBLE_DEVICES=0 swift deploy --model_type llama3-8b-instruct --port 8000
```


##### Model Evaluation

Refer to example: [example_eval_swift_openai_api](examples/example_eval_swift_openai_api.py) to configure and execute the evaluation task:
```shell
python examples/example_eval_swift_openai_api.py
```
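
The example script is the authoritative reference; as a rough orientation, the general shape of such a task is a configuration dict handed to Eval-Scope's task runner. The sketch below is hedged: the `run_task` entry point and the key names (`eval_backend`, `eval_config`, the model fields) are assumptions and should be checked against [example_eval_swift_openai_api](examples/example_eval_swift_openai_api.py).
```python
# Hedged sketch only; key names and the run_task entry point are assumptions,
# see examples/example_eval_swift_openai_api.py for the authoritative version.
from llmuses.run import run_task

task_cfg = {
    'eval_backend': 'OpenCompass',
    'eval_config': {
        'datasets': ['mmlu', 'ceval'],
        'models': [
            {
                # Model served via `swift deploy` (OpenAI-API-compatible endpoint)
                'path': 'llama3-8b-instruct',
                'openai_api_base': 'http://127.0.0.1:8000/v1/chat/completions',
                'is_chat': True,
                'batch_size': 16,
            },
        ],
        'limit': 10,  # evaluate only a few samples as a smoke test
    },
}

run_task(task_cfg=task_cfg)
```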

#### VLMEvalKit Evaluation Backend

To facilitate the use of the VLMEvalKit evaluation backend, we have customized the VLMEvalKit source code and named it `ms-vlmeval`. This version encapsulates the configuration and execution of evaluation tasks based on the original version and supports installation via PyPI, allowing users to initiate lightweight VLMEvalKit evaluation tasks through Eval-Scope. Additionally, we support API-based evaluation tasks in the OpenAI API format. You can deploy multimodal model services using [ModelScope Swift](https://github.com/modelscope/swift).

##### Installation
```shell
# Install with additional options
pip install llmuses[vlmeval]
```

##### Data Preparation
Currently supported datasets include:
```text
'COCO_VAL', 'MME', 'HallusionBench', 'POPE', 'MMBench_DEV_EN', 'MMBench_TEST_EN', 'MMBench_DEV_CN', 'MMBench_TEST_CN', 'MMBench', 'MMBench_CN', 'MMBench_DEV_EN_V11', 'MMBench_TEST_EN_V11', 'MMBench_DEV_CN_V11', 'MMBench_TEST_CN_V11', 'MMBench_V11', 'MMBench_CN_V11', 'SEEDBench_IMG', 'SEEDBench2', 'SEEDBench2_Plus', 'ScienceQA_VAL', 'ScienceQA_TEST', 'MMT-Bench_ALL_MI', 'MMT-Bench_ALL', 'MMT-Bench_VAL_MI', 'MMT-Bench_VAL', 'AesBench_VAL', 'AesBench_TEST', 'CCBench', 'AI2D_TEST', 'MMStar', 'RealWorldQA', 'MLLMGuard_DS', 'BLINK', 'OCRVQA_TEST', 'OCRVQA_TESTCORE', 'TextVQA_VAL', 'DocVQA_VAL', 'DocVQA_TEST', 'InfoVQA_VAL', 'InfoVQA_TEST', 'ChartQA_VAL', 'ChartQA_TEST', 'MathVision', 'MathVision_MINI', 'MMMU_DEV_VAL', 'MMMU_TEST', 'OCRBench', 'MathVista_MINI', 'LLaVABench', 'MMVet', 'MTVQA_TEST', 'MMLongBench_DOC', 'VCR_EN_EASY_500', 'VCR_EN_EASY_100', 'VCR_EN_EASY_ALL', 'VCR_EN_HARD_500', 'VCR_EN_HARD_100', 'VCR_EN_HARD_ALL', 'VCR_ZH_EASY_500', 'VCR_ZH_EASY_100', 'VCR_ZH_EASY_ALL', 'VCR_ZH_HARD_500', 'VCR_ZH_HARD_100', 'VCR_ZH_HARD_ALL', 'MMBench-Video', 'Video-MME', 'MMBench_DEV_EN', 'MMBench_TEST_EN', 'MMBench_DEV_CN', 'MMBench_TEST_CN', 'MMBench', 'MMBench_CN', 'MMBench_DEV_EN_V11', 'MMBench_TEST_EN_V11', 'MMBench_DEV_CN_V11', 'MMBench_TEST_CN_V11', 'MMBench_V11', 'MMBench_CN_V11', 'SEEDBench_IMG', 'SEEDBench2', 'SEEDBench2_Plus', 'ScienceQA_VAL', 'ScienceQA_TEST', 'MMT-Bench_ALL_MI', 'MMT-Bench_ALL', 'MMT-Bench_VAL_MI', 'MMT-Bench_VAL', 'AesBench_VAL', 'AesBench_TEST', 'CCBench', 'AI2D_TEST', 'MMStar', 'RealWorldQA', 'MLLMGuard_DS', 'BLINK'
```
For detailed information about the datasets, please refer to [VLMEvalKit Supported Multimodal Evaluation Sets](https://github.com/open-compass/VLMEvalKit/tree/main#-datasets-models-and-evaluation-results).

You can use the following to view the list of supported models:
```python
from llmuses.backend.vlm_eval_kit import VLMEvalKitBackendManager
print(f'** All models from VLMEvalKit backend: {VLMEvalKitBackendManager.list_supported_VLMs().keys()}')
```
If the dataset file does not exist locally when loading the dataset, it will be automatically downloaded to the `~/LMUData/` directory.


##### Model Evaluation
There are two ways to evaluate the model:

###### 1. ModelScope Swift Deployment for Model Evaluation
**Model Deployment**
Deploy the model service using ModelScope Swift. For detailed instructions, refer to: [ModelScope Swift MLLM Deployment Guide](https://swift.readthedocs.io/en/latest/Multi-Modal/mutlimodal-deployment.html)
```shell
# Install ms-swift
pip install ms-swift
# Deploy the qwen-vl-chat multi-modal model service
CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen-vl-chat --model_id_or_path models/Qwen-VL-Chat
```
**Model Evaluation**
Refer to the example file: [example_eval_vlm_swift](examples/example_eval_vlm_swift.py) to configure the evaluation task.
Execute the evaluation task:
```shell
python examples/example_eval_vlm_swift.py
```
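
For orientation, here is a hedged sketch of what such a task configuration might look like when the multimodal model is served through the OpenAI-API-compatible endpoint; the field names (`CustomAPIModel`, `type`, `api_base`, and so on) are assumptions, and the example file remains the authoritative reference.
```python
# Hedged sketch only; field names are assumptions,
# see examples/example_eval_vlm_swift.py for the authoritative version.
from llmuses.run import run_task

task_cfg = {
    'eval_backend': 'VLMEvalKit',
    'eval_config': {
        'data': ['SEEDBench_IMG', 'ChartQA_TEST'],  # dataset names from the list above
        'model': [
            {
                'name': 'CustomAPIModel',   # OpenAI-API-format model wrapper
                'type': 'qwen-vl-chat',     # the served model type
                'api_base': 'http://localhost:8000/v1/chat/completions',
                'key': 'EMPTY',
                'temperature': 0.0,
            },
        ],
        'limit': 20,          # small sample count for a quick check
        'work_dir': 'output',
    },
}

run_task(task_cfg=task_cfg)
```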

###### 2. Local Model Inference Evaluation
**Model Inference Evaluation**
Skip the model service deployment and perform inference directly on the local machine. Refer to the example file: [example_eval_vlm_local](examples/example_eval_vlm_local.py) to configure the evaluation task.
Execute the evaluation task:
```shell
python examples/example_eval_vlm_local.py
```
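
The local variant differs mainly in the model entry, which points at a locally loadable checkpoint instead of an API endpoint. A hedged sketch (field names are assumptions; see the example file):
```python
# Hedged sketch only; field names are assumptions,
# see examples/example_eval_vlm_local.py for the authoritative version.
from llmuses.run import run_task

task_cfg = {
    'eval_backend': 'VLMEvalKit',
    'eval_config': {
        'data': ['SEEDBench_IMG'],
        'model': [
            {
                'name': 'qwen_chat',                  # assumed VLMEvalKit model name
                'model_path': 'models/Qwen-VL-Chat',  # local checkpoint directory
            },
        ],
        'limit': 20,
        'work_dir': 'output',
    },
}

run_task(task_cfg=task_cfg)
```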


##### (Optional) Deploy Judge Model
Deploy the local language model as a judge/extractor using ModelScope Swift. For details, refer to: [ModelScope Swift LLM Deployment Guide](https://swift.readthedocs.io/en/latest/LLM/VLLM-inference-acceleration-and-deployment.html). If no judge model is deployed, exact matching will be used.

```shell
# Deploy qwen2-7b as a judge
CUDA_VISIBLE_DEVICES=1 swift deploy --model_type qwen2-7b-instruct --model_id_or_path models/Qwen2-7B-Instruct --port 8866
```

You **must** configure the following environment variables for the judge model to be invoked correctly:
```shell
OPENAI_API_KEY=EMPTY
OPENAI_API_BASE=http://127.0.0.1:8866/v1/chat/completions # api_base for the judge model
LOCAL_LLM=qwen2-7b-instruct # model_id for the judge model
```
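
If you launch the evaluation from a Python script rather than a shell, the same variables can be set on the process environment before the task runs; a minimal sketch using only the standard library:
```python
import os

# Point the judge/extractor calls at the locally deployed model (values from above).
os.environ['OPENAI_API_KEY'] = 'EMPTY'
os.environ['OPENAI_API_BASE'] = 'http://127.0.0.1:8866/v1/chat/completions'
os.environ['LOCAL_LLM'] = 'qwen2-7b-instruct'
```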
##### Model Evaluation
Refer to the example file: [example_eval_vlm_swift](examples/example_eval_vlm_swift.py) to configure the evaluation task.
Execute the evaluation task:
```shell
python examples/example_eval_vlm_swift.py
```


### Local Dataset
