Merge branch 'package'
Ki-Seki committed Oct 3, 2024
2 parents cc6d76b + 0568d24 · commit a377daf
Showing 64 changed files with 104 additions and 66 deletions.
1 change: 1 addition & 0 deletions .env
@@ -0,0 +1 @@
PYTHONPATH=src
2 changes: 1 addition & 1 deletion .github/CONTRIBUTING.md
@@ -6,7 +6,7 @@ We appreciate your interest in contributing. To ensure a smooth collaboration, p
> Please ensure that your code passes all tests and `black` code formatting before opening a pull request.
> You can run the following commands to check your code:
> ```bash
> python -m unittest discover -s tests/ -p 'test*.py' -v
> PYTHONPATH=src python -m unittest discover -s tests/ -p 'test*.py' -v
> black . --check
> ```
2 changes: 1 addition & 1 deletion .github/workflows/python-package.yml
@@ -31,7 +31,7 @@ jobs:
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
- name: Test with unittest
run: |
python -m unittest discover -s tests/ -p 'test*.py' -v
PYTHONPATH=src python -m unittest discover -s tests/ -p 'test*.py' -v
- name: Test linting with black
run: |
black . --check
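
The `PYTHONPATH=src` prefix added to `.env`, the contributing guide, and the CI workflow reflects the new src layout: the tests now import the package as `eval_suite` from `src/` instead of the old top-level `eval` package. A minimal sketch of the same idea in plain Python (the `sys.path` tweak is illustrative and assumes the repository root as the working directory):

```python
# Illustrative sketch: what the PYTHONPATH=src prefix achieves, done in code.
# Assumes the current working directory is the repository root.
import sys
import unittest

sys.path.insert(0, "src")  # same effect as running with PYTHONPATH=src

# With src/ on the path, the relocated package resolves under its new name.
from eval_suite.benchs import ExampleQAEvaluator  # noqa: F401

if __name__ == "__main__":
    suite = unittest.defaultTestLoader.discover("tests", pattern="test*.py")
    unittest.TextTestRunner(verbosity=2).run(suite)
```
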
1 change: 1 addition & 0 deletions .gitignore
@@ -1,5 +1,6 @@
/.vscode/
/output/
/dist/

__pycache__/

40 changes: 17 additions & 23 deletions README.md
@@ -4,8 +4,8 @@

<p align="center">
<i>What does this repository include?</i><br>
<b><a href="./eval/benchs/uhgeval/">UHGEval</a></b>: An unconstrained hallucination evaluation benchmark.<br>
<b><a href="./eval/">Eval Suite</a></b>: A user-friendly evaluation framework for hallucination tasks.<br>
<b><a href="./src/eval_suite/benchs/uhgeval/">UHGEval</a></b>: An unconstrained hallucination evaluation benchmark.<br>
<b><a href="./src/eval_suite/">Eval Suite</a></b>: A user-friendly evaluation framework for hallucination tasks.<br>
Eval Suite supports other benchmarks, such as <a href="https://github.com/OpenMOSS/HalluQA">HalluQA</a> and <a href="https://github.com/RUCAIBox/HaluEval">HaluEval</a>.
</p>

@@ -31,36 +31,32 @@
## Quick Start

```bash
# Clone the repository
git clone https://github.com/IAAR-Shanghai/UHGEval.git
cd UHGEval

# Install dependencies
# Install Eval Suite
conda create -n uhg python=3.10
conda activate uhg
pip install -r requirements.txt
pip install eval-suite

# Run evaluations with OpenAI Compatible API
python -m eval.cli eval openai \
eval_suite eval openai \
--model_name gpt-4o \
--api_key your_api_key \
--base_url https://api.openai.com/v1 \
--evaluators ExampleQAEvaluator UHGSelectiveEvaluator

# Or run evaluations with Hugging Face Transformers
python -m eval.cli eval huggingface \
eval_suite eval huggingface \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--apply_chat_template \
--evaluators ExampleQAEvaluator UHGSelectiveEvaluator

# After evaluation, you can gather statistics of the evaluation results
python -m eval.cli stat
eval_suite stat

# List all available evaluators
python -m eval.cli list
eval_suite list

# Get help
python -m eval.cli --help
eval_suite --help
```

> [!Tip]
@@ -113,13 +109,13 @@ UHGEval is a large-scale benchmark designed for evaluating hallucination in prof

To facilitate evaluation, we have developed a user-friendly evaluation framework called Eval Suite. Currently, Eval Suite supports common hallucination evaluation benchmarks, allowing for comprehensive evaluation of the same LLM with just one command as shown in the [Quick Start](#quick-start) section.

| Benchmark | Evaluator | More Information |
| --------- | -------------------------------------------------------------------------------------------------------------- | ---------------------------------------------- |
| C-Eval | `CEvalEvaluator` | [eval/benchs/ceval](eval/benchs/ceval) |
| ExampleQA | `ExampleQAEvaluator` | [eval/benchs/exampleqa](eval/benchs/exampleqa) |
| HalluQA | `HalluQAMCEvaluator` | [eval/benchs/halluqa](eval/benchs/halluqa) |
| HaluEval | `HaluEvalDialogEvaluator`<br>`HaluEvalQAEvaluator`<br>`HaluEvalSummaEvaluator` | [eval/benchs/halueval](eval/benchs/halueval) |
| UHGEval | `UHGDiscKeywordEvaluator`<br>`UHGDiscSentenceEvaluator`<br>`UHGGenerativeEvaluator`<br>`UHGSelectiveEvaluator` | [eval/benchs/uhgeval](eval/benchs/uhgeval) |
| Benchmark | Evaluator | More Information |
| --------- | -------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------ |
| C-Eval | `CEvalEvaluator` | [src/eval_suite/benchs/ceval](src/eval_suite/benchs/ceval) |
| ExampleQA | `ExampleQAEvaluator` | [src/eval_suite/benchs/exampleqa](src/eval_suite/benchs/exampleqa) |
| HalluQA | `HalluQAMCEvaluator` | [src/eval_suite/benchs/halluqa](src/eval_suite/benchs/halluqa) |
| HaluEval | `HaluEvalDialogEvaluator`<br>`HaluEvalQAEvaluator`<br>`HaluEvalSummaEvaluator` | [src/eval_suite/benchs/halueval](src/eval_suite/benchs/halueval) |
| UHGEval | `UHGDiscKeywordEvaluator`<br>`UHGDiscSentenceEvaluator`<br>`UHGGenerativeEvaluator`<br>`UHGSelectiveEvaluator` | [src/eval_suite/benchs/uhgeval](src/eval_suite/benchs/uhgeval) |

## Learn More

@@ -162,8 +158,6 @@ To facilitate evaluation, we have developed a user-friendly evaluation framework
<details><summary>Click me to show all TODOs</summary>

- [ ] feat: vLLM offline inference benchmarking
- [ ] build: packaging
- [ ] feat(benchs): add TruthfulQA benchmark
- [ ] other: promotion

- [ ] ci: auto release to PyPI
</details>
10 changes: 7 additions & 3 deletions demo.ipynb
@@ -15,9 +15,13 @@
"metadata": {},
"outputs": [],
"source": [
"from eval.benchs import ExampleQAEvaluator, get_all_evaluator_classes, load_evaluator\n",
"from eval.llms import HuggingFace, OpenAIAPI\n",
"from eval.utils import save_stats"
"from eval_suite.benchs import (\n",
" ExampleQAEvaluator,\n",
" get_all_evaluator_classes,\n",
" load_evaluator,\n",
")\n",
"from eval_suite.llms import HuggingFace, OpenAIAPI\n",
"from eval_suite.utils import save_stats"
]
},
{
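
The notebook imports simply move from the `eval` package to `eval_suite`. For orientation, a hypothetical end-to-end sketch of how these pieces could be combined is shown below; only the import paths and the `model_name` keyword (also used in `docs/experiments/20240822/expt.py`) come from this diff, while the remaining keyword arguments, the evaluator constructor, and the `evaluate()`/`save_stats()` calls are assumptions:

```python
# Hypothetical usage sketch -- everything beyond the imports is an assumption.
from eval_suite.benchs import ExampleQAEvaluator
from eval_suite.llms import OpenAIAPI
from eval_suite.utils import save_stats

llm = OpenAIAPI(
    model_name="gpt-4o",                   # keyword confirmed by the diff; value is a placeholder
    api_key="your_api_key",                # assumed keyword, mirroring the CLI flags
    base_url="https://api.openai.com/v1",  # assumed keyword, mirroring the CLI flags
)

evaluator = ExampleQAEvaluator(llm)  # assumed constructor signature
evaluator.evaluate()                 # assumed method name
save_stats()                         # assumed call; real signature not shown here
```
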
4 changes: 2 additions & 2 deletions docs/add-bench-or-model.md
@@ -2,7 +2,7 @@

## Adding a New Benchmark

You can refer to the structure of the `eval/benchs/exampleqa` folder, which serves as a minimal benchmark example. Additionally, you might want to check the `eval/benchs/base_dataset.py` and `eval/benchs/base_evaluator.py` files, as they provide the base classes for benchmarks.
You can refer to the structure of the `src/eval_suite/benchs/exampleqa` folder, which serves as a minimal benchmark example. Additionally, you might want to check the `src/eval_suite/benchs/base_dataset.py` and `src/eval_suite/benchs/base_evaluator.py` files, as they provide the base classes for benchmarks.

1. **Creating a Benchmark Folder**
- Create a new folder under the `benchs` directory.
@@ -33,7 +33,7 @@ You can refer to the structure of the `eval/benchs/exampleqa` folder, which serv

## Adding a New Model Loader

You can refer to the `eval/llms/huggingface.py` and `eval/llms/openai_api.py` files as examples for loading LLMs.
You can refer to the `src/eval_suite/llms/huggingface.py` and `src/eval_suite/llms/openai_api.py` files as examples for loading LLMs.

1. **Language Model Loader**
- Create a new file under the `llms` directory.
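
As a rough illustration of the "Adding a New Benchmark" steps under the relocated layout, a hypothetical evaluator subclass might look like the sketch below; the `scoring` hook and the `safe_request` call are guesses rather than the documented interface, so check `src/eval_suite/benchs/base_evaluator.py` and `src/eval_suite/llms/base_llm.py` for the real base classes:

```python
# Hypothetical sketch only -- hook and LLM method names are assumptions.
from eval_suite.benchs.base_evaluator import BaseEvaluator


class MyQAEvaluator(BaseEvaluator):
    """Toy evaluator: ask the LLM each question and check for the gold answer."""

    def scoring(self, data_point: dict) -> dict:  # assumed hook name
        response = self.llm.safe_request(data_point["question"])  # assumed LLM call
        return {"correct": data_point["answer"] in response}
```
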
2 changes: 1 addition & 1 deletion docs/architecture.md
@@ -9,7 +9,7 @@ A base evaluator and dataset under `benchs` provide default evaluation logic and
## Structure

```bash
eval
src/eval_suite/
├── __init__.py
├── cli.py # Command line interface
├── logging.py # Global logging configuration
4 changes: 2 additions & 2 deletions docs/experiments/20240822/expt.py
@@ -1,10 +1,10 @@
from eval.benchs import (
from eval_suite.benchs import (
UHGDiscKeywordEvaluator,
UHGDiscSentenceEvaluator,
UHGGenerativeEvaluator,
UHGSelectiveEvaluator,
)
from eval.llms import OpenAIAPI
from eval_suite.llms import OpenAIAPI

glm = OpenAIAPI(
model_name="THUDM/glm-4-9b-chat",
58 changes: 58 additions & 0 deletions pyproject.toml
@@ -0,0 +1,58 @@
[build-system]
requires = ["hatchling", "hatch-vcs"]
build-backend = "hatchling.build"

[project]
name = "eval_suite"
dependencies = [
# Common
"torch",
"tqdm",
"ipykernel",

# OpenAI API
"openai",
"tenacity",

# Hugging Face Transformers
"transformers",
"accelerate",
"sentencepiece",

# Metrics
"nltk",
"rouge_score",
"text2vec",
"absl-py",

# Formatting
"black",
"isort",
]
authors = [{ name = "Shichao Song", email = "[email protected]" }]
description = "User-friendly evaluation framework: Eval Suite & Benchmarks: UHGEval, HaluEval, HalluQA, etc."
license = { file = "LICENSE" }
keywords = [
"UHGEval",
"Chinese",
"hallucination",
"evaluation",
"llm",
"eval_suite",
]
requires-python = ">=3.10"
classifiers = [
"Programming Language :: Python :: 3",
"License :: OSI Approved :: Apache Software License",
"Operating System :: OS Independent",
]
dynamic = ["readme", "version"]

[project.urls]
Repository = "https://github.com/IAAR-Shanghai/UHGEval"

[project.scripts]
eval_suite = "eval_suite.cli:main"

[tool.hatch.version]
source = "vcs"
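
The `[project.scripts]` table is what replaces `python -m eval.cli ...` with the bare `eval_suite ...` command in the README: installing the package generates a console script that behaves roughly like the sketch below (illustrative; not the file pip actually writes):

```python
# Roughly what the generated `eval_suite` console script does (illustrative sketch).
import sys

from eval_suite.cli import main

if __name__ == "__main__":
    sys.exit(main())
```
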
23 changes: 0 additions & 23 deletions requirements.txt

This file was deleted.

File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -1,7 +1,6 @@
import os

from eval.llms.base_llm import BaseLLM

from ...llms.base_llm import BaseLLM
from ..base_evaluator import BaseEvaluator
from .dataset import ExampleQADataset

File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
6 changes: 5 additions & 1 deletion eval/cli.py → src/eval_suite/cli.py
@@ -51,7 +51,7 @@ def parse_args():
# fmt: on


if __name__ == "__main__":
def main():
args = parse_args()
logger.info(f"Start the CLI with args: {args}")

@@ -80,3 +80,7 @@ def parse_args():
elif args.operation_name == "list":
print("All evaluators:")
pprint(all_evaluators)


if __name__ == "__main__":
main()
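
This is the standard entry-point refactor: the body that previously ran under `if __name__ == "__main__":` moves into `main()`, so both `python -m eval_suite.cli` and the packaged `eval_suite` script can invoke the same function. A self-contained sketch of the pattern (subcommand names match the CLI; the rest is illustrative):

```python
# Illustrative sketch of the main() entry-point pattern; not the actual cli.py.
import argparse
from pprint import pprint


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="eval_suite")
    subparsers = parser.add_subparsers(dest="operation_name", required=True)
    subparsers.add_parser("eval")  # run an evaluation
    subparsers.add_parser("stat")  # gather statistics of previous results
    subparsers.add_parser("list")  # list all available evaluators
    return parser.parse_args()


def main():
    args = parse_args()
    if args.operation_name == "list":
        print("All evaluators:")
        pprint(["ExampleQAEvaluator", "UHGSelectiveEvaluator"])  # placeholder list


if __name__ == "__main__":
    main()
```
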
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -1,6 +1,6 @@
import unittest

from eval.benchs.base_dataset import DummyDataset
from eval_suite.benchs.base_dataset import DummyDataset


class TestDummyDataset(unittest.TestCase):
@@ -2,8 +2,8 @@
import unittest
from unittest.mock import MagicMock

from eval.benchs.base_evaluator import DummyEvaluator
from eval.llms.base_llm import BaseLLM
from eval_suite.benchs.base_evaluator import DummyEvaluator
from eval_suite.llms.base_llm import BaseLLM


class TestDummyEvaluator(unittest.TestCase):
File renamed without changes.
@@ -1,7 +1,7 @@
import unittest
from unittest.mock import MagicMock

from eval.llms.base_llm import BaseLLM
from eval_suite.llms.base_llm import BaseLLM


class TestBaseLLM(unittest.TestCase):
@@ -2,7 +2,7 @@

import torch

from eval.llms.huggingface import HuggingFace
from eval_suite.llms.huggingface import HuggingFace


class TestHuggingFace(unittest.TestCase):
@@ -1,6 +1,6 @@
import unittest

from eval.llms.openai_api import OpenAIAPI
from eval_suite.llms.openai_api import OpenAIAPI


class TestOpenAIAPI(unittest.TestCase):
2 changes: 1 addition & 1 deletion tests/test_metrics.py
@@ -1,6 +1,6 @@
import unittest

from eval.metrics import bert_score, bleu_4, keyword_precision, rouge_l
from eval_suite.metrics import bert_score, bleu_4, keyword_precision, rouge_l


class TestEvaluationFunctions(unittest.TestCase):
