
Davide/tutorial #97

Merged (8 commits) on Apr 23, 2024
50 changes: 49 additions & 1 deletion docs/source/evaluation_concepts.md
@@ -1,6 +1,54 @@
Evaluation
====================================

## lm-evaluation-harness

Evaluation is currently done via [EleutherAI's lm-evaluation-harness package](https://github.com/EleutherAI/lm-evaluation-harness) run as a process. Evaluation can run either on HuggingFace models hosted on the Hub, or on local models in shared storage on a Linux filesystem that resolve to [Weights and Biases Artifacts](https://docs.wandb.ai/ref/python/artifact) objects.

In the `evaluation` directory, there are sample files for running evaluation on a model in HuggingFace (`lm_harness_hf_config.yaml`) or using a local inference server hosted on vLLM (`lm_harness_inference_server_config.yaml`).
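As a sketch (assuming the lm-harness evaluation is exposed through the same `lm_buddy` entry point as the Prometheus command shown below), a run against the HuggingFace config might look like:

```
# assumes an `evaluate lm-harness` subcommand analogous to `evaluate prometheus` below
python -m lm_buddy evaluate lm-harness --config /path/to/lm_harness_hf_config.yaml
```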

## Prometheus

Evaluation relies on [Prometheus](https://github.com/kaistAI/Prometheus) as the LLM judge. We internally serve it via [vLLM](https://github.com/vllm-project/vllm), but any other OpenAI-API-compatible service should work (e.g. llamafile via its `api_like_OAI.py` script).
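For example, the judge model referenced in the sample config below can be served with vLLM's OpenAI-compatible server (a sketch only; adjust the model path and port to your deployment):

```
# start an OpenAI-compatible server for the Prometheus judge model
python -m vllm.entrypoints.openai.api_server \
    --model kaist-ai/prometheus-13b-v1.0 \
    --port 8000
```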

Input datasets _must_ be in HuggingFace format. The code below shows how to convert Prometheus benchmark datasets and optionally save them as wandb artifacts:

```
import wandb
from datasets import load_dataset
from lm_buddy.tracking.artifact_utils import (
    ArtifactType,
    build_directory_artifact,
)
from lm_buddy.jobs.common import JobType

artifact_name = "tutorial_vicuna_eval"
dataset_fname = "/path/to/prometheus/evaluation/benchmark/data/vicuna_eval.json"
output_path = "/tmp/tutorial_vicuna_eval"

# load the json dataset and save it in HF format
ds = load_dataset("json", data_files=dataset_fname, split="train")
ds.save_to_disk(output_path)

# log the saved HF dataset directory as a wandb artifact
with wandb.init(
    job_type=JobType.PREPROCESSING,
    project="wandb-project-name",
    entity="wandb-entity-name",
    name=artifact_name,
):
    artifact = build_directory_artifact(
        dir_path=output_path,
        artifact_name=artifact_name,
        artifact_type=ArtifactType.DATASET,
        reference=False,
    )
    wandb.log_artifact(artifact)
```
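Once logged, the artifact can be referenced from the evaluation config's dataset `path` as a `wandb://` URI following the `entity/project/name:version` pattern used in the sample config below, e.g. `wandb://wandb-entity-name/wandb-project-name/tutorial_vicuna_eval:latest`.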

In the `evaluation` directory, you will find a sample `prometheus_config.yaml` file for running Prometheus evaluation. Before using it, you will need to specify the `path` of the input dataset, the `base_url` where the Prometheus model is served, and
the `tracking` options to save the evaluation output on wandb.

You can then run the evaluation as:

```
python -m lm_buddy evaluate prometheus --config /path/to/prometheus_config.yaml
```
14 changes: 10 additions & 4 deletions examples/configs/evaluation/prometheus_config.yaml
@@ -1,13 +1,19 @@
name: "lm-buddy-prometheus-job"

dataset:
  # dataset stored as wandb artifact
  path: "wandb://sample-entity/lm-buddy-examples/wandb-file-artifact:latest"
  # dataset stored locally on disk
  # path: "file:///path/to/hf_dataset_directory"
  # field containing scoring instructions in the json file
  text_field: "instruction"

prometheus:
  inference:
    base_url: "http://your.vllm.server:8000/v1"
    # if you use llamafile and api_like_OAI.py,
    # the base url will be the following one
    # base_url: "http://localhost:8081/v1"
    engine: "hf://kaist-ai/prometheus-13b-v1.0"
  best_of: 1
  max_tokens: 512
@@ -21,12 +27,12 @@ evaluation:
  # max number of retries if a communication error
  # with the server occurs
  max_retries: 5
  # min and max scores as defined in the scoring rubric
  min_score: 1
  max_score: 5
  # scores as defined in the scoring rubric
  scores: ["1", "2", "3", "4", "5"]
  # enable/disable tqdm to track eval progress
  enable_tqdm: True

# save evaluation results as a wandb artifact
tracking:
  project: "lm-buddy-examples"
  entity: "sample"
  entity: "sample-entity"
1 change: 1 addition & 0 deletions pyproject.toml
@@ -45,6 +45,7 @@ dependencies = [
    "ragas==0.1.5",
    "langchain-community==0.0.29",
    "langchain_openai==0.1.1",
    "sentencepiece==0.2.0",
]

[project.optional-dependencies]