This repository has been archived by the owner on Sep 24, 2024. It is now read-only.

Davide/tutorial #97

Merged 8 commits on Apr 23, 2024
50 changes: 49 additions & 1 deletion docs/source/evaluation_concepts.md
@@ -1,6 +1,54 @@
Evaluation
====================================

All evaluation is currently done via [EleutherAI's lm-evaluation-harness package](https://github.com/EleutherAI/lm-evaluation-harness) run as a process. Evaluation can either happen on HuggingFace models hosted on the Hub, or on local models in shared storage on a Linux filesystem that resolve to [Weights and Biases Artifacts](https://docs.wandb.ai/ref/python/artifact) objects.
## lm-evaluation-harness

[EleutherAI's lm-evaluation-harness package](https://github.com/EleutherAI/lm-evaluation-harness) is used internally to access a variety of benchmark datasets. The model to evaluate can be loaded directly from the HuggingFace Hub, from a local model checkpoint saved on the filesystem, or from a [Weights and Biases artifact](https://docs.wandb.ai/ref/python/artifact) object based on the `path` parameter specified in the evaluation config.
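
For illustration, the `path` value in an evaluation config might take one of the following forms. These are hypothetical placeholder values based on the URI schemes used in the sample configs; consult the sample files for the exact schema:

```
# hypothetical placeholder values -- see the sample configs for the exact schema
# model hosted on the HuggingFace Hub
path: "hf://org-name/model-name"
# local checkpoint saved on the filesystem
path: "file:///path/to/model/checkpoint"
# model stored as a Weights and Biases artifact
path: "wandb://sample-entity/lm-buddy-examples/model-artifact:latest"
```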

In the `evaluation` directory, there are sample files for running evaluation on a model hosted on the HuggingFace Hub (`lm_harness_hf_config.yaml`) or on a local inference server hosted with vLLM (`lm_harness_inference_server_config.yaml`).

## Prometheus

Evaluation relies on [Prometheus](https://github.com/kaistAI/Prometheus) as an LLM judge. We serve it internally via [vLLM](https://github.com/vllm-project/vllm), but any other OpenAI API-compatible service should work (e.g. llamafile via its `api_like_OAI.py` script).
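
As a sketch, serving the Prometheus model through vLLM's OpenAI-compatible server could look like the following; the exact entrypoint and flags depend on your vLLM version:

```
# assumes a vLLM version exposing the OpenAI-compatible api_server entrypoint
python -m vllm.entrypoints.openai.api_server \
    --model kaist-ai/prometheus-13b-v1.0 \
    --port 8000
```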

Input datasets _must_ be saved as HuggingFace [datasets.Dataset](https://huggingface.co/docs/datasets/v2.19.0/en/package_reference/main_classes#datasets.Dataset). The code below shows how to convert Prometheus benchmark datasets and optionally save them as wandb artifacts:

```
import wandb
from datasets import load_dataset
from lm_buddy.tracking.artifact_utils import (
    ArtifactType,
    build_directory_artifact,
)
from lm_buddy.jobs.common import JobType

artifact_name = "tutorial_vicuna_eval"
dataset_fname = "/path/to/prometheus/evaluation/benchmark/data/vicuna_eval.json"
output_path = "/tmp/tutorial_vicuna_eval"

# load the json dataset and save it in HF format
ds = load_dataset("json", data_files=dataset_fname, split="train")
ds.save_to_disk(output_path)

with wandb.init(
    job_type=JobType.PREPROCESSING,
    project="wandb-project-name",
    entity="wandb-entity-name",
    name=artifact_name,
):
    artifact = build_directory_artifact(
        dir_path=output_path,
        artifact_name=artifact_name,
        artifact_type=ArtifactType.DATASET,
        reference=False,
    )
    wandb.log_artifact(artifact)
```
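
The logged artifact can then be referenced from an evaluation config through a `wandb://` path of the form `wandb://<entity>/<project>/tutorial_vicuna_eval:latest`, analogous to the sample path shown in the config below.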

In the `evaluation` directory, you will find a sample `prometheus_config.yaml` file for running Prometheus evaluation. Before using it, you will need to specify the `path` of the input dataset, the `base_url` where the Prometheus model is served, and the `tracking` options used to save the evaluation output to wandb.
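
For reference, the relevant fields look roughly like this (the values are placeholders mirroring the example config):

```
dataset:
  # dataset stored as a wandb artifact (or a local "file://" path)
  path: "wandb://sample-entity/lm-buddy-examples/wandb-file-artifact:latest"

prometheus:
  inference:
    base_url: "http://your.vllm.server:8000/v1"

tracking:
  project: "lm-buddy-examples"
  entity: "sample-entity"
```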

You can then run the evaluation as:

```
lm_buddy evaluate prometheus --config /path/to/prometheus_config.yaml
```
14 changes: 10 additions & 4 deletions examples/configs/evaluation/prometheus_config.yaml
@@ -1,13 +1,19 @@
name: "lm-buddy-prometheus-job"

dataset:
# dataset stored as wandb artifact
path: "wandb://sample-entity/lm-buddy-examples/wandb-file-artifact:latest"
# dataset stored locally on disk
# path: "file:///path/to/hf_dataset_directory"
# field containing scoring instructions in the json file
text_field: "instruction"

prometheus:
inference:
base_url: "http://your.vllm.server:8000/v1"
# if you use llamafile and api_like_OAI.py,
# the base url will be the following one
# base_url: "http://localhost:8081/v1"
engine: "hf://kaist-ai/prometheus-13b-v1.0"
best_of: 1
max_tokens: 512
@@ -21,12 +27,12 @@ evaluation:
# max number of retries if a communication error
# with the server occurs
max_retries: 5
# min and max scores as defined in the scoring rubric
min_score: 1
max_score: 5
# scores as defined in the scoring rubric
scores: ["1", "2", "3", "4", "5"]
# enable/disable tqdm to track eval progress
enable_tqdm: True

# save evaluation results as a wandb artifact
tracking:
project: "lm-buddy-examples"
entity: "sample"
entity: "sample-entity"
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "lm-buddy"
version = "0.10.1"
version = "0.10.2"
authors = [
{ name = "Sean Friedowitz", email = "[email protected]" },
{ name = "Aaron Gonzales", email = "[email protected]" },
@@ -45,6 +45,7 @@ dependencies = [
"ragas==0.1.5",
"langchain-community==0.0.29",
"langchain_openai==0.1.1",
"sentencepiece==0.2.0",
]

[project.optional-dependencies]
2 changes: 1 addition & 1 deletion tests/unit/test_preprocessing.py
@@ -5,7 +5,7 @@


def test_prompt_formatting(resources_dir):
dataset = load_from_disk(resources_dir / "datasets" / "tiny_shakespeare")
dataset = load_from_disk(str(resources_dir / "datasets" / "tiny_shakespeare"))

template = "Let's put some {text} in here"
formatted_dataset = format_dataset_with_prompt(dataset, template, output_field="prompt")