
Davide/tutorial #97

Merged (8 commits) on Apr 23, 2024
50 changes: 49 additions & 1 deletion docs/source/evaluation_concepts.md
@@ -1,6 +1,54 @@
Evaluation
====================================

## lm-evaluation-harness

Evaluation is currently done via [EleutherAI's lm-evaluation-harness package](https://github.com/EleutherAI/lm-evaluation-harness) run as a process. Evaluation can run either on HuggingFace models hosted on the Hub, or on local models in shared storage on a Linux filesystem that resolve to [Weights and Biases Artifacts](https://docs.wandb.ai/ref/python/artifact) objects.

In the `evaluation` directory, there are sample files for running evaluation on a model in HuggingFace (`lm_harness_hf_config.yaml`) or using a local inference server hosted on vLLM (`lm_harness_inference_server_config.yaml`).
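As a sketch (assuming the lm-harness evaluation is exposed through the same `lm_buddy` entry point as the Prometheus command shown below), a run against the HuggingFace config might look like:

```
# assumes an `evaluate lm-harness` subcommand analogous to `evaluate prometheus` below
python -m lm_buddy evaluate lm-harness --config /path/to/lm_harness_hf_config.yaml
```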

## Prometheus

Evaluation relies on [Prometheus](https://github.com/kaistAI/Prometheus) as the LLM judge. We internally serve it via [vLLM](https://github.com/vllm-project/vllm), but any other OpenAI-API-compatible service should work (e.g. llamafile via its `api_like_OAI.py` script).
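For example, the judge model referenced in the sample config below can be served with vLLM's OpenAI-compatible server (a sketch only; adjust the model path and port to your deployment):

```
# start an OpenAI-compatible server for the Prometheus judge model
python -m vllm.entrypoints.openai.api_server \
    --model kaist-ai/prometheus-13b-v1.0 \
    --port 8000
```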

Input datasets _must_ be in HuggingFace format. The code below shows how to convert Prometheus benchmark datasets and optionally save them as wandb artifacts:

```
import wandb
from datasets import load_dataset
from lm_buddy.tracking.artifact_utils import (
    ArtifactType,
    build_directory_artifact,
)
from lm_buddy.jobs.common import JobType

artifact_name = "tutorial_vicuna_eval"
dataset_fname = "/path/to/prometheus/evaluation/benchmark/data/vicuna_eval.json"
output_path = "/tmp/tutorial_vicuna_eval"

# load the json dataset and save it in HF format
ds = load_dataset("json", data_files=dataset_fname, split="train")
ds.save_to_disk(output_path)

# log the saved HF dataset directory as a wandb artifact
with wandb.init(
    job_type=JobType.PREPROCESSING,
    project="wandb-project-name",
    entity="wandb-entity-name",
    name=artifact_name,
):
    artifact = build_directory_artifact(
        dir_path=output_path,
        artifact_name=artifact_name,
        artifact_type=ArtifactType.DATASET,
        reference=False,
    )
    wandb.log_artifact(artifact)
```
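Once logged, the artifact can be referenced from the evaluation config's dataset `path` as a `wandb://` URI following the `entity/project/name:version` pattern used in the sample config below, e.g. `wandb://wandb-entity-name/wandb-project-name/tutorial_vicuna_eval:latest`.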

In the `evaluation` directory, you will find a sample `prometheus_config.yaml` file for running Prometheus evaluation. Before using it, you will need to specify the `path` of the input dataset, the `base_url` where the Prometheus model is served, and
the `tracking` options to save the evaluation output on wandb.

You can then run the evaluation as:

```
python -m lm_buddy evaluate prometheus --config /path/to/prometheus_config.yaml
```
14 changes: 10 additions & 4 deletions examples/configs/evaluation/prometheus_config.yaml
@@ -1,13 +1,19 @@
name: "lm-buddy-prometheus-job"

dataset:
  # dataset stored as wandb artifact
  path: "wandb://sample-entity/lm-buddy-examples/wandb-file-artifact:latest"
  # dataset stored locally on disk
  # path: "file:///path/to/hf_dataset_directory"
  # field containing scoring instructions in the json file
  text_field: "instruction"

prometheus:
  inference:
    base_url: "http://your.vllm.server:8000/v1"
    # if you use llamafile and api_like_OAI.py,
    # the base url will be the following one
    # base_url: "http://localhost:8081/v1"
    engine: "hf://kaist-ai/prometheus-13b-v1.0"
  best_of: 1
  max_tokens: 512
@@ -21,12 +27,12 @@ evaluation:
  # max number of retries if a communication error
  # with the server occurs
  max_retries: 5
  # min and max scores as defined in the scoring rubric
  min_score: 1
  max_score: 5
  # scores as defined in the scoring rubric
  scores: ["1", "2", "3", "4", "5"]
  # enable/disable tqdm to track eval progress
  enable_tqdm: True

# save evaluation results as a wandb artifact
tracking:
  project: "lm-buddy-examples"
  entity: "sample"
  entity: "sample-entity"
1 change: 1 addition & 0 deletions pyproject.toml
@@ -45,6 +45,7 @@ dependencies = [
    "ragas==0.1.5",
    "langchain-community==0.0.29",
    "langchain_openai==0.1.1",
    "sentencepiece==0.2.0",
]

[project.optional-dependencies]