Evaluation is crucial for retrieval-augmented generation (RAG) pipelines because it checks the accuracy and relevance of both the retrieved information and the generated content.
Evaluating the performance of a RAG pipeline requires three components:
- Data for testing.
- Automated metrics to measure performance of both the context retrieval and response generation.
- Human-like evaluation of the generated response from the end-to-end pipeline.
⚠️ NOTE: This tool provides a set of notebooks that show how to address these requirements in an automated fashion for the default canonical developer RAG example.
Using an existing knowledge base, we can synthetically generate question|answer|context triplets with an LLM. This tool uses the Llama 2 70B model on NVIDIA AI Playground for data generation.
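The triplet generation step can be sketched roughly as follows. This is a minimal illustration, not the notebook's implementation: `Triplet`, `PROMPT_TEMPLATE`, `generate_triplets`, and `fake_llm` are hypothetical names, and `llm` stands for any callable that sends a prompt to a model (such as Llama 2 70B) and returns its text reply.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Triplet:
    question: str
    answer: str
    context: str


# Hypothetical prompt; the notebooks may word theirs differently.
PROMPT_TEMPLATE = (
    "Given the context below, write one question that it answers and the answer.\n"
    'Reply with JSON only: {{"question": "...", "answer": "..."}}\n\n'
    "Context:\n{context}"
)


def generate_triplets(chunks: List[str], llm: Callable[[str], str]) -> List[Triplet]:
    """Ask the LLM for one question/answer pair per chunk; the chunk is the context."""
    triplets = []
    for chunk in chunks:
        raw = llm(PROMPT_TEMPLATE.format(context=chunk))
        try:
            parsed = json.loads(raw)
            triplets.append(Triplet(parsed["question"], parsed["answer"], chunk))
        except (json.JSONDecodeError, KeyError, TypeError):
            # Skip chunks where the model did not return the requested JSON.
            continue
    return triplets


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return json.dumps({
        "question": "What does RAG stand for?",
        "answer": "Retrieval-augmented generation.",
    })


if __name__ == "__main__":
    for t in generate_triplets(["RAG stands for retrieval-augmented generation."], fake_llm):
        print(t)
```

Replacing `fake_llm` with a real model call is the only change needed to try this against an actual knowledge base. Replies that are not valid JSON are dropped rather than repaired, which keeps the test set clean at the cost of some coverage.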
RAGAS is an automated metrics tool for measuring the performance of both the retriever and the generator. We use the NVIDIA AI Playground LangChain wrapper to run the RAGAS evaluation on our example RAG pipeline.
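To make the retriever/generator split concrete, here is a small sketch that scores each side separately. It does not use RAGAS and does not reproduce its metrics, which rely on an LLM. As a simpler stand-in, it computes a retrieval hit rate and a token-overlap F1 for answers. The function names are hypothetical.

```python
from collections import Counter
from typing import List


def hit_rate(retrieved: List[List[str]], expected: List[str]) -> float:
    """Retriever side: fraction of questions whose expected context was retrieved."""
    if not expected:
        return 0.0
    hits = sum(1 for chunks, gold in zip(retrieved, expected) if gold in chunks)
    return hits / len(expected)


def token_f1(prediction: str, reference: str) -> float:
    """Generator side: token-overlap F1 between a generated and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(hit_rate([["chunk a", "chunk b"], ["chunk c"]], ["chunk a", "chunk d"]))
    print(token_f1("the cat", "the cat sat"))
```

Scoring the two stages separately helps locate a failure: a low hit rate points at retrieval, while a high hit rate with low answer overlap points at generation.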
We can use LLMs to provide human-like feedback and Likert evaluation scores for full end-to-end RAG pipelines. This tool uses Llama 2 70B as a judge LLM.
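The judge step might look roughly like the sketch below. It is an illustration under assumptions, not the notebook's code: `JUDGE_PROMPT`, `parse_likert`, and `judge` are hypothetical, the 1 to 5 scale and the `Score: <number>` reply format are choices made for this example, and `llm` is any callable that returns the judge model's text reply.

```python
import re
from statistics import mean
from typing import Callable, Dict, List, Optional

# Hypothetical judge prompt; the wording and scale are assumptions for this sketch.
JUDGE_PROMPT = (
    "Rate the answer to the question on a Likert scale from 1 (poor) to 5 (excellent).\n"
    "Reply in the form 'Score: <number>' followed by a short justification.\n\n"
    "Question: {question}\nReference answer: {reference}\nAnswer to rate: {answer}"
)


def parse_likert(reply: str) -> Optional[int]:
    """Extract a 1-5 score from the judge's reply, or None if no valid score is found."""
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge(records: List[Dict[str, str]], llm: Callable[[str], str]) -> float:
    """Average the Likert scores over all records the judge rated in the expected format."""
    scores = []
    for record in records:
        score = parse_likert(llm(JUDGE_PROMPT.format(**record)))
        if score is not None:
            scores.append(score)
    return mean(scores) if scores else float("nan")


if __name__ == "__main__":
    records = [{
        "question": "What does RAG stand for?",
        "reference": "Retrieval-augmented generation.",
        "answer": "It stands for retrieval-augmented generation.",
    }]
    # Stand-in judge so the sketch runs offline.
    print(judge(records, lambda prompt: "Score: 5. The answer matches the reference."))
```

Asking for a fixed reply format keeps parsing simple. Replies that do not follow it are excluded from the average rather than guessed at.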
- Follow steps 1 - 5 in the "Prepare the environment" section of example 02.
- Deploy the developer RAG example via Docker compose by following these steps.
- Build and deploy the evaluation service:
$ docker compose -f deploy/compose/docker-compose-evaluation.yaml build
$ docker compose -f deploy/compose/docker-compose-evaluation.yaml up -d
- Access the notebook server at http://host-ip:8889 from your web browser and try out the notebooks sequentially, starting from Notebook 1: Synthetic Data Generation for RAG Evaluation.
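Before opening the browser, it can help to confirm that the notebook server is answering. The sketch below is one way to do that with the Python standard library. `notebook_url` and `is_reachable` are hypothetical helpers, and `localhost` in the usage line is only a placeholder for your host IP.

```python
import urllib.error
import urllib.request


def notebook_url(host: str, port: int = 8889) -> str:
    """Build the notebook server URL; 8889 is the port given in the step above."""
    return f"http://{host}:{port}"


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if the server answers at all, even with an HTTP error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded (for example with 403 before login), so it is up.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return False


if __name__ == "__main__":
    url = notebook_url("localhost")  # placeholder: substitute your host IP
    print(url, "reachable" if is_reachable(url) else "not reachable")
```

If the check fails right after `docker compose ... up -d`, the container may still be starting, and retrying after a few seconds is reasonable.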