RAG is the application of information retrieval techniques to generative models, such as LLMs, to produce relevant and grounded responses, conditioned on an external dataset or knowledge base.
Some key challenges with LLMs that are addressed by RAG are:
- Hallucinations
- Adaptability, or lack of expert knowledge (e.g. internal company documentation)
- Limited information due to knowledge cut-off
However, it is not a silver bullet and requires careful consideration in terms of its components and architecture. For example, it is sensitive to things like:
- Quality of embeddings
- Base LLM model for response generation
- Chunking strategy
- "Lost in the Middle" issues where context/information from the middle is lost
- Retrieval strategies (vector-based, keyword-based, hybrid, etc.), and more
`rag` is built on top of llama-index and provides a host of options to test different approaches for chunking, embedding, retrieval, and generation. Thanks to its modular design, these are easily extensible to future methods, models, etc.
- Chunking:
  - Base: Uses a basic sentence splitter for chunking, along with additional sanitization and cleaning
  - Sentence window: Uses a sliding window of sentences to chunk the document
  - Child-to-parent: Uses variable-length chunks and creates child-parent relationships, where retrieval is performed on the child nodes and generation uses the full context of the parent nodes
- Embeddings:
  - OpenAI's `text-embedding-3-large`
  - Cohere's `embed-english-v3.0`
- LLMs:
  - OpenAI's `gpt-3.5-turbo` and `gpt-4-turbo-preview`
  - Mistral's `mistral-medium`
- Retrieval Strategies:
  - BM25 (Best Matching 25, a keyword-based ranking algorithm)
  - Vector search
  - Hybrid (BM25 + vector search)
A pipeline is a combination of any of the above; see the sketch below for one example.
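As an illustration, here is a minimal sketch of one such combination (base chunking, OpenAI embeddings, hybrid retrieval, `gpt-3.5-turbo`) using llama-index primitives. This is not the repository's actual wiring; the `data/` path and parameter values are assumptions:

```python
# Hedged sketch of one pipeline combination; the repo's own modules may differ.
from dotenv import load_dotenv
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.retrievers.bm25 import BM25Retriever

load_dotenv()  # picks up the API keys from the .env file described below

# Chunking: the "base" strategy, i.e. a plain sentence splitter
documents = SimpleDirectoryReader("data").load_data()  # "data/" is an assumed path
nodes = SentenceSplitter(chunk_size=512).get_nodes_from_documents(documents)

# Embeddings + vector index
index = VectorStoreIndex(nodes, embed_model=OpenAIEmbedding(model="text-embedding-3-large"))

# Retrieval: hybrid = vector search fused with BM25
retriever = QueryFusionRetriever(
    [
        index.as_retriever(similarity_top_k=5),
        BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=5),
    ],
    similarity_top_k=5,
    num_queries=1,  # disable LLM query generation; just fuse the two retrievers
)

# Generation
query_engine = RetrieverQueryEngine.from_args(retriever, llm=OpenAI(model="gpt-3.5-turbo"))
print(query_engine.query("what is the best car insurance?"))
```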
Note: Tested on Ubuntu 20.04 with Python 3.11
- Clone this repository and change directory into it
- Install `poetry` and run `poetry install` to fetch all dependencies
- Activate the environment using `poetry shell`
- Use `python -m main --query <query>` for the RAG pipeline to respond to your query, e.g.:
```
$ python -m main --query "what is the best car insurance?"
> According to the provided context, the "Comprehensive Plus Insurance" is the top cover offered, which includes loss or damage to your vehicle, up to 15 additional benefits, liability cover, and optional add-ons for an extra cost. However, the best car insurance for you would depend on your specific needs and circumstances. It's always a good idea to carefully review the coverage options and consider factors such as cost, deductibles, and coverage limits before making a decision.
```
Note: If no additional args are provided, the default pipeline is used, which is configured to the best-performing combination as per the evaluation section below.
- Optionally, run `python -m eval` to execute all available RAG pipelines for evaluation
Make sure to add a `.env` file at the root directory containing valid `OPENAI_API_KEY`, `COHERE_API_KEY`, and `MISTRAL_API_KEY`.
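A minimal example (all values are placeholders; substitute your own keys):

```
OPENAI_API_KEY=...
COHERE_API_KEY=...
MISTRAL_API_KEY=...
```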
GPT-4 was used to generate a synthetic dataset, comprising two questions per node/chunk based on the documents, as the ground truth. The dataset can be accessed under `qa_dataset/`.
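For reference, this kind of dataset can be produced with llama-index's built-in helper. The following is a hedged sketch rather than the repository's actual generation script, and the output filename is an assumption:

```python
# Hedged sketch: generate two questions per chunk as ground truth with GPT-4.
from llama_index.core.evaluation import generate_question_context_pairs
from llama_index.llms.openai import OpenAI

qa_dataset = generate_question_context_pairs(
    nodes,                      # chunks/nodes from the pipeline sketch above
    llm=OpenAI(model="gpt-4"),
    num_questions_per_chunk=2,  # matches the two questions per node/chunk
)
qa_dataset.save_json("qa_dataset/qa_dataset.json")  # filename is an assumption
```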
Note: Experiment results may vary based on the data characteristics, sample size, and other variables like `chunk_size`, `similarity_top_k`, and so on.
The following metrics were used to evaluate the pipelines for retrieval and response, respectively:
- Retrieval
  - Hit Rate: Fraction of ground-truth samples that were retrieved
  - MRR: Mean Reciprocal Rank
- Response
  - Faithfulness: Percentage of samples where the generated response matched the ground truth
  - Relevancy: Percentage of samples where the generated response was relevant to the question
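For intuition, here is a small self-contained sketch of how the two retrieval metrics are commonly computed (illustrative only; the repository's evaluation code may differ):

```python
# Illustrative computation of Hit Rate and MRR.
# `results` maps each query id to (ranked retrieved node ids, expected node id).
def hit_rate_and_mrr(results: dict) -> tuple[float, float]:
    hits, reciprocal_ranks = 0, 0.0
    for retrieved_ids, expected_id in results.values():
        if expected_id in retrieved_ids:
            hits += 1
            # Ranks are 1-based: a hit in first position contributes 1.0
            reciprocal_ranks += 1.0 / (retrieved_ids.index(expected_id) + 1)
    n = len(results)
    return hits / n, reciprocal_ranks / n

# Two queries: one hit at rank 1, one at rank 3
results = {
    "q1": (["n1", "n7", "n3"], "n1"),
    "q2": (["n2", "n5", "n9"], "n9"),
}
print(hit_rate_and_mrr(results))  # (1.0, 0.666...), i.e. MRR = (1 + 1/3) / 2
```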
- When it comes to overall performance, including metrics for both retrieval and response, Mistral's `mistral-medium` with OpenAI's `text-embedding-3-large` using a hybrid approach (BM25 with vector search) performs the best. Using the same hybrid approach and embeddings but with GPT-3.5-turbo as the LLM comes in second best.
- On the whole, Mistral medium is more faithful than GPT-3.5-turbo. However, only one combination, i.e. `gpt-3.5-turbo` with Cohere's `embed-english-v3.0` using hybrid retrieval, scored perfectly on both faithfulness and relevancy.
- For retrieval strategies, the hybrid approach consistently outperforms the other two approaches.
- Keyword-based traditional strategies unsurprisingly perform the worst. However, they tend to augment vector-based methods and improve overall performance, as per (3). Additionally, reranking methods can be employed to further boost performance; see the sketch below.
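Reranking is not among the strategies tested here, but as an illustration it could be slotted in as a llama-index node postprocessor, continuing from the pipeline sketch above (the `top_n` value is an assumption):

```python
# Hedged sketch: rerank retrieved nodes with Cohere before generation.
# `retriever` is the hybrid retriever from the pipeline sketch above.
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.llms.openai import OpenAI
from llama_index.postprocessor.cohere_rerank import CohereRerank

reranker = CohereRerank(top_n=3)  # reads COHERE_API_KEY from the environment
query_engine = RetrieverQueryEngine.from_args(
    retriever,
    llm=OpenAI(model="gpt-3.5-turbo"),
    node_postprocessors=[reranker],  # reranks nodes after retrieval, before synthesis
)
```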
Below are the overall evaluation results, sorted by hit rate, MRR, faithfulness, and relevancy:
| LLM | Embedding | Chunking Strategy | Retrieval Strategy | Hit Rate | MRR | Faithfulness | Relevancy |
|---|---|---|---|---|---|---|---|
| mistral-medium | text-embedding-3-large | base | hybrid | 0.827922 | 0.635823 | 1 | 0.8 |
| gpt-3.5-turbo | text-embedding-3-large | base | hybrid | 0.827922 | 0.635552 | 0.9 | 0.9 |
| gpt-3.5-turbo | embed-english-v3.0 | base | hybrid | 0.798701 | 0.625 | 1 | 1 |
| mistral-medium | embed-english-v3.0 | base | hybrid | 0.798701 | 0.625 | 1 | 0.9 |
| mistral-medium | embed-english-v3.0 | base | vector | 0.668831 | 0.582792 | 1 | 0.9 |
| gpt-3.5-turbo | embed-english-v3.0 | base | vector | 0.668831 | 0.582792 | 1 | 0.9 |
| gpt-3.5-turbo | text-embedding-3-large | base | vector | 0.665584 | 0.590909 | 0.8 | 0.8 |
| mistral-medium | text-embedding-3-large | base | vector | 0.665584 | 0.589286 | 1 | 0.8 |
| mistral-medium | embed-english-v3.0 | base | bm25 | 0.63961 | 0.574675 | 1 | 0.8 |
| mistral-medium | text-embedding-3-large | base | bm25 | 0.63961 | 0.574675 | 1 | 0.7 |
| gpt-3.5-turbo | text-embedding-3-large | base | bm25 | 0.63961 | 0.574675 | 0.9 | 0.8 |
| gpt-3.5-turbo | embed-english-v3.0 | base | bm25 | 0.63961 | 0.574675 | 0.9 | 0.8 |
Note: GPT-4-turbo was used as the LLM for response evaluation to measure faithfulness and relevancy. Only the first 10 samples were considered due to the prohibitive cost of using it as an evaluator. The `max_queries` argument can be overridden to evaluate more samples.
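For example, assuming an argparse-style flag (the exact flag syntax is an assumption; the source only names the `max_queries` argument):

```
$ python -m eval --max_queries 50
```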