-
Notifications
You must be signed in to change notification settings - Fork 752
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
feat: test-generation improvements (#1698)
Made multi hop query creation simpler Added docs to creat custom single hop query Added docs to creat custom multi hop query
- Loading branch information
1 parent
7d05143
commit cc31f65
Showing
17 changed files
with
1,529 additions
and
85 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
272 changes: 272 additions & 0 deletions
272
docs/howtos/customizations/testgenerator/_testgen-custom-single-hop.md
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,272 @@ | ||
# Create custom single-hop queries from your documents | ||
|
||
### Load sample documents | ||
I am using documents from [gitlab handbook](https://huggingface.co/datasets/explodinggradients/Sample_Docs_Markdown). You can download it by running the below command. | ||
|
||
|
||
```python | ||
from langchain_community.document_loaders import DirectoryLoader | ||
|
||
|
||
path = "Sample_Docs_Markdown/" | ||
loader = DirectoryLoader(path, glob="**/*.md") | ||
docs = loader.load() | ||
``` | ||
|
||
### Create KG | ||
|
||
Create a base knowledge graph with the documents | ||
|
||
|
||
```python | ||
from ragas.testset.graph import KnowledgeGraph | ||
from ragas.testset.graph import Node, NodeType | ||
|
||
|
||
kg = KnowledgeGraph() | ||
for doc in docs: | ||
kg.nodes.append( | ||
Node( | ||
type=NodeType.DOCUMENT, | ||
properties={ | ||
"page_content": doc.page_content, | ||
"document_metadata": doc.metadata, | ||
}, | ||
) | ||
) | ||
``` | ||
|
||
/opt/homebrew/Caskroom/miniforge/base/envs/ragas/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html | ||
from .autonotebook import tqdm as notebook_tqdm | ||
|
||
|
||
### Set up the LLM and Embedding Model | ||
You may use any of [your choice](/docs/howtos/customizations/customize_models.md), here I am using models from open-ai. | ||
|
||
|
||
```python | ||
from ragas.llms.base import llm_factory | ||
from ragas.embeddings.base import embedding_factory | ||
|
||
llm = llm_factory() | ||
embedding = embedding_factory() | ||
``` | ||
|
||
### Setup the transforms | ||
|
||
|
||
Here we are using 2 extractors and 2 relationship builders. | ||
- Headline extrator: Extracts headlines from the documents | ||
- Keyphrase extractor: Extracts keyphrases from the documents | ||
- Headline splitter: Splits the document into nodes based on headlines | ||
|
||
|
||
|
||
```python | ||
from ragas.testset.transforms import apply_transforms | ||
from ragas.testset.transforms import ( | ||
HeadlinesExtractor, | ||
HeadlineSplitter, | ||
KeyphrasesExtractor, | ||
) | ||
|
||
|
||
headline_extractor = HeadlinesExtractor(llm=llm) | ||
headline_splitter = HeadlineSplitter(min_tokens=300, max_tokens=1000) | ||
keyphrase_extractor = KeyphrasesExtractor( | ||
llm=llm, property_name="keyphrases", max_num=10 | ||
) | ||
``` | ||
|
||
|
||
```python | ||
transforms = [ | ||
headline_extractor, | ||
headline_splitter, | ||
keyphrase_extractor, | ||
] | ||
|
||
apply_transforms(kg, transforms=transforms) | ||
``` | ||
|
||
Applying KeyphrasesExtractor: 6%| | 2/36 [00:01<00:20, 1Property 'keyphrases' already exists in node '514fdc'. Skipping! | ||
Applying KeyphrasesExtractor: 11%| | 4/36 [00:01<00:10, 2Property 'keyphrases' already exists in node '84a0f6'. Skipping! | ||
Applying KeyphrasesExtractor: 64%|▋| 23/36 [00:03<00:01, Property 'keyphrases' already exists in node '93f19d'. Skipping! | ||
Applying KeyphrasesExtractor: 72%|▋| 26/36 [00:04<00:00, 1Property 'keyphrases' already exists in node 'a126bf'. Skipping! | ||
Applying KeyphrasesExtractor: 81%|▊| 29/36 [00:04<00:00, Property 'keyphrases' already exists in node 'c230df'. Skipping! | ||
Applying KeyphrasesExtractor: 89%|▉| 32/36 [00:04<00:00, 1Property 'keyphrases' already exists in node '4f2765'. Skipping! | ||
Property 'keyphrases' already exists in node '4a4777'. Skipping! | ||
|
||
### Configure personas | ||
|
||
You can also do this automatically by using the [automatic persona generator](/docs/howtos/customizations/testgenerator/_persona_generator.md) | ||
|
||
|
||
```python | ||
from ragas.testset.persona import Persona | ||
|
||
person1 = Persona( | ||
name="gitlab employee", | ||
role_description="A junior gitlab employee curious on workings on gitlab", | ||
) | ||
persona2 = Persona( | ||
name="Hiring manager at gitlab", | ||
role_description="A hiring manager at gitlab trying to underestand hiring policies in gitlab", | ||
) | ||
persona_list = [person1, persona2] | ||
``` | ||
|
||
## | ||
|
||
## SingleHop Query | ||
|
||
Inherit from `SingleHopQuerySynthesizer` and modify the function that generates scenarios for query creation. | ||
|
||
**Steps**: | ||
- find qualified set of nodes for the query creation. Here I am selecting all nodes with keyphrases extracted. | ||
- For each qualified set | ||
- Match the keyphrase with one or more persona. | ||
- Create all possible combinations of (Node, Persona, Query Style, Query Length) | ||
- Samples the required number of queries from the combinations | ||
|
||
|
||
```python | ||
from ragas.testset.synthesizers.single_hop import ( | ||
SingleHopQuerySynthesizer, | ||
SingleHopScenario, | ||
) | ||
from dataclasses import dataclass | ||
from ragas.testset.synthesizers.prompts import ( | ||
ThemesPersonasInput, | ||
ThemesPersonasMatchingPrompt, | ||
) | ||
|
||
|
||
@dataclass | ||
class MySingleHopScenario(SingleHopQuerySynthesizer): | ||
|
||
theme_persona_matching_prompt = ThemesPersonasMatchingPrompt() | ||
|
||
async def _generate_scenarios(self, n, knowledge_graph, persona_list, callbacks): | ||
|
||
property_name = "keyphrases" | ||
nodes = [] | ||
for node in knowledge_graph.nodes: | ||
if node.type.name == "CHUNK" and node.get_property(property_name): | ||
nodes.append(node) | ||
|
||
number_of_samples_per_node = max(1, n // len(nodes)) | ||
|
||
scenarios = [] | ||
for node in nodes: | ||
if len(scenarios) >= n: | ||
break | ||
themes = node.properties.get(property_name, [""]) | ||
prompt_input = ThemesPersonasInput(themes=themes, personas=persona_list) | ||
persona_concepts = await self.theme_persona_matching_prompt.generate( | ||
data=prompt_input, llm=self.llm, callbacks=callbacks | ||
) | ||
base_scenarios = self.prepare_combinations( | ||
node, | ||
themes, | ||
personas=persona_list, | ||
persona_concepts=persona_concepts.mapping, | ||
) | ||
scenarios.extend( | ||
self.sample_combinations(base_scenarios, number_of_samples_per_node) | ||
) | ||
|
||
return scenarios | ||
``` | ||
|
||
|
||
```python | ||
query = MySingleHopScenario(llm=llm) | ||
``` | ||
|
||
|
||
```python | ||
scenarios = await query.generate_scenarios( | ||
n=5, knowledge_graph=kg, persona_list=persona_list | ||
) | ||
``` | ||
|
||
|
||
```python | ||
scenarios[0] | ||
``` | ||
|
||
|
||
|
||
|
||
SingleHopScenario( | ||
nodes=1 | ||
term=what is an ally | ||
persona=name='Hiring manager at gitlab' role_description='A hiring manager at gitlab trying to underestand hiring policies in gitlab' | ||
style=Web search like queries | ||
length=long) | ||
|
||
|
||
|
||
|
||
```python | ||
result = await query.generate_sample(scenario=scenarios[-1]) | ||
``` | ||
|
||
### Modify prompt to customize the query style | ||
Here I am replacing the default prompt with an instruction to generate only Yes/No questions. This is an optional step. | ||
|
||
|
||
```python | ||
instruction = """Generate a Yes/No query and answer based on the specified conditions (persona, term, style, length) | ||
and the provided context. Ensure the answer is entirely faithful to the context, using only the information | ||
directly from the provided context. | ||
### Instructions: | ||
1. **Generate a Yes/No Query**: Based on the context, persona, term, style, and length, create a question | ||
that aligns with the persona's perspective, incorporates the term, and can be answered with 'Yes' or 'No'. | ||
2. **Generate an Answer**: Using only the content from the provided context, provide a 'Yes' or 'No' answer | ||
to the query. Do not add any information not included in or inferable from the context.""" | ||
``` | ||
|
||
|
||
```python | ||
prompt = query.get_prompts()["generate_query_reference_prompt"] | ||
prompt.instruction = instruction | ||
query.set_prompts(**{"generate_query_reference_prompt": prompt}) | ||
``` | ||
|
||
|
||
```python | ||
result = await query.generate_sample(scenario=scenarios[-1]) | ||
``` | ||
|
||
|
||
```python | ||
result.user_input | ||
``` | ||
|
||
|
||
|
||
|
||
'Does the Diversity, Inclusion & Belonging (DIB) Team at GitLab have a structured approach to encourage collaborations among team members through various communication methods?' | ||
|
||
|
||
|
||
|
||
```python | ||
result.reference | ||
``` | ||
|
||
|
||
|
||
|
||
'Yes' | ||
|
||
|
||
|
||
|
||
```python | ||
|
||
``` |
Oops, something went wrong.