# Fix for the topic tree issues #11

**Merged** Nov 8, 2024 (6 commits)
**README.md** (39 additions, 40 deletions)

Promptwright was inspired by [redotvideo/pluto](https://github.com/redotvideo/pluto);
in fact, it started as a fork, but ended up largely being a rewrite that allows
dataset generation against a local LLM.

The library interfaces with Ollama, making it easy to just pull a model and run
Promptwright.

## Features

To run an example:
4. Set the `model_name` in the chosen example file to the model you have downloaded. The configuration below is taken from the example; a possible continuation is sketched after this list.

```python
tree = TopicTree(
    args=TopicTreeArguments(
        root_prompt="Creative Writing Prompts",
        model_system_prompt=system_prompt,
        tree_degree=5,  # Increase degree for more prompts
        tree_depth=4,  # Increase depth for more prompts
        temperature=0.9,  # Higher temperature for more creative variations
        model_name="ollama/llama3",  # Set the model name here
    )
)
engine = DataEngine(
    args=EngineArguments(
        instructions="Generate creative writing prompts and example responses.",
        system_prompt="You are a creative writing instructor providing writing prompts and example responses.",
        model_name="ollama/llama3",
        temperature=0.9,
        max_retries=2,
    )
)
```

> Contributor review comment on `model_name`: worth having as a constant or an env var that can be configured.
5. Run your chosen example file.
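
The configuration in step 4 only constructs the topic tree and the data engine. A complete example file would then build the tree and generate the dataset; the continuation below is a sketch only, and the `build_tree`, `create_data`, and `save` calls, along with their arguments, are assumptions about the API rather than confirmed signatures:

```python
# Hypothetical continuation of the step 4 configuration; method names
# and arguments are assumptions, not a confirmed Promptwright API.
tree.build_tree()              # expand the topic tree using the configured model
tree.save("topic_tree.jsonl")  # persist the generated topics

dataset = engine.create_data(
    num_steps=5,      # number of generation batches to run
    batch_size=1,     # prompts generated per batch
    topic_tree=tree,  # seed generation from the topic tree built above
)
dataset.save("creative_writing_prompts.jsonl")
```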

### Library Overview

#### Classes

- **Dataset**: A class for managing generated datasets.
- **LocalDataEngine**: The main engine responsible for interacting with the LLM client and generating datasets.
- **LocalEngineArguments**: A configuration class that defines the instructions, system prompt, model name, temperature, retries, and prompt templates used for generating data.
- **OllamaClient**: A client class for interacting with the Ollama API.
- **HFUploader**: A utility class for uploading datasets to Hugging Face (pass in the path to the dataset and token); a hypothetical usage sketch follows this list.
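
For the last item, usage might look roughly like the following; the constructor arguments and the `upload()` method are hypothetical, inferred only from the description above (a dataset path and a token), not from the actual class:

```python
# Hypothetical HFUploader usage; the argument names and the upload()
# method are assumptions based on the class description above.
uploader = HFUploader(
    dataset_path="creative_writing_prompts.jsonl",
    token="hf_...",  # your Hugging Face access token
)
uploader.upload()
```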

### Troubleshooting

If you encounter any errors while running the script, here are a few common troubleshooting steps:

1. **Restart Ollama**:
```bash
killall ollama && ollama serve
```

2. **Verify Model Installation**:
```bash
ollama pull {model_name}
```

3. **Check Ollama Logs**:
   Inspect the logs for any error messages that might provide more context on
   what went wrong; these can be found in the `~/.ollama/logs` directory.

## Model Compatibility

The library should work with most LLM models. It has been tested with the
following models so far:

- **Mistral**
- **LLaMA3**
- **Qwen2.5**

If you test any more models, please make a pull request to update this list!

## Unpredictable Behavior

The library is designed to generate synthetic data based on the prompts and
instructions provided. The quality of the generated data depends on the quality
of the prompts and the model used; the library does not guarantee the quality
of the generated data.

Large language models can sometimes generate unpredictable or inappropriate
content, and the authors of this library are not responsible for the content
generated by the models. We recommend reviewing the generated data before using
it in any production environment.

Large language models can also fail to follow the JSON formatting required by
the prompt and may generate invalid JSON. This is a known issue with the
underlying models, not the library. We handle these errors by retrying the
generation process and filtering out invalid JSON. The failure rate is low, but
it can happen; each failure is reported in a final summary.
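
This retry-and-filter behaviour happens inside the library, but as a rough illustration of the general pattern (not Promptwright's actual implementation; `generate_valid_json` and its signature are invented for this sketch):

```python
import json


def generate_valid_json(generate, max_retries=2):
    """Call `generate` until it returns parseable JSON, or give up.

    `generate` is any zero-argument callable returning a model response
    as a string. Returns (parsed_object_or_None, failure_count) so that
    invalid results can be filtered out and reported in a final summary.
    """
    failures = 0
    for _ in range(max_retries + 1):
        raw = generate()
        try:
            return json.loads(raw), failures
        except json.JSONDecodeError:
            failures += 1  # invalid JSON: count it and retry
    return None, failures  # caller filters out None results
```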

## Contributing

If something here could be improved, please open an issue or submit a pull request.
