Merge pull request #15 from StacklokLabs/yaml-prompts
Implement YAML and CLI approach
lukehinds authored Nov 24, 2024
2 parents cf8b2ae + f575d8a commit e52821b
Showing 30 changed files with 4,353 additions and 327 deletions.
41 changes: 30 additions & 11 deletions .github/workflows/test.yml
@@ -22,19 +22,38 @@ jobs:
python-version: ${{ matrix.python-version }}
cache: 'pip'

- name: Install dependencies
- name: Install Poetry
run: |
python -m pip install --upgrade pip
pip install -r requirements.txt
pip install -r requirements-dev.txt
curl -sSL https://install.python-poetry.org | python3 -
- name: Run tests with pytest
run: |
make test
- name: Run style checks
- name: Configure Poetry
run: |
pip install ruff
ruff check .
ruff format --check .
poetry config virtualenvs.create true
poetry config virtualenvs.in-project true
- name: Cache Poetry virtualenv
uses: actions/cache@v3
id: cache
with:
path: ./.venv
key: venv-${{ runner.os }}-${{ matrix.python-version }}-${{ hashFiles('**/poetry.lock') }}

- name: Install dependencies
if: steps.cache.outputs.cache-hit != 'true'
run: poetry install --with dev

- name: Run code formatting
run: poetry run make format

- name: Run linting
run: poetry run make lint

- name: Run tests
run: poetry run make test

- name: Run security checks
run: poetry run make security

- name: Run build
run: poetry run make build
34 changes: 26 additions & 8 deletions Makefile
@@ -1,12 +1,30 @@
# Makefile
.PHONY: test test-all lint
.PHONY: clean install format lint test security build all

test:
pytest -v --cov=promptwright --cov-report=xml
clean:
rm -rf build/
rm -rf dist/
rm -rf *.egg-info
rm -f .coverage
find . -type d -name '__pycache__' -exec rm -rf {} +
find . -type f -name '*.pyc' -delete

install:
poetry install --with dev

test-all:
pytest -v
format:
poetry run black .
poetry run ruff check --fix .

lint:
ruff check .
ruff format --check .
poetry run ruff check .

test:
poetry run pytest

security:
poetry run bandit -r promptwright/

build: clean test
poetry build

all: clean install format lint test security build
228 changes: 180 additions & 48 deletions README.md
@@ -5,80 +5,212 @@

![promptwright-cover](https://github.com/user-attachments/assets/5e345bda-df66-474b-90e7-f488d8f89032)

Promptwright is a Python library from [Stacklok](https://stacklok.com) designed for generating large synthetic
datasets using a local LLM. The library offers a flexible and easy-to-use set of interfaces, enabling users
the ability to generate prompt led synthetic datasets.
Promptwright is a Python library from [Stacklok](https://stacklok.com) designed
for generating large synthetic datasets using a local LLM. The library offers
a flexible and easy-to-use set of interfaces that let users generate
prompt-led synthetic datasets.

Promptwright was inspired by the [redotvideo/pluto](https://github.com/redotvideo/pluto),
in fact it started as fork, but ended up largley being a re-write, to allow dataset generation
against a local LLM model.
in fact it started as a fork, but ended up largely being a rewrite, to allow
dataset generation against a local LLM model.

The library interfaces with Ollama, making it easy to just pull a model and run
Promptwright.
Promptwright, but other providers could be used, as long as they provide a
compatible API (happy to help expand the library to support other providers,
just open an issue).

## Features

- **Local LLM Client Integration**: Interact with Ollama based models
- **Configurable Instructions and Prompts**: Define custom instructions and system prompts
- **Push to Hugging Face**: Push the generated dataset to Hugging Face Hub.
- **YAML Configuration**: Define your generation tasks using YAML configuration files
- **Command Line Interface**: Run generation tasks directly from the command line
- **Push to Hugging Face**: Push the generated dataset to Hugging Face Hub with automatic dataset cards and tags

## Getting Started

### Prerequisites

- Python 3.11+
- `promptwright` library installed
- Poetry (for dependency management)
- Ollama CLI installed and running (see [Ollama Installation](https://ollama.com/))
- A Model pulled via Ollama (see [Model Compatibility](#model-compatibility))
- (Optional) Hugging Face account and API token for dataset upload

### Installation

To install the prerequisites, you can use the following commands:

```bash
pip install promptwright
# Install Poetry if you haven't already
curl -sSL https://install.python-poetry.org | python3 -

# Install promptwright and its dependencies
git clone https://github.com/StacklokLabs/promptwright.git
cd promptwright
poetry install

# Start Ollama service
ollama serve

# Pull your desired model
ollama pull {model_name} # whichever model you want to use
```

### Example Usage

There are a few examples in the `examples` directory that demonstrate how to use
the library to generate different topic based datasets.

### Running an Example

To run an example:

1. Ensure you have started Ollama by running `ollama serve`.
2. Verify that the required model is downloaded (e.g. `llama3.2:latest`).
4. Set the `model_name` in the chosen example file to the model you have downloaded.

```python

tree = TopicTree(
args=TopicTreeArguments(
root_prompt="Creative Writing Prompts",
model_system_prompt=system_prompt,
tree_degree=5, # Increase degree for more prompts
tree_depth=4, # Increase depth for more prompts
temperature=0.9, # Higher temperature for more creative variations
model_name="ollama/llama3" # Set the model name here
)
)
engine = DataEngine(
args=EngineArguments(
instructions="Generate creative writing prompts and example responses.",
system_prompt="You are a creative writing instructor providing writing prompts and example responses.",
model_name="ollama/llama3",
temperature=0.9,
max_retries=2,
```
5. Run your chosen example file:
```bash
python example/creative_writing.py
```
6. The generated dataset will be saved to a JSONL file to whatever is set within `dataset.save()`.
### Usage

Promptwright offers two ways to define and run your generation tasks:

#### 1. Using YAML Configuration (Recommended)

Create a YAML file defining your generation task:

```yaml
system_prompt: "You are a helpful assistant. You provide clear and concise answers to user questions."

topic_tree:
  args:
    root_prompt: "Capital Cities of the World."
    model_system_prompt: "<system_prompt_placeholder>"
    tree_degree: 3
    tree_depth: 2
    temperature: 0.7
    model_name: "ollama/mistral:latest"
  save_as: "basic_prompt_topictree.jsonl"

data_engine:
  args:
    instructions: "Please provide training examples with questions about capital cities."
    system_prompt: "<system_prompt_placeholder>"
    model_name: "ollama/mistral:latest"
    temperature: 0.9
    max_retries: 2

dataset:
  creation:
    num_steps: 5
    batch_size: 1
    model_name: "ollama/mistral:latest"
  save_as: "basic_prompt_dataset.jsonl"

# Optional Hugging Face Hub configuration
huggingface:
  # Repository in format "username/dataset-name"
  repository: "your-username/your-dataset-name"
  # Token can also be provided via HF_TOKEN environment variable or --hf-token CLI option
  token: "your-hf-token"
  # Additional tags for the dataset (optional)
  # "promptwright" and "synthetic" tags are added automatically
  tags:
    - "promptwright-generated-dataset"
    - "geography"
```
Run using the CLI:
```bash
promptwright start config.yaml
```
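
The `<system_prompt_placeholder>` strings in the YAML configuration above suggest that the top-level `system_prompt` is substituted into each section before generation runs. The source does not show how this happens, so the following is only a minimal sketch of one way such a substitution could work on an already-parsed config dictionary. The `fill_placeholders` helper is hypothetical and is not part of promptwright.

```python
from typing import Any

PLACEHOLDER = "<system_prompt_placeholder>"


def fill_placeholders(node: Any, system_prompt: str) -> Any:
    """Return a copy of `node` with every placeholder string replaced."""
    if isinstance(node, dict):
        return {key: fill_placeholders(value, system_prompt) for key, value in node.items()}
    if isinstance(node, list):
        return [fill_placeholders(item, system_prompt) for item in node]
    if node == PLACEHOLDER:
        return system_prompt
    return node  # numbers and other strings pass through unchanged


# A trimmed-down stand-in for the parsed YAML file (illustrative values only).
config = {
    "system_prompt": "You are a helpful assistant.",
    "topic_tree": {"args": {"model_system_prompt": PLACEHOLDER, "tree_degree": 3}},
    "data_engine": {"args": {"system_prompt": PLACEHOLDER, "max_retries": 2}},
}

resolved = fill_placeholders(config, config["system_prompt"])
print(resolved["topic_tree"]["args"]["model_system_prompt"])
```

Returning a new structure rather than mutating `config` keeps the raw file contents available for debugging.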

The CLI supports various options to override configuration values:

```bash
promptwright start config.yaml \
--topic-tree-save-as output_tree.jsonl \
--dataset-save-as output_dataset.jsonl \
--model-name ollama/llama3 \
--temperature 0.8 \
--tree-degree 4 \
--tree-depth 3 \
--num-steps 10 \
--batch-size 2 \
--hf-repo username/dataset-name \
--hf-token your-token \
--hf-tags tag1 --hf-tags tag2
```
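
One way to picture these overrides: each option that is passed replaces a single value from the YAML file, and options that are omitted leave the file's value untouched. The sketch below assumes a mapping from option names to config paths (for example `--tree-degree` to `topic_tree.args.tree_degree`). Both that mapping and the `apply_overrides` helper are illustrative guesses, not promptwright's actual implementation.

```python
import copy

# Assumed mapping from a CLI option to a path inside the parsed YAML config.
OVERRIDE_PATHS = {
    "tree_degree": ("topic_tree", "args", "tree_degree"),
    "tree_depth": ("topic_tree", "args", "tree_depth"),
    "num_steps": ("dataset", "creation", "num_steps"),
    "batch_size": ("dataset", "creation", "batch_size"),
    "dataset_save_as": ("dataset", "save_as"),
}


def apply_overrides(config: dict, overrides: dict) -> dict:
    """Return a copy of `config` with every non-None override written in."""
    merged = copy.deepcopy(config)
    for option, value in overrides.items():
        if value is None:  # option was not given on the command line
            continue
        *parents, leaf = OVERRIDE_PATHS[option]
        target = merged
        for key in parents:
            target = target.setdefault(key, {})
        target[leaf] = value
    return merged


config = {
    "topic_tree": {"args": {"tree_degree": 3, "tree_depth": 2}},
    "dataset": {
        "creation": {"num_steps": 5, "batch_size": 1},
        "save_as": "basic_prompt_dataset.jsonl",
    },
}

merged = apply_overrides(config, {"tree_degree": 4, "num_steps": 10, "batch_size": None})
print(merged["topic_tree"]["args"]["tree_degree"], merged["dataset"]["creation"]["num_steps"])
```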

#### Hugging Face Hub Integration

Promptwright supports automatic dataset upload to the Hugging Face Hub with the following features:

1. **Dataset Upload**: Upload your generated dataset directly to Hugging Face Hub
2. **Dataset Cards**: Automatically creates and updates dataset cards
3. **Automatic Tags**: Adds "promptwright" and "synthetic" tags automatically
4. **Custom Tags**: Support for additional custom tags
5. **Flexible Authentication**: HF token can be provided via:
- CLI option: `--hf-token your-token`
- Environment variable: `export HF_TOKEN=your-token`
- YAML configuration: `huggingface.token`

Example using environment variable:
```bash
export HF_TOKEN=your-token
promptwright start config.yaml --hf-repo username/dataset-name
```

Or pass it in as a CLI option:
```bash
promptwright start config.yaml --hf-repo username/dataset-name --hf-token your-token
```
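
With three possible token sources, some precedence order has to apply when more than one is set. The README does not state that order, so the sketch below simply assumes the CLI option wins over the environment variable, which wins over the YAML value. The `resolve_hf_token` function is a hypothetical illustration, not promptwright's API.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(
    cli_token: Optional[str],
    yaml_config: Mapping[str, dict],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Pick a token, assuming CLI beats HF_TOKEN beats the YAML file."""
    if cli_token:
        return cli_token
    if env.get("HF_TOKEN"):
        return env["HF_TOKEN"]
    huggingface = yaml_config.get("huggingface") or {}
    return huggingface.get("token")


yaml_config = {"huggingface": {"token": "token-from-yaml"}}

# Explicit `env` mappings keep the example independent of the real environment.
from_cli = resolve_hf_token("token-from-cli", yaml_config, env={"HF_TOKEN": "token-from-env"})
from_env = resolve_hf_token(None, yaml_config, env={"HF_TOKEN": "token-from-env"})
from_yaml = resolve_hf_token(None, yaml_config, env={})
print(from_cli, from_env, from_yaml)
```

Whatever the real order is, the environment variable keeps the token out of a config file that may end up in version control.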

#### 2. Using Python Code

You can also create generation tasks programmatically using Python code. There
are several examples in the `examples` directory that demonstrate this approach.

Example Python usage:

```python
from promptwright import DataEngine, EngineArguments, TopicTree, TopicTreeArguments

# Assumed value: the original snippet uses `system_prompt` without defining it.
system_prompt = "You are a creative writing instructor providing writing prompts and example responses."

tree = TopicTree(
    args=TopicTreeArguments(
        root_prompt="Creative Writing Prompts",
        model_system_prompt=system_prompt,
        tree_degree=5,  # Increase degree for more prompts
        tree_depth=4,  # Increase depth for more prompts
        temperature=0.9,  # Higher temperature for more creative variations
        model_name="ollama/llama3",  # Set the model name here
    )
)

engine = DataEngine(
    args=EngineArguments(
        instructions="Generate creative writing prompts and example responses.",
        system_prompt=system_prompt,
        model_name="ollama/llama3",
        temperature=0.9,
        max_retries=2,
    )
)
```
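
To get a feel for how `tree_degree` and `tree_depth` scale, a rough estimate helps. The sketch below assumes a full tree in which every node has exactly `tree_degree` subtopics and `tree_depth` counts the levels below the root. Promptwright's exact counting may differ, and a model can return fewer subtopics than requested, so treat the result as an upper-bound estimate. The helper name `leaf_topic_count` is made up for this example.

```python
def leaf_topic_count(tree_degree: int, tree_depth: int) -> int:
    """Number of leaf topics in a full tree: degree multiplied by itself depth times."""
    if tree_degree < 1 or tree_depth < 0:
        raise ValueError("tree_degree must be >= 1 and tree_depth must be >= 0")
    return tree_degree ** tree_depth


# Values from the YAML example (degree 3, depth 2) and the Python example (degree 5, depth 4).
yaml_leaves = leaf_topic_count(3, 2)
python_leaves = leaf_topic_count(5, 4)
print(yaml_leaves, python_leaves)  # prints "9 625"
```

Because growth is exponential in depth, raising `tree_depth` by one multiplies the estimate by `tree_degree`, which is worth keeping in mind when each topic costs a call to a local model.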

### Development

The project uses Poetry for dependency management. Here are some common development commands:

```bash
# Install dependencies including development dependencies
make install

# Format code
make format

# Run linting
make lint

# Run tests
make test

# Run security checks
make security

# Build the package
make build

# Run all checks and build
make all
```

### Prompt Output Examples

@@ -108,7 +240,7 @@ following models so far:

- **Mistral**
- **LLaMA3**
--**Qwen2.5**
- **Qwen2.5**

## Unpredictable Behavior
