Merge pull request #15 from StacklokLabs/yaml-prompts

Implement YAML and CLI approach
StacklokLabs · Nov 24, 2024 · e52821b
2 parents cf8b2ae + f575d8a

Showing 30 changed files with 4,353 additions and 327 deletions.
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -22,19 +22,38 @@ jobs:
         python-version: ${{ matrix.python-version }}
         cache: 'pip'
 
-    - name: Install dependencies
+    - name: Install Poetry
       run: |
         python -m pip install --upgrade pip
-        pip install -r requirements.txt
-        pip install -r requirements-dev.txt
+        curl -sSL https://install.python-poetry.org | python3 -
 
-    - name: Run tests with pytest
-      run: |
-        make test
-        
-    - name: Run style checks
+    - name: Configure Poetry
       run: |
-        pip install ruff
-        ruff check .
-        ruff format --check .
+        poetry config virtualenvs.create true
+        poetry config virtualenvs.in-project true
+
+    - name: Cache Poetry virtualenv
+      uses: actions/cache@v3
+      id: cache
+      with:
+        path: ./.venv
+        key: venv-${{ runner.os }}-${{ matrix.python-version }}-${{ hashFiles('**/poetry.lock') }}
+
+    - name: Install dependencies
+      if: steps.cache.outputs.cache-hit != 'true'
+      run: poetry install --with dev
+
+    - name: Run code formatting
+      run: poetry run make format
+
+    - name: Run linting
+      run: poetry run make lint
+
+    - name: Run tests
+      run: poetry run make test
+
+    - name: Run security checks
+      run: poetry run make security
 
+    - name: Run build
+      run: poetry run make build
diff --git a/Makefile b/Makefile
@@ -1,12 +1,30 @@
-# Makefile
-.PHONY: test test-all lint
+.PHONY: clean install format lint test security build all
 
-test:
-	pytest -v --cov=promptwright --cov-report=xml
+clean:
+	rm -rf build/
+	rm -rf dist/
+	rm -rf *.egg-info
+	rm -f .coverage
+	find . -type d -name '__pycache__' -exec rm -rf {} +
+	find . -type f -name '*.pyc' -delete
+
+install:
+	poetry install --with dev
 
-test-all:
-	pytest -v
+format:
+	poetry run black .
+	poetry run ruff check --fix .
 
 lint:
-	ruff check .
-	ruff format --check .
+	poetry run ruff check .
+
+test:
+	poetry run pytest
+
+security:
+	poetry run bandit -r promptwright/
+
+build: clean test
+	poetry build
+
+all: clean install format lint test security build
diff --git a/README.md b/README.md
@@ -5,80 +5,212 @@
 
 ![promptwright-cover](https://github.com/user-attachments/assets/5e345bda-df66-474b-90e7-f488d8f89032)
 
-Promptwright is a Python library from [Stacklok](https://stacklok.com) designed for generating large synthetic 
-datasets using a local LLM. The library offers a flexible and easy-to-use set of interfaces, enabling users
-the ability to generate prompt led synthetic datasets.
+Promptwright is a Python library from [Stacklok](https://stacklok.com) designed
+for generating large synthetic datasets using a local LLM. The library offers
+a flexible and easy-to-use set of interfaces that let users generate
+prompt-led synthetic datasets.
 
 Promptwright was inspired by the [redotvideo/pluto](https://github.com/redotvideo/pluto),
-in fact it started as fork, but ended up largley being a re-write, to allow dataset generation
-against a local LLM model.
+in fact it started as a fork, but ended up largely being a rewrite, to allow
+dataset generation against a local LLM model.
 
 The library interfaces with Ollama, making it easy to just pull a model and run
-Promptwright.
+Promptwright, but other providers could be used, as long as they provide a
+compatible API (happy to help expand the library to support other providers,
+just open an issue).
 
 ## Features
 
 - **Local LLM Client Integration**: Interact with Ollama based models
 - **Configurable Instructions and Prompts**: Define custom instructions and system prompts
-- **Push to Hugging Face**: Push the generated dataset to Hugging Face Hub.
+- **YAML Configuration**: Define your generation tasks using YAML configuration files
+- **Command Line Interface**: Run generation tasks directly from the command line
+- **Push to Hugging Face**: Push the generated dataset to Hugging Face Hub with automatic dataset cards and tags
 
 ## Getting Started
 
 ### Prerequisites
 
 - Python 3.11+
-- `promptwright` library installed
+- Poetry (for dependency management)
 - Ollama CLI installed and running (see [Ollama Installation](https://ollama.com/)
 - A Model pulled via Ollama (see [Model Compatibility](#model-compatibility))
+- (Optional) Hugging Face account and API token for dataset upload
 
 ### Installation
 
 To install the prerequisites, you can use the following commands:
 
 ```bash
-pip install promptwright
+# Install Poetry if you haven't already
+curl -sSL https://install.python-poetry.org | python3 -
+
+# Install promptwright and its dependencies
+git clone https://github.com/StacklokLabs/promptwright.git
+cd promptwright
+poetry install
+
+# Start Ollama service
 ollama serve
+
+# Pull your desired model
 ollama pull {model_name} # whichever model you want to use
 ```
 
-### Example Usage
-
-There are a few examples in the `examples` directory that demonstrate how to use
-the library to generate different topic based datasets.
-
-### Running an Example
-
-To run an example:
-
-1. Ensure you have started Ollama by running `ollama serve`.
-2. Verify that the required model is downloaded (e.g. `llama3.2:latest`).
-4. Set the `model_name` in the chosen example file to the model you have downloaded.
-
-  ```python
-
-      tree = TopicTree(
-        args=TopicTreeArguments(
-            root_prompt="Creative Writing Prompts",
-            model_system_prompt=system_prompt,
-            tree_degree=5, # Increase degree for more prompts
-            tree_depth=4, # Increase depth for more prompts
-            temperature=0.9, # Higher temperature for more creative variations
-            model_name="ollama/llama3" # Set the model name here
-        )
-      )
-      engine = DataEngine(
-        args=EngineArguments(
-            instructions="Generate creative writing prompts and example responses.",
-            system_prompt="You are a creative writing instructor providing writing prompts and example responses.",
-            model_name="ollama/llama3",
-            temperature=0.9,
-            max_retries=2,
-  ```
-5. Run your chosen example file:
-   ```bash
-   python example/creative_writing.py
-   ```
-6. The generated dataset will be saved to a JSONL file to whatever is set within  `dataset.save()`.
+### Usage
+
+Promptwright offers two ways to define and run your generation tasks:
+
+#### 1. Using YAML Configuration (Recommended)
+
+Create a YAML file defining your generation task:
+
+```yaml
+system_prompt: "You are a helpful assistant. You provide clear and concise answers to user questions."
+
+topic_tree:
+  args:
+    root_prompt: "Capital Cities of the World."
+    model_system_prompt: "<system_prompt_placeholder>"
+    tree_degree: 3
+    tree_depth: 2
+    temperature: 0.7
+    model_name: "ollama/mistral:latest"
+  save_as: "basic_prompt_topictree.jsonl"
+
+data_engine:
+  args:
+    instructions: "Please provide training examples with questions about capital cities."
+    system_prompt: "<system_prompt_placeholder>"
+    model_name: "ollama/mistral:latest"
+    temperature: 0.9
+    max_retries: 2
+
+dataset:
+  creation:
+    num_steps: 5
+    batch_size: 1
+    model_name: "ollama/mistral:latest"
+  save_as: "basic_prompt_dataset.jsonl"
+
+# Optional Hugging Face Hub configuration
+huggingface:
+  # Repository in format "username/dataset-name"
+  repository: "your-username/your-dataset-name"
+  # Token can also be provided via HF_TOKEN environment variable or --hf-token CLI option
+  token: "your-hf-token"
+  # Additional tags for the dataset (optional)
+  # "promptwright" and "synthetic" tags are added automatically
+  tags:
+    - "promptwright-generated-dataset"
+    - "geography"
+```
+
+Run using the CLI:
+
+```bash
+promptwright start config.yaml
+```
+
+The CLI supports various options to override configuration values:
+
+```bash
+promptwright start config.yaml \
+  --topic-tree-save-as output_tree.jsonl \
+  --dataset-save-as output_dataset.jsonl \
+  --model-name ollama/llama3 \
+  --temperature 0.8 \
+  --tree-degree 4 \
+  --tree-depth 3 \
+  --num-steps 10 \
+  --batch-size 2 \
+  --hf-repo username/dataset-name \
+  --hf-token your-token \
+  --hf-tags tag1 --hf-tags tag2
+```
+
+#### Hugging Face Hub Integration
+
+Promptwright supports automatic dataset upload to the Hugging Face Hub with the following features:
+
+1. **Dataset Upload**: Upload your generated dataset directly to Hugging Face Hub
+2. **Dataset Cards**: Automatically creates and updates dataset cards
+3. **Automatic Tags**: Adds "promptwright" and "synthetic" tags automatically
+4. **Custom Tags**: Support for additional custom tags
+5. **Flexible Authentication**: HF token can be provided via:
+   - CLI option: `--hf-token your-token`
+   - Environment variable: `export HF_TOKEN=your-token`
+   - YAML configuration: `huggingface.token`
+
+Example using environment variable:
+```bash
+export HF_TOKEN=your-token
+promptwright start config.yaml --hf-repo username/dataset-name
+```
+
+Or pass it in as a CLI option:
+```bash
+promptwright start config.yaml --hf-repo username/dataset-name --hf-token your-token
+```
+
+#### 2. Using Python Code
+
+You can also create generation tasks programmatically using Python code. There
+are several examples in the `examples` directory that demonstrate this approach.
+
+Example Python usage:
+
+```python
+from promptwright import DataEngine, EngineArguments, TopicTree, TopicTreeArguments
+
+tree = TopicTree(
+    args=TopicTreeArguments(
+        root_prompt="Creative Writing Prompts",
+        model_system_prompt=system_prompt,
+        tree_degree=5,
+        tree_depth=4,
+        temperature=0.9,
+        model_name="ollama/llama3"
+    )
+)
+
+engine = DataEngine(
+    args=EngineArguments(
+        instructions="Generate creative writing prompts and example responses.",
+        system_prompt="You are a creative writing instructor providing writing prompts and example responses.",
+        model_name="ollama/llama3",
+        temperature=0.9,
+        max_retries=2,
+    )
+)
+```
+
+### Development
+
+The project uses Poetry for dependency management. Here are some common development commands:
+
+```bash
+# Install dependencies including development dependencies
+make install
+
+# Format code
+make format
+
+# Run linting
+make lint
+
+# Run tests
+make test
+
+# Run security checks
+make security
+
+# Build the package
+make build
+
+# Run all checks and build
+make all
+```
 
 ### Prompt Output Examples
 
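Note on the YAML example added in the hunk above: it sets a top-level `system_prompt` and uses the string `<system_prompt_placeholder>` under `topic_tree.args.model_system_prompt` and `data_engine.args.system_prompt`. The diff does not show how the placeholder is resolved, so the following is only a minimal sketch, assuming the placeholder is swapped for the top-level value once the YAML has been parsed into a dict. The helper name `fill_system_prompt` is hypothetical and is not part of promptwright.

```python
# Hypothetical helper: not promptwright's actual implementation.
PLACEHOLDER = "<system_prompt_placeholder>"


def fill_system_prompt(config: dict) -> dict:
    """Return a new config with every placeholder replaced by config['system_prompt']."""
    system_prompt = config.get("system_prompt", "")

    def walk(node):
        # Rebuild dicts and lists recursively so the input is left untouched.
        if isinstance(node, dict):
            return {key: walk(value) for key, value in node.items()}
        if isinstance(node, list):
            return [walk(value) for value in node]
        if node == PLACEHOLDER:
            return system_prompt
        return node

    return walk(config)


# A trimmed-down version of the README config, already parsed into a dict.
config = {
    "system_prompt": "You are a helpful assistant.",
    "topic_tree": {
        "args": {
            "root_prompt": "Capital Cities of the World.",
            "model_system_prompt": PLACEHOLDER,
        }
    },
    "data_engine": {"args": {"system_prompt": PLACEHOLDER, "max_retries": 2}},
}

resolved = fill_system_prompt(config)
print(resolved["topic_tree"]["args"]["model_system_prompt"])
```

Building a new dict rather than mutating the input keeps the parsed file contents available for later inspection, for example when CLI overrides are applied afterwards.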
@@ -108,7 +240,7 @@ following models so far:
 
 - **Mistral**
 - **LLaMA3**
---**Qwen2.5**
+- **Qwen2.5**
 
 ## Unpredictable Behavior
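Note on the Hugging Face section of the README diff: it lists three sources for the token (the `--hf-token` CLI option, the `HF_TOKEN` environment variable, and `huggingface.token` in the YAML file) but does not state which one wins when several are set. The sketch below assumes the order CLI option, then environment variable, then YAML. That order and the function name `resolve_hf_token` are assumptions for illustration, not confirmed by this commit.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(
    cli_token: Optional[str],
    config: Mapping,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Pick a token from CLI, environment, or YAML config (assumed precedence)."""
    # Allow a custom mapping to be passed in so the lookup is easy to test.
    env = os.environ if env is None else env
    hf_section = config.get("huggingface") or {}
    # Empty strings are treated as "not provided" and fall through.
    return cli_token or env.get("HF_TOKEN") or hf_section.get("token")


yaml_config = {"huggingface": {"repository": "your-username/your-dataset-name",
                               "token": "yaml-token"}}

# The CLI option is assumed to override everything else.
print(resolve_hf_token("cli-token", yaml_config, env={"HF_TOKEN": "env-token"}))
# With no CLI option, the environment variable is tried next.
print(resolve_hf_token(None, yaml_config, env={"HF_TOKEN": "env-token"}))
# With neither, the value from the YAML file is used.
print(resolve_hf_token(None, yaml_config, env={}))
```

Keeping the token out of the YAML file and supplying it through `HF_TOKEN` avoids committing a secret alongside the generation config.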