Commit

Merge remote-tracking branch 'origin/main' into whoisjones/improved_naming_for_sampling

# Conflicts:
#	tests/test_dataset_generator.py
whoisjones committed Aug 4, 2023
2 parents 18dcd6f + 70c4297 commit 03ebefd
Showing 30 changed files with 77 additions and 74 deletions.
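
Many of the hunks below differ only in that the package name `ai_dataset_generator` becomes `fabricator`. A minimal sketch of how such a mechanical rename could be applied to source text, assuming a whole-identifier regex substitution is sufficient. `rename_package` is a hypothetical helper written for illustration; it is not part of this commit or of the fabricator package.

```python
import re

OLD_NAME = "ai_dataset_generator"  # package name before this commit
NEW_NAME = "fabricator"            # package name after this commit

# Match the old name only as a whole identifier, so that longer names
# that merely contain it (e.g. "my_ai_dataset_generator") stay untouched.
_PATTERN = re.compile(rf"\b{re.escape(OLD_NAME)}\b")


def rename_package(source: str) -> str:
    """Return `source` with every whole-word use of the old package name replaced."""
    return _PATTERN.sub(NEW_NAME, source)


if __name__ == "__main__":
    line = "from ai_dataset_generator.prompts import BasePrompt"
    print(rename_package(line))  # from fabricator.prompts import BasePrompt
```

A text substitution like this also rewrites docstrings and prose, which matches what the diff shows for the `DatasetGenerator` docstring, but it does not rename directories or update install instructions; those are separate changes.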
27 changes: 14 additions & 13 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,6 @@
<h1 align="center">Dataset Generator</h1>
![Fabricator Logo](resources/logo_fabricator.drawio_dark.png#gh-dark-mode-only)
![Fabricator Logo](resources/logo_fabricator.drawio_white.png#gh-light-mode-only)

<p align="center">A flexible open-source framework to generate datasets with large language models.</p>
<p align="center">
<img alt="version" src="https://img.shields.io/badge/version-0.1-green">
@@ -8,8 +10,8 @@
<div align="center">
<hr>

[Installation](#installation) - [Basic Concepts](#basic-concepts) - [Examples](#examples) - [Tutorials](tutorials/TUTORIAL-1_OVERVIEW.md) -
Paper - [Citation](#citation)
[Installation](#installation) | [Basic Concepts](#basic-concepts) | [Examples](#examples) | [Tutorials](tutorials/TUTORIAL-1_OVERVIEW.md) |
Paper | [Citation](#citation)

<hr>
</div>
@@ -30,10 +32,10 @@ prompt customization, integration and sampling of fewshot examples or annotation
## Installation
Using conda:
```
git clone [email protected]:flairNLP/ai-dataset-generator.git
cd ai-dataset-generator
conda create -y -n aidatasetgenerator python=3.10
conda activate aidatasetgenerator
git clone [email protected]:flairNLP/fabricator.git
cd fabricator
conda create -y -n fabricator python=3.10
conda activate fabricator
pip install -e .
```

@@ -45,10 +47,9 @@ we need four basic modules: a dataset, a prompt, a language model and a generato
unlabeled datasets and store the generated or annotated datasets with their `Dataset` class. Once
created, you can share the dataset with others via the hub or use it for your model training.
- <b>Prompt</b>: A prompt is the instruction made to the language model. It can be a simple sentence or a more complex
template with placeholders. We utilize [langchain](https://github.com/langchain-ai/langchain) `PromptTemplate` classes
and provide an easy interface for custom dataset generation prompts in which you can specify label options
for the LLM to choose from, provide fewshot examples to support the prompt with or annotate an unlabeled dataset
in a specific way.
template with placeholders. We provide an easy interface for custom dataset generation prompts in which you can specify
label options for the LLM to choose from, provide fewshot examples to support the prompt with or annotate an unlabeled
dataset in a specific way.
- <b>LLM</b>: We use [deepset's haystack library](https://github.com/deepset-ai/haystack) as our LLM interface. deepset
supports a wide range of LLMs including OpenAI, all models from the HuggingFace model hub and many more.
- <b>Generator</b>: The generator is the core of this framework. It takes a dataset, a prompt and a LLM and generates a
@@ -64,8 +65,8 @@ as that:
```python
import os
from haystack.nodes import PromptNode
from ai_dataset_generator import DatasetGenerator
from ai_dataset_generator.prompts import BasePrompt
from fabricator import DatasetGenerator
from fabricator.prompts import BasePrompt

prompt = BasePrompt(
task_description="Generate a short movie review.",
4 changes: 2 additions & 2 deletions paper_experiments/conll_annotate_dataset.py
@@ -1,8 +1,8 @@
import os
from datasets import load_dataset
from haystack.nodes import PromptNode
from ai_dataset_generator import DatasetGenerator, BasePrompt
from ai_dataset_generator.dataset_transformations.token_classification import convert_token_labels_to_spans
from fabricator import DatasetGenerator, BasePrompt
from fabricator.dataset_transformations.token_classification import convert_token_labels_to_spans


def run():
2 changes: 1 addition & 1 deletion paper_experiments/conll_gpt_train_model.py
@@ -1,6 +1,6 @@
from argparse import ArgumentParser
from datasets import load_dataset
from ai_dataset_generator import convert_spans_to_token_labels
from fabricator import convert_spans_to_token_labels
from seqeval.metrics import accuracy_score, f1_score


4 changes: 2 additions & 2 deletions paper_experiments/mrpc_annotate_dataset.py
@@ -1,8 +1,8 @@
import os
from datasets import load_dataset, concatenate_datasets
from haystack.nodes import PromptNode
from ai_dataset_generator import DatasetGenerator, BasePrompt
from ai_dataset_generator.dataset_transformations.text_classification import convert_label_ids_to_texts
from fabricator import DatasetGenerator, BasePrompt
from fabricator.dataset_transformations.text_classification import convert_label_ids_to_texts


def run():
4 changes: 2 additions & 2 deletions paper_experiments/snli_annotate_dataset.py
@@ -1,8 +1,8 @@
import os
from datasets import load_dataset, concatenate_datasets
from haystack.nodes import PromptNode
from ai_dataset_generator import DatasetGenerator, BasePrompt
from ai_dataset_generator.dataset_transformations.text_classification import convert_label_ids_to_texts
from fabricator import DatasetGenerator, BasePrompt
from fabricator.dataset_transformations.text_classification import convert_label_ids_to_texts


def run():
4 changes: 2 additions & 2 deletions paper_experiments/squad_annotate_dataset.py
@@ -4,8 +4,8 @@

from datasets import Sequence, Value, load_dataset, concatenate_datasets
from haystack.nodes import PromptNode
from ai_dataset_generator import DatasetGenerator, BasePrompt
from ai_dataset_generator.dataset_transformations.question_answering import (
from fabricator import DatasetGenerator, BasePrompt
from fabricator.dataset_transformations.question_answering import (
preprocess_squad_format,
postprocess_squad_format,
)
4 changes: 2 additions & 2 deletions paper_experiments/trec_annotate_dataset.py
@@ -1,8 +1,8 @@
import os
from datasets import load_dataset, concatenate_datasets
from haystack.nodes import PromptNode
from ai_dataset_generator import DatasetGenerator, BasePrompt
from ai_dataset_generator.dataset_transformations.text_classification import convert_label_ids_to_texts
from fabricator import DatasetGenerator, BasePrompt
from fabricator.dataset_transformations.text_classification import convert_label_ids_to_texts


def run():
4 changes: 2 additions & 2 deletions paper_experiments/trec_generate_dataset.py
@@ -1,8 +1,8 @@
import os
from datasets import load_dataset, concatenate_datasets
from haystack.nodes import PromptNode
from ai_dataset_generator import DatasetGenerator, BasePrompt
from ai_dataset_generator.dataset_transformations.text_classification import convert_label_ids_to_texts
from fabricator import DatasetGenerator, BasePrompt
from fabricator.dataset_transformations.text_classification import convert_label_ids_to_texts


def run():
4 changes: 2 additions & 2 deletions paper_experiments/trec_hyperparameter_generate_dataset.py
@@ -1,8 +1,8 @@
import os
from datasets import load_dataset, concatenate_datasets
from haystack.nodes import PromptNode
from ai_dataset_generator import DatasetGenerator, BasePrompt
from ai_dataset_generator.dataset_transformations.text_classification import convert_label_ids_to_texts
from fabricator import DatasetGenerator, BasePrompt
from fabricator.dataset_transformations.text_classification import convert_label_ids_to_texts

def run():
for possible_examples_per_class, fewshot_example_per_class in [(0,0), (2,2), (4,2), (4,3), (4,4), (8,2), (8,3),
Binary file added resources/logo_fabricator.drawio_dark.png
Binary file added resources/logo_fabricator.drawio_white.png
File renamed without changes.
@@ -18,7 +18,7 @@


class DatasetGenerator:
"""The DatasetGenerator class is the main class of the ai_dataset_generator package.
"""The DatasetGenerator class is the main class of the fabricator package.
It generates datasets based on a prompt template. The main function is generate()."""

def __init__(self, prompt_node: PromptNode, max_tries: int = 10):
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
7 changes: 4 additions & 3 deletions tests/test_dataset_generator.py
@@ -1,9 +1,10 @@
import unittest

from datasets import Dataset, load_dataset
from ai_dataset_generator import DatasetGenerator
from ai_dataset_generator.prompts import BasePrompt
from ai_dataset_generator.dataset_transformations.text_classification import convert_label_ids_to_texts

from fabricator import DatasetGenerator
from fabricator.prompts import BasePrompt
from fabricator.dataset_transformations.text_classification import convert_label_ids_to_texts


class TestDatasetGenerator(unittest.TestCase):
2 changes: 1 addition & 1 deletion tests/test_dataset_sampler.py
@@ -3,7 +3,7 @@
from collections import Counter
from datasets import load_dataset

from ai_dataset_generator.samplers import random_sampler, single_label_task_sampler, ml_mc_sampler, \
from fabricator.samplers import random_sampler, single_label_task_sampler, ml_mc_sampler, \
single_label_stratified_sample


8 changes: 4 additions & 4 deletions tests/test_dataset_transformations.py
@@ -2,10 +2,10 @@

from datasets import load_dataset

from ai_dataset_generator.prompts import BasePrompt
from ai_dataset_generator.dataset_transformations.question_answering import *
from ai_dataset_generator.dataset_transformations.text_classification import *
from ai_dataset_generator.dataset_transformations.token_classification import *
from fabricator.prompts import BasePrompt
from fabricator.dataset_transformations.question_answering import *
from fabricator.dataset_transformations.text_classification import *
from fabricator.dataset_transformations.token_classification import *


class TestTransformationsTextClassification(unittest.TestCase):
2 changes: 1 addition & 1 deletion tests/test_prompts.py
@@ -2,7 +2,7 @@

from datasets import load_dataset, Dataset, QuestionAnsweringExtractive, TextClassification, Summarization

from ai_dataset_generator.prompts import (
from fabricator.prompts import (
BasePrompt,
infer_prompt_from_task_template,
)
10 changes: 5 additions & 5 deletions tutorials/TUTORIAL-1_OVERVIEW.md
@@ -57,7 +57,7 @@ your prompt generates exactly one data point per prompt since the output will be
### Create a minimal prompt for generating text without fewshot examples

```python
from ai_dataset_generator.prompts import BasePrompt
from fabricator.prompts import BasePrompt

prompt_template = BasePrompt(task_description="Generate movie reviews.")
print(prompt_template.get_prompt_text())
@@ -75,7 +75,7 @@ options are a list of strings. When generating data, the `DatasetGenerator` clas
one of the label options and insert it into the task description such that the generated dataset is balanced.

```python
from ai_dataset_generator.prompts import BasePrompt
from fabricator.prompts import BasePrompt

label_options = ["positive", "negative"]
prompt_template = BasePrompt(
@@ -109,7 +109,7 @@ fewshot dataset that have the same label as used in the task description as exem

```python
from datasets import Dataset
from ai_dataset_generator.prompts import BasePrompt
from fabricator.prompts import BasePrompt

label_options = ["positive", "negative"]

@@ -152,7 +152,7 @@ annotated by the LLM as illustrated in the previous example.

```python
from datasets import Dataset
from ai_dataset_generator.prompts import BasePrompt
from fabricator.prompts import BasePrompt

label_options = ["positive", "negative"]

@@ -208,7 +208,7 @@ frameworks such as `transformers`.

```python
from datasets import Dataset
from ai_dataset_generator import DatasetGenerator
from fabricator import DatasetGenerator

fewshot_examples = Dataset.from_dict({
"text": ["This movie is great!", "This movie is bad!"],
16 changes: 8 additions & 8 deletions tutorials/TUTORIAL-2_SIMPLE-GENERATION.md
@@ -9,8 +9,8 @@ to the HuggingFace Hub.
```python
import os
from haystack.nodes import PromptNode
from ai_dataset_generator import DatasetGenerator
from ai_dataset_generator.prompts import BasePrompt
from fabricator import DatasetGenerator
from fabricator.prompts import BasePrompt

prompt = BasePrompt(
task_description="Generate a very very short movie review.",
@@ -39,8 +39,8 @@ this can be achieved by providing a `label_options` argument to the `BasePrompt`
```python
import os
from haystack.nodes import PromptNode
from ai_dataset_generator import DatasetGenerator
from ai_dataset_generator.prompts import BasePrompt
from fabricator import DatasetGenerator
from fabricator.prompts import BasePrompt

label_options = ["positive", "negative"]

@@ -87,8 +87,8 @@ sampling. In this case, we use the `label` column.
import os
from datasets import Dataset
from haystack.nodes import PromptNode
from ai_dataset_generator import DatasetGenerator
from ai_dataset_generator.prompts import BasePrompt
from fabricator import DatasetGenerator
from fabricator.prompts import BasePrompt

label_options = ["positive", "negative"]

@@ -138,8 +138,8 @@ argument since the generator will use the column specified in `generate_data_for
import os
from datasets import Dataset
from haystack.nodes import PromptNode
from ai_dataset_generator import DatasetGenerator
from ai_dataset_generator.prompts import BasePrompt
from fabricator import DatasetGenerator
from fabricator.prompts import BasePrompt

label_options = ["positive", "negative"]
