The Ede Python library automates the generations of instruction fine-tuning datasets in low-resource languages. PyPI package coming soon. Using GPT-4, it takes 0.5-1.5s per generation (using batch processing) so ~1 day to create 100,000 generations.
The full API for this library is coming soon. For now, you can clone the repository, install requirements and run the run.py
script.
pip install -r requirements.txt
python run.py
Don't forget to add your OpenAI API key to the .env file.
OPENAI_API_KEY = "your_openai_api_key"
The following parameters are editable. It is possible to run with default settings using only target_language and size.
import Ede
model={"provider": "", "model": ""} # accepts openai and anthropic as providers, although anthropic is less performant as it currently lacks a reliable JSON mode.
target_language = "" # target language e.g. Yoruba
data_dir="data" # data directory (defaults to data). Should contain input, output, schemas, and seeds folders (more info on folder structure below)
size=100 # dataset size (defaults to 100)
pipeline = Ede(
target_language=target_language,
model=model,
data_dir=data_dir,
size=size,
)
pipeline.run()
.
├── ...
├── data # Directory containing data for the project
│ ├── input # Input folder contains input files with column names variable_1, variable_2, ..., variable_n
| │ ├── input_1.csv
| │ ├── ...
| │ └── input_n.csv
│ ├── output # Output folder is where generated output is saved as output.csv
│ ├── prompts # Contains template system and user prompts in .txt format
│ ├── schemas # Contains schemas for input and output files (see below)
| │ ├── input_schema.csv
| │ └── output_schema.csv
│ ├── seeds # Contains seed tasks in .jsonl format (see below)
| └────── seed_tasks.jsonl
└── ...
file_name | task_category | task_description | variables | total |
---|---|---|---|---|
yosm.csv | classification | sentiment analysis of movie reviews | {""variable_1"": ""movie review"", ""variable_2"": ""sentiment""} | 1501 |
task_category | task_description | percent | total |
---|---|---|---|
classification | sentiment analysis of movie reviews | categorize or label the provided content into predefined classes or groups | 0.03 |
We are keen for your feedback; please open an issue with questions, bugs, or suggestions.
Python 3.7 or higher.