Example usage:
- Clone the Hugging Face (HF) repo
- Use merge.py to merge the text files, for example:
python merge.py -d repo/pol -o pol-merged.txt
- Then convert pol-merged.txt to JSON or CSV format:
python tokenizer.py pol-merged.txt pol-dataset.json
or
python tokenizer.py pol-merged.txt pol-dataset.csv
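
The two steps above can also be chained from Python. The following is a minimal sketch, not part of the repo: the helper names build_commands and run_pipeline are hypothetical, it assumes merge.py and tokenizer.py sit in the current directory, and it assumes (as the examples above suggest) that tokenizer.py picks the output format from the output file's extension.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(repo_dir, subset, fmt="json"):
    """Return the merge and conversion commands for one subset as argv lists.

    Hypothetical helper: only the flags shown in the usage notes
    (-d, -o, and the two positional tokenizer.py arguments) are used.
    """
    if fmt not in ("json", "csv"):
        raise ValueError("fmt must be 'json' or 'csv'")
    merged = f"{subset}-merged.txt"
    dataset = f"{subset}-dataset.{fmt}"
    return [
        # Step 1: merge the text files under <repo_dir>/<subset>
        [sys.executable, "merge.py", "-d", str(Path(repo_dir) / subset), "-o", merged],
        # Step 2: convert the merged file (format assumed to follow the extension)
        [sys.executable, "tokenizer.py", merged, dataset],
    ]


def run_pipeline(repo_dir, subset, fmt="json"):
    """Run both steps in order, stopping at the first one that fails."""
    for cmd in build_commands(repo_dir, subset, fmt):
        subprocess.run(cmd, check=True)
```

For example, run_pipeline("repo", "pol", "csv") would run the same two commands as above and write pol-dataset.csv.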