Support packing for pretokenized datasets #1848
Comments
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread.
@lvwerra, requesting your opinion.
@kmehant, thanks for sharing this feature request.
@qgallouedec, thanks for circling back. In my opinion, supporting this is not complex. Here is a version implementing it: https://github.com/kmehant/trl/tree/pack-pretok (changes / comparison with main: https://github.com/huggingface/trl/compare/main...kmehant:trl:pack-pretok?expand=1). Steps to try this version: install trl from my fork, then run the sample training code below.
Sample training code:

```python
from trl import SFTTrainer
from transformers import AutoTokenizer
import datasets

# Build a tiny pretokenized dataset: ten copies of one tokenized sentence.
tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
t = tok.encode("We adopted exactly the same architecture and tokenizer as Llama 2.")
d = {"input_ids": [t] * 10}
data = datasets.Dataset.from_dict(d)

# Hand the pretokenized dataset to SFTTrainer with packing enabled.
trainer = SFTTrainer(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    train_dataset=data,
    max_seq_length=10,
    packing=True,
)
trainer.train()
```

Sample output looks like
Thank you. I can raise a PR out of this and add tests as needed.
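To make concrete what packing produces for a pretokenized dataset like the toy one above, here is a naive stand-alone illustration (my own sketch; trl's actual packing logic lives in `ConstantLengthDataset` and differs in detail): all token sequences are concatenated into one stream and sliced into fixed-size blocks.

```python
from itertools import chain

# Naive packing: concatenate all token sequences into one stream and
# slice it into fixed-size blocks (a tail shorter than block_size is
# dropped). Illustration only; trl's implementation differs.
def pack(sequences, block_size):
    stream = list(chain.from_iterable(sequences))
    return [
        stream[i : i + block_size]
        for i in range(0, len(stream) - block_size + 1, block_size)
    ]

# Three short "tokenized" sequences pack into a single 10-token block.
print(pack([[1, 2, 3, 4], [5, 6, 7], [8, 9, 10, 11, 12]], block_size=10))
# [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]
```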
Thanks! It's actually simpler than I expected.
@qgallouedec I have raised a PR here: #2011
Thanks, included that in the PR.
@qgallouedec, any update on this thread? Thanks.
Original issue description:

At this point, trl returns the dataset as-is if the provided dataset shows signs of being tokenized already (see `trl/trainer/sft_trainer.py`, line 503 at commit `98ad01d`).
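For reference, here is a paraphrased sketch of that guard, assuming it keys off the presence of an `input_ids` column (the exact code at the linked line may differ):

```python
import datasets

def looks_pretokenized(dataset):
    # Paraphrase of the check around trl/trainer/sft_trainer.py:503
    # (commit 98ad01d): if the dataset already carries an "input_ids"
    # column, trl treats it as pretokenized and returns it unchanged,
    # skipping further preprocessing -- including packing.
    column_names = (
        dataset.column_names
        if isinstance(dataset, (datasets.Dataset, datasets.IterableDataset))
        else None
    )
    return column_names is not None and "input_ids" in column_names
```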
Additionally, I see that `ConstantLengthDataset` (`trl/trainer/utils.py`, line 426 at commit `98ad01d`) has been written to support only data that is not pretokenized; it should be possible to extend it to the pretokenized case as well, as sketched below.
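A minimal sketch of what such an extension could look like: a hypothetical packed iterable dataset for examples that already carry `input_ids`, skipping the tokenizer call that `ConstantLengthDataset` performs today (class name and details are mine, not from the fork).

```python
import torch
from torch.utils.data import IterableDataset

class PackedPretokenizedDataset(IterableDataset):
    # Hypothetical sketch, not trl's ConstantLengthDataset: packs
    # examples that already contain token ids into fixed-length blocks.
    def __init__(self, dataset, seq_length, concat_token_id):
        self.dataset = dataset
        self.seq_length = seq_length
        self.concat_token_id = concat_token_id  # e.g. the eos token id

    def __iter__(self):
        buffer = []
        for example in self.dataset:
            # No tokenizer call here: the example is already token ids.
            buffer.extend(example["input_ids"])
            buffer.append(self.concat_token_id)
            # Emit a block whenever the buffer holds seq_length tokens.
            while len(buffer) >= self.seq_length:
                chunk, buffer = buffer[: self.seq_length], buffer[self.seq_length :]
                yield {
                    "input_ids": torch.LongTensor(chunk),
                    "labels": torch.LongTensor(chunk),
                }
```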
Is there any interest in supporting packing for pretokenized datasets? If so, I would be interested in contributing.