Upgrade get_dataset.tokenize() to multiprocessing #24
base: master
Conversation
get_dataset.tokenize() is too slow on a single CPU, so this PR upgrades it to multiprocessing by implementing the multiprocessing target function worker_tokenize(args_list). Additionally, a multiprocessing debug logger mp_logger was added, together with logger.debug() and mp_logger.debug() messages to track progress in the Python console.
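The PR itself is not shown here, so the following is only a hypothetical sketch of the chunk-and-pool pattern the description implies: worker_tokenize() and mp_logger are named in the PR, but tokenize_parallel(), the chunking parameters, and the whitespace-split stand-in tokenizer are assumptions.

```python
# Hypothetical sketch of the worker_tokenize() approach described in the PR.
# The real tokenizer in the repo differs; text.split() is only a stand-in.
import logging
import multiprocessing as mp

logger = logging.getLogger(__name__)
# Per-process multiprocessing logger; call mp.log_to_stderr(logging.DEBUG)
# to actually see the debug messages in the console.
mp_logger = mp.get_logger()

def worker_tokenize(args_list):
    """Multiprocessing target: tokenize one chunk of texts in a worker."""
    chunk = args_list
    mp_logger.debug("worker tokenizing %d texts", len(chunk))
    return [text.split() for text in chunk]  # stand-in for the real tokenizer

def tokenize_parallel(texts, n_workers=4, chunk_size=1000):
    """Split texts into chunks and tokenize them across n_workers processes."""
    chunks = [texts[i:i + chunk_size] for i in range(0, len(texts), chunk_size)]
    logger.debug("dispatching %d chunks to %d workers", len(chunks), n_workers)
    with mp.Pool(n_workers) as pool:
        results = pool.map(worker_tokenize, chunks)  # map preserves order
    # Flatten the per-chunk results back into one list.
    return [tokens for chunk_result in results for tokens in chunk_result]
```

Because pool.map preserves input order, the flattened output lines up with the input texts, which is what allows a nested dataset to be rebuilt after tokenization.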
Looks nice, thanks!
Thanks for reviewing, very nice project, happy you published it :) If there's anything else, let me know...
dataset = tokenize(dataset)
# dataset = tokenize(dataset)
absolutely!
# dataset = tokenize(dataset)
personachat = tokenize(personachat)
torch.save(personachat, dataset_cache)
# torch.save(personachat, dataset_cache)
of course!
# torch.save(personachat, dataset_cache)
torch.save(personachat, dataset_cache)
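The suggestion above restores the caching step so the tokenized dataset is saved and reused across runs. A minimal sketch of that check-cache-then-tokenize pattern, assuming the names personachat, dataset_cache, and tokenize from the diff (the repo uses torch.save/torch.load; plain pickle is substituted here only so the sketch runs without extra dependencies):

```python
# Hedged sketch of the caching pattern the suggested change restores.
# get_tokenized_dataset() is a hypothetical helper; the repo inlines this logic.
import os
import pickle

def get_tokenized_dataset(raw_dataset, dataset_cache, tokenize):
    """Return the tokenized dataset, loading it from cache when available."""
    if dataset_cache and os.path.isfile(dataset_cache):
        with open(dataset_cache, "rb") as f:
            return pickle.load(f)          # reuse cached tokenized data
    personachat = tokenize(raw_dataset)    # the slow step this PR parallelizes
    with open(dataset_cache, "wb") as f:
        pickle.dump(personachat, f)        # cache for the next run
    return personachat
```

On a second run the expensive tokenize() call is skipped entirely, which is why leaving torch.save() commented out would silently defeat the cache.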
The question would be, if
@thomwolf, please could we get this merged? Thank you.
@thomwolf, before merging: I did some work on parallelizing the complete preprocessing chain, affecting quite some code in 'train.py' and 'utils.py'. I could clean up the code and create a new pull request with e.g. two new files, 'utils_multiprocessing.py' and 'train_multiprocessing.py'. This way merging would become very easy, and backward compatibility for everybody is guaranteed. Just let me know if you are interested in merging such a speedup ⏩ 💨