Checkpointing models during training #49
Comments
Hi @sibyjackgrove, you are right about the ModelCheckpoint callback not working. Right now TF-DF doesn't support checkpointing -- it trains differently than back-propagation based algorithms, so checkpointing would have to work differently too. We are working on a distributed version that will likely support it -- and train faster on large datasets. I'm marking the issue as an "enhancement" to keep it on our tracking list.
@janpfeifer Thank you for the update.
Hi @sibyjackgrove, I'm working with a dataset of a little over 42 GB as well, and I've been looking for a way to train it in batches, maybe in a training loop with TF-DF. Since ModelCheckpoint is not supported, does that mean I can't train the model in batches? My dataset is so big that my RAM can't hold the whole set in one dataframe, so I really need to train it in batches. For your model, do you just train on everything at once?
Hi @ThaiTLy, indeed that will be hard to train, since the DF algorithms require preloading the whole dataset into memory -- it's not gradient descent, it trains differently from NNs, so there are also no checkpoints. There is work going on to distribute the training, in which case a 42 GB dataset should train fine (across multiple machines though). One straightforward alternative in such cases would be to split the dataset, train separate TF-DF models, and then ensemble (just average) them using Keras/TensorFlow -- it is pretty simple; I wrote an example of ensembling models for #48: https://colab.research.google.com/drive/17LEDCwsf1-x2cBKz0J43SES8EDeiy-AB?usp=sharing The difference would be that one would train each model on a subset of the data, probably using a larger model type (GBDT or RandomForest), and simply average the results (as opposed to adding a layer on top), as sketched below.
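To make the split-and-average idea concrete, here is a minimal sketch. `NUM_SHARDS`, `make_shard_dataset`, and `test_ds` are hypothetical placeholders (not part of the TF-DF API) for however the large dataset is split into memory-sized pieces and evaluated; the colab linked above for #48 shows how to wrap the same averaging into a single Keras model instead of averaging with NumPy.

```python
# Minimal sketch of the split-and-average approach described above.
import numpy as np
import tensorflow_decision_forests as tfdf

NUM_SHARDS = 4  # hypothetical: number of subsets that each fit in memory

# Train one TF-DF model per data shard.
models = []
for shard_index in range(NUM_SHARDS):
    shard_ds = make_shard_dataset(shard_index)  # hypothetical loader returning a tf.data.Dataset
    model = tfdf.keras.GradientBoostedTreesModel()
    model.fit(shard_ds)
    models.append(model)

# Ensemble by averaging the per-shard predictions.
# `test_ds` is assumed to be a tf.data.Dataset with the same features as the shards.
ensemble_predictions = np.mean([m.predict(test_ds) for m in models], axis=0)
```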
@ThaiTLy ModelCheckpoint is not required for training in batches; it is mainly for recovering from crashes that may occur during training. For my problem, I had a dataset of about 45 GB. Fortunately, I was using a compute node with more than 200 GB of memory.
Hi @sibyjackgrove, Yggdrasil DF supports training checkpoints. However, the logic is not linked into the TensorFlow Decision Forests wrapper (yet). Until this is done, training checkpoints can be configured directly through the Yggdrasil deployment configuration passed via `tfdf.keras.AdvancedArguments`. Here is an example:

```python
import tensorflow_decision_forests as tfdf
from wurlitzer import sys_pipes  # forwards the C++ training logs to the notebook

# Every 10 seconds, create a checkpoint in "/tmp/training_cache".
adv_args = tfdf.keras.AdvancedArguments(
    yggdrasil_deployment_config=tfdf.keras.core.YggdrasilDeploymentConfig(
        try_resume_training=True,
        resume_training_snapshot_interval_seconds=10,
        cache_path="/tmp/training_cache",
    )
)

# A very long training :)
model = tfdf.keras.GradientBoostedTreesModel(
    advanced_arguments=adv_args,
    num_trees=100000,
    early_stopping="NONE",
)

with sys_pipes():
    model.fit(train_ds)  # train_ds is the training tf.data.Dataset
```

If you run the example above, after ~10 seconds you will see snapshots appearing in "/tmp/training_cache".
It seems the Keras `ModelCheckpoint` callback doesn't work with TF-DF. Is there an alternate way to create checkpoints during training? I am training on a dataset with tens of millions of samples and it takes several hours to train. I want to save the progress so that it doesn't need to retrain from scratch in case training crashes.