Checkpointing models during training #49

Closed
sibyjackgrove opened this issue Aug 11, 2021 · 7 comments

Labels
enhancement New feature or request

Comments

@sibyjackgrove

It seems the Keras ModelCheckpoint callback doesn't work with TF-DF. Is there an alternate way to create checkpoints during training? I am training on a dataset with tens of millions of samples and it takes several hours to train. I want to save the progress so that it doesn't need to retrain from scratch in case training crashes.

@janpfeifer
Contributor

Hi @sibyjackgrove, you are right that the ModelCheckpoint callback doesn't work. Right now TF-DF doesn't support checkpointing -- it isn't a back-propagation-based algorithm, so checkpointing would have to work differently.

We are working on a distributed version that will likely support this -- and train faster on large datasets.

I'm marking the issue as an "enhancement" to keep it on our tracking list.

@janpfeifer janpfeifer added the enhancement New feature or request label Aug 18, 2021
@sibyjackgrove
Author

@janpfeifer Thank you for the update.

@ThaiTLy

ThaiTLy commented Sep 7, 2021

Hi @sibyjackgrove, I'm working with a dataset of a little over 42 GB as well, and I've been looking for a way to train it in batches, maybe in a training loop with TF-DF. Since ModelCheckpoint is not supported, does that mean I can't train the model in batches? My dataset is so big that it won't fit in RAM as a single dataframe, so I really need to train it in batches. For your model, do you just train on everything at once?

@janpfeifer
Contributor

Hi @ThaiTLy, that will indeed be hard to train: the DF algorithms require preloading the whole dataset into memory. It's not gradient descent -- training works differently from NNs -- so there are also no checkpoints.

There is work going on to distribute the training, in which case a 42 GB dataset should train fine (across multiple machines, though).

One straightforward alternative in such cases would be to split the dataset, train separate TF-DF models, and then ensemble (just average) them using Keras/TensorFlow -- it is pretty simple; I wrote an example of ensembling models for #48:

https://colab.research.google.com/drive/17LEDCwsf1-x2cBKz0J43SES8EDeiy-AB?usp=sharing

The difference would be that one would train each model on a subset of the data, probably using a larger model type (GBDT or RandomForest), and simply average the results (as opposed to adding a layer on top).
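
A rough sketch of that split-and-average idea (this is not the code from the linked colab; dataset_shards and eval_ds are illustrative placeholders for the per-split tf.data.Datasets):

import tensorflow as tf
import tensorflow_decision_forests as tfdf

# Train one model per data shard; each shard still has to fit in memory.
models = []
for shard in dataset_shards:
    m = tfdf.keras.GradientBoostedTreesModel()
    m.fit(shard)
    models.append(m)

# Ensemble by averaging the per-model predictions.
def ensemble_predict(ds):
    per_model = [m.predict(ds) for m in models]
    return tf.reduce_mean(tf.stack(per_model, axis=0), axis=0)

predictions = ensemble_predict(eval_ds)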

@sibyjackgrove
Author

@ThaiTLy ModelCheckpoint is not required for training in batches; it is more about recovering from crashes that may occur during training. For my problem, I had a dataset of about 45 GB. Fortunately, I was using a compute node with more than 200 GB of memory.
One potential way is to use tf.data.experimental.make_csv_dataset to create a tf.data.Dataset that streams the data from disk. I have used it and know it works, but I don't know whether the full dataset is still loaded into RAM during training.

# file_path, new_column_names, csv_feature_columns, and model are defined elsewhere.
import tensorflow as tf
from wurlitzer import sys_pipes  # surfaces the C++ training logs in notebooks

initial_batch_size = 64
# Stream the CSV files from disk instead of loading the whole set into a dataframe.
train_ds = tf.data.experimental.make_csv_dataset(
    file_pattern=file_path, batch_size=initial_batch_size,
    column_names=new_column_names, select_columns=csv_feature_columns,
    label_name="label", shuffle=False, num_epochs=1,
    prefetch_buffer_size=initial_batch_size * 10,
    num_parallel_reads=2, ignore_errors=True)
with sys_pipes():
    model.fit(train_ds)

@achoum
Collaborator

achoum commented Sep 9, 2021

Hi @sibyjackgrove,

Yggdrasil DF supports training checkpoints. However, the logic is not linked into the TensorFlow Decision Forests wrapper (yet). Until this is done, training checkpoints can be configured directly using the advanced_arguments.

Here is an example:

# Every 10 seconds, create a checkpoint in "/tmp/training_cache"
adv_args = tfdf.keras.AdvancedArguments(
    yggdrasil_deployment_config=tfdf.keras.core.YggdrasilDeploymentConfig(
        try_resume_training=True,
        resume_training_snapshot_interval_seconds=10,
        cache_path="/tmp/training_cache",
    )
)

# A very long training :)
model = tfdf.keras.GradientBoostedTreesModel(advanced_arguments=adv_args,
                                             num_trees=100000,
                                             early_stopping="NONE")

with sys_pipes():
  model.fit(train_ds)

If you run the example above, after ~10 seconds you will see "Create a snapshot of the model at iteration ...". If you stop the training (e.g. kill -9 the colab instance or simply press the "stop" button) and resume it, you should see lines like "Resume the GBT training from tree ...".
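
A minimal resume sketch (assuming the same adv_args, pointing at the same cache_path, and the same train_ds as above): re-create the model and call fit() again; with try_resume_training=True, training restarts from the latest snapshot instead of from scratch.

model = tfdf.keras.GradientBoostedTreesModel(advanced_arguments=adv_args,
                                             num_trees=100000,
                                             early_stopping="NONE")
with sys_pipes():
  model.fit(train_ds)  # picks up from the last snapshot in /tmp/training_cache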

@sibyjackgrove
Author

@achoum Thanks! I tried this out and it works nicely. A very useful feature. @ThaiTLy Maybe you can use this as well.

@achoum achoum closed this as completed Nov 1, 2021