Checkpointing models during training #49

Closed
sibyjackgrove opened this issue Aug 11, 2021 · 7 comments

Labels
enhancement New feature or request

Comments

@sibyjackgrove

It seems the Keras ModelCheckpoint callback doesn't work with TF-DF. Is there an alternate way to create checkpoints during training? I am training on a dataset with tens of millions of samples and it takes several hours to train. I want to save the progress so that it doesn't need to retrain from scratch in case training crashes.

@janpfeifer
Contributor

Hi @sibyjackgrove, you are right that the ModelCheckpoint callback doesn't work. Right now TF-DF doesn't support checkpointing -- it isn't a back-propagation-based algorithm, so checkpointing would have to work differently.

We are working on a distributed version that will likely support this -- and train faster on large datasets.

I'm marking the issue as an "enhancement" to keep it on our tracking list.

@janpfeifer janpfeifer added the enhancement New feature or request label Aug 18, 2021
@sibyjackgrove
Author

@janpfeifer Thank you for the update.

@ThaiTLy

ThaiTLy commented Sep 7, 2021

Hi @sibyjackgrove, I'm working with a dataset of a little over 42 GB as well, and I've been looking for a way to train it in batches, maybe in a training loop with TF-DF. Since ModelCheckpoint is not supported, does that mean I can't train the model in batches? My dataset is so big that it won't fit in RAM as a single dataframe, so I really need to train it in batches. For your model, do you just train on everything at once?

@janpfeifer
Contributor

Hi @ThaiTLy, that will indeed be hard to train: the DF algorithms require preloading the whole dataset into memory. It's not gradient descent -- training works differently from NNs -- so there are also no checkpoints.

There is work going on to distribute the training, in which case a 42 GB dataset should train fine (across multiple machines, though).

One straightforward alternative in such cases would be to split the dataset, train separate TF-DF models, and then ensemble (just average) them using Keras/TensorFlow -- it is pretty simple; I wrote an example of ensembling models for #48:

https://colab.research.google.com/drive/17LEDCwsf1-x2cBKz0J43SES8EDeiy-AB?usp=sharing

The difference would be that one would train each model on a subset of the data, probably using a larger model type (GBDT or RandomForest), and simply average the results (as opposed to adding a layer on top).
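
A rough sketch of that split-and-average idea (this is not the code from the linked colab; dataset_shards and eval_ds are illustrative placeholders for the per-split tf.data.Datasets):

import tensorflow as tf
import tensorflow_decision_forests as tfdf

# Train one model per data shard; each shard still has to fit in memory.
models = []
for shard in dataset_shards:
    m = tfdf.keras.GradientBoostedTreesModel()
    m.fit(shard)
    models.append(m)

# Ensemble by averaging the per-model predictions.
def ensemble_predict(ds):
    per_model = [m.predict(ds) for m in models]
    return tf.reduce_mean(tf.stack(per_model, axis=0), axis=0)

predictions = ensemble_predict(eval_ds)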

@sibyjackgrove
Author

@ThaiTLy ModelCheckpoint is not required for training in batches; it is more about recovering from crashes that may occur during training. For my problem, I had a dataset of about 45 GB. Fortunately, I was using a compute node with more than 200 GB of memory.
One potential way is to use tf.data.experimental.make_csv_dataset to create a tf.data.Dataset that streams the data from disk. I have used it and know it works, but I don't know whether the full dataset is still loaded into RAM during training.

# file_path, new_column_names, csv_feature_columns, and model are defined elsewhere.
import tensorflow as tf
from wurlitzer import sys_pipes  # surfaces the C++ training logs in notebooks

initial_batch_size = 64
# Stream the CSV files from disk instead of loading the whole set into a dataframe.
train_ds = tf.data.experimental.make_csv_dataset(
    file_pattern=file_path, batch_size=initial_batch_size,
    column_names=new_column_names, select_columns=csv_feature_columns,
    label_name="label", shuffle=False, num_epochs=1,
    prefetch_buffer_size=initial_batch_size * 10,
    num_parallel_reads=2, ignore_errors=True)
with sys_pipes():
    model.fit(train_ds)

@achoum
Collaborator

achoum commented Sep 9, 2021

Hi @sibyjackgrove,

Yggdrasil DF supports training checkpoints. However, the logic is not linked into the TensorFlow Decision Forests wrapper (yet). Until this is done, training checkpoints can be configured directly using the advanced_arguments.

Here is an example:

# Every 10 seconds, create a checkpoint in "/tmp/training_cache"
adv_args = tfdf.keras.AdvancedArguments(
    yggdrasil_deployment_config=tfdf.keras.core.YggdrasilDeploymentConfig(
        try_resume_training=True,
        resume_training_snapshot_interval_seconds=10,
        cache_path="/tmp/training_cache",
    )
)

# A very long training :)
model = tfdf.keras.GradientBoostedTreesModel(advanced_arguments=adv_args,
                                             num_trees=100000,
                                             early_stopping="NONE")

with sys_pipes():
  model.fit(train_ds)

If you run the example above, after ~10 seconds you will see "Create a snapshot of the model at iteration ...". If you stop the training (e.g. kill -9 the colab instance or simply press the "stop" button) and resume it, you should see lines like "Resume the GBT training from tree ...".
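
A minimal resume sketch (assuming the same adv_args, pointing at the same cache_path, and the same train_ds as above): re-create the model and call fit() again; with try_resume_training=True, training restarts from the latest snapshot instead of from scratch.

model = tfdf.keras.GradientBoostedTreesModel(advanced_arguments=adv_args,
                                             num_trees=100000,
                                             early_stopping="NONE")
with sys_pipes():
  model.fit(train_ds)  # picks up from the last snapshot in /tmp/training_cache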

@sibyjackgrove
Author

@achoum Thanks! I tried this out and it works nicely. A very useful feature. @ThaiTLy Maybe you can use this as well.

@achoum achoum closed this as completed Nov 1, 2021