Implementation of Levenberg-Marquardt training for models that inherits from tf.keras.Model
. The algorithm has been extended to support mini-batch training for both regression and classification problems.
A PyTorch implementation is also available: torch-levenberg-marquardt
- MeanSquaredError
- ReducedOutputsMeanSquaredError
- CategoricalCrossentropy
- SparseCategoricalCrossentropy
- BinaryCrossentropy
- SquaredCategoricalCrossentropy
- CategoricalMeanSquaredError
- Support for custom losses
Implemented damping algorithms
- Standard:
$\large J^T J + \lambda I$ - Fletcher:
$\large J^T J + \lambda ~ \text{diag}(J^T J)$ - Support for custom damping algorithm
More details on Levenberg-Marquardt can be found on this page.
Suppose that model
and train_dataset
has been created, it is possible to train the model in two different ways.
This is similar to the standard way models are trained with keras, the difference is that the fit method is called on a ModelWrapper
class instead of calling it directly on the model
class.
import tf_levenberg_marquardt as lm
model_wrapper = lm.model.ModelWrapper(model)
model_wrapper.compile(
optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
loss=lm.loss.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy'])
model_wrapper.fit(train_dataset, epochs=5)
This alternative is less flexible than Model.fit as for example it does not support callbacks and is only trainable using tf.data.Dataset
.
import levenberg_marquardt as lm
trainer = lm.training.Trainer(
model=model,
optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
loss=lm.loss.SparseCategoricalCrossentropy(from_logits=True))
trainer.fit(
dataset=train_dataset,
epochs=5,
metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
In order to achieve the best performances out of the training algorithm some considerations have to be done, so that the batch size and the number of weights of the model are chosen properly. The Levenberg–Marquardt algorithm is used to solve the least squares problem
where
The Levenberg–Marquardt updates can be computed using two equivalent formulations
given that the size of the jacobian matrix is [num_residuals x num_weights].
In the first equation
Equation
Equation
where the hessian approximation
From experiments I usually got better convergence using quite large batch sizes which statistically represent the entire dataset, but more experiments needs to be done with small batches and momentum. However, if the model is big the usage of large batch sizes may not be possible. When the problem is overdetermined, the size of the linear system to solve at each step might be too large. When the problem is underdetermined, the full jacobian matrix might be too large to be stored. Possible applications that could benefit from this algorithm are those training only a small number of parameters simultaneously. As for example: fine tuning, composed models and overfitting reduction using small models. In these cases the Levenberg–Marquardt algorithm could converge much faster then first-order methods.
Simple curve fitting test implemented in test_curve_fitting.py
and Colab. The function y = sinc(10 * x)
is fitted using a Shallow Neural Network with 61 parameters.
Despite the triviality of the problem, first-order methods such as Adam fail to converge, while Levenberg–Marquardt converges rapidly with very low loss values. The values of learning_rate were chosen experimentally on the basis of the results obtained by each algorithm.
Here the results with Adam for 10000 epochs and learning_rate=0.01
Train using Adam
Epoch 1/10000
20/20 [==============================] - 0s 500us/step - loss: 0.0479
...
Epoch 10000/10000
20/20 [==============================] - 0s 449us/step - loss: 6.5928e-04
Elapsed time: 157.10102080000001
Here the results with Levenberg–Marquardt for 100 epochs and learning_rate=1.0
Train using Levenberg-Marquardt
Epoch 1/100
20/20 [==============================] - 0s 6ms/step - damping_factor: 1.0000e-05 - attempts: 2.0000 - loss: 0.0232
...
Epoch 100/100
20/20 [==============================] - 0s 5ms/step - damping_factor: 1.0000e-07 - attempts: 1.0000 - loss: 1.0265e-08
Elapsed time: 14.7972407
Common mnist classification test implemented in test_mnist_classification.py
and Colab. The classification is performed using a Convolutional Neural Network with 1026 parameters.
This time there were no particular benefits as in the previous case. Even if Levenberg–Marquardt converges with far fewer epochs than Adam, the longer execution time per step nullifies its advantages.
However, both methods achieve roughly the same accuracy values on train and test set.
Here the results with Adam for 300 epochs and learning_rate=0.01
Train using Adam
Epoch 1/200
10/10 [==============================] - 0s 28ms/step - loss: 2.0728 - accuracy: 0.3072
...
Epoch 200/200
10/10 [==============================] - 0s 14ms/step - loss: 0.0737 - accuracy: 0.9782
Elapsed time: 58.071762045000014
test_loss: 0.072342 - test_accuracy: 0.977100
Here the results with Levenberg–Marquardt for 10 epochs and learning_rate=0.1
Train using Levenberg-Marquardt
Epoch 1/10
10/10 [==============================] - 4s 434ms/step - damping_factor: 1.0000e-06 - attempts: 3.0000 - loss: 1.1267 - accuracy: 0.5803
...
Epoch 10/10
10/10 [==============================] - 5s 450ms/step - damping_factor: 1.0000e-05 - attempts: 2.0000 - loss: 0.0683 - accuracy: 0.9777
Elapsed time: 54.859995966999975
test_loss: 0.072407 - test_accuracy: 0.977800
- Tensorflow 2.18.0
- Matplotlib to visualize curve fitting results