[DO NOT MERGE]
This is a WIP implementation of the LAMB optimizer from the BERT large-batch training paper. It reportedly allows training to scale to very large batches. There are some ambiguities: the algorithm differs between v1 and v2/v3 of the paper, some definitions are blurry, there is no official implementation yet (a few unofficial ones exist, but they differ on several points), the paper gives no clear learning_rate schedule despite its detailed experiments, etc.
Significant tuning may also be needed to find appropriate values for our tasks.
I'm opening this PR as a placeholder for future work, once we have more information.
The current version here is based on https://github.com/cybertronai/pytorch-lamb, which is itself based on `torch.optim.Adam`.
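
For reference, here is a minimal sketch of one LAMB step for a single layer, written in plain Python so the layer-wise trust ratio is easy to see. It is not the code in this PR or in pytorch-lamb. The function name `lamb_step`, the default hyperparameters, the use of Adam-style bias correction, and the fallback trust ratio of 1.0 when a norm is zero are all assumptions for illustration; as noted above, these are exactly the points on which the paper versions and the existing implementations disagree.

```python
import math


def lamb_step(w, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.0):
    """One illustrative LAMB update for a single layer (lists of floats).

    Hypothetical helper: names and defaults are assumptions, not taken
    from the paper or from any existing implementation.
    """
    # Adam-style first and second moment estimates.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]

    # Bias correction (assumption: some variants skip this step).
    m_hat = [mi / (1 - beta1 ** step) for mi in m]
    v_hat = [vi / (1 - beta2 ** step) for vi in v]

    # Adam direction plus decoupled weight decay.
    update = [mh / (math.sqrt(vh) + eps) + weight_decay * wi
              for mh, vh, wi in zip(m_hat, v_hat, w)]

    # Layer-wise trust ratio: ||w|| / ||update||.
    w_norm = math.sqrt(sum(x * x for x in w))
    u_norm = math.sqrt(sum(x * x for x in update))
    # Assumption: fall back to 1.0 when either norm is zero.
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0

    new_w = [wi - lr * trust * ui for wi, ui in zip(w, update)]
    return new_w, m, v, trust


# Usage: one step on a two-parameter "layer".
w0 = [3.0, 4.0]
w1, m1, v1, trust1 = lamb_step(w0, [1.0, 1.0], [0.0, 0.0], [0.0, 0.0], step=1)
```

With a non-zero trust ratio, the length of the step equals `lr * ||w||` whatever the gradient scale, which is the property that is supposed to make very large batches workable.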