Updated Shampoo uber slow performance #100
Thanks for reporting! Could you tell me the model (e.g. resnet50) and the parameters of the Shampoo optimizer? Actually, I didn't test on many configurations, but it seems that the pre-conditioning (based on the Google impl) is much slower than I expected. I'll figure it out.
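For context on why the pre-conditioning step dominates: Shampoo keeps a statistics matrix per tensor dimension and periodically takes an inverse p-th root of each, which is cubic in the matrix dimension. A minimal sketch of that cost, with illustrative shapes and epsilon (not the library's actual code):

```python
# Sketch of Shampoo-style pre-conditioning for a 2D weight. The
# eigendecomposition below is O(n^3), which is why it dominates step time
# for large layers. Shapes and epsilon are illustrative assumptions.
import torch

def inverse_pth_root(stat: torch.Tensor, p: int, eps: float = 1e-6) -> torch.Tensor:
    """Return stat^(-1/p) for a symmetric PSD statistics matrix."""
    # Symmetric eigendecomposition: stat = Q diag(w) Q^T
    w, q = torch.linalg.eigh(stat + eps * torch.eye(stat.shape[0], device=stat.device))
    return q @ torch.diag(w.pow(-1.0 / p)) @ q.t()

# For a 2D weight of shape (m, n), Shampoo keeps left (m x m) and right
# (n x n) statistics and preconditions the gradient as L^(-1/4) G R^(-1/4).
m, n = 1024, 1024
grad = torch.randn(m, n)
L = grad @ grad.t()   # left statistics (accumulated over steps in practice)
R = grad.t() @ grad   # right statistics
update = inverse_pth_root(L, 4) @ grad @ inverse_pth_root(R, 4)
```

Practical implementations amortize this by recomputing the inverse roots only every N steps rather than on every update.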
This is the configuration I am using for the Mixer MLP: token_size = 128, token_count = 16. This is roughly a 200M-parameter network.
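For reference, a hypothetical way to attach Shampoo to a model like this via the pytorch_optimizer package (the stand-in model and hyperparameters below are illustrative assumptions, not the reporter's exact setup):

```python
# Hypothetical configuration sketch; the tiny model and lr value are
# placeholders standing in for the ~200M-parameter Mixer MLP.
import torch
from pytorch_optimizer import Shampoo  # as exported by the pytorch_optimizer package

model = torch.nn.Sequential(
    torch.nn.Linear(128, 512),
    torch.nn.GELU(),
    torch.nn.Linear(512, 128),
)
optimizer = Shampoo(model.parameters(), lr=1e-3)
```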
I'm working on #101 and tested it on my local machine (GTX 1060 6GB). It took longer than expected. I'll check more and release the package with a new version. Here's the benchmark code.
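The benchmark code itself wasn't preserved in this thread. As a rough illustration only (the model, input sizes, and step count are placeholders, not the maintainer's actual benchmark), a minimal harness to measure seconds per optimizer step might look like this:

```python
# Minimal timing harness, illustration only; not the maintainer's benchmark.
import time
import torch

def time_steps(model, optimizer, steps: int = 10, batch: int = 8) -> float:
    """Return average seconds per optimizer step."""
    x = torch.randn(batch, 128)
    target = torch.randn(batch, 128)
    loss_fn = torch.nn.MSELoss()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # make GPU timing accurate
    start = time.perf_counter()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(x), target)
        loss.backward()
        optimizer.step()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / steps
```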
I released a new version, v2.4.0, with the fixes! Please check whether there's still a performance issue with your settings. Best regards
Much faster, but still taking 114 seconds per iteration. Same GPU model, but a slightly bigger model (300M parameters) in this case, as this is the GPU that just finished an epoch. For reference, Nero does 2 iterations per second.
Oh, thanks for testing. Then there's still a problem; I'll do more investigation on that. Thanks in advance!
Let me know when you want me to test something.
I just deployed a new version. Any feedback & requests are welcome! Here are the benchmarks:

- backbone: resmlp_12_distilled_224, bs: 16 -> x2.5 faster
- backbone: mixer_b16_224, bs: 8 -> x0.5 faster
Much better, but still too slow for the depth I am working at. Nero is doing a great job.
@redknightlois I did more work (#128, #129) on the scalable Shampoo optimizer (code cleanup, PyTorch optimizations, new default parameters, ...) and just released v2.6.0. Maybe it's much faster than before because of the changed defaults. Also, I'm roughly guessing that the current implementation is nearly optimal for scalable Shampoo (with synchronous preconditioner updates on a single GPU). So, how about closing this issue for now? (If there's news, I'll re-open it or create another issue.) If there are any requests, please feel free to use it & leave feedback. Thank you!
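One common speed lever in scalable Shampoo implementations is how often the preconditioners are recomputed. A sketch of tuning that interval, reusing the stand-in `model` from the earlier snippet (the `ScalableShampoo` name and the `preconditioning_compute_steps` argument follow the Google implementation and should be treated as assumptions about this library's exact API):

```python
# Sketch of trading preconditioner freshness for speed; the class name and
# argument are assumptions modeled on Google's scalable Shampoo, not a
# confirmed pytorch_optimizer signature.
from pytorch_optimizer import ScalableShampoo

optimizer = ScalableShampoo(
    model.parameters(),  # `model` as defined in the earlier sketch
    lr=1e-3,
    # Recompute the expensive inverse-root preconditioners only every N
    # steps; larger values amortize the cubic-cost eigendecompositions.
    preconditioning_compute_steps=100,
)
```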
I just swapped out the Nero optimizer in my Lightning AI loop and gave the new Shampoo a try. There is something going on with it, as this card is typically able to do 2 iterations per second on almost anything. The old Shampoo was not fast, but half the iterations per second was about what you'd expect from a second-order optimizer.