Early loss divergence for upcycling #15
Comments
Thanks! I think you can see the peak learning rate by clicking on the model, then on Overview, and looking at the config parameters. It should be the lr value under optimizer.
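For anyone who prefers not to click through the UI, the same value can also be pulled programmatically with the wandb API. This is a minimal sketch: the run path and the exact config key names are placeholders, not the actual OLMoE run identifiers.

```python
# Sketch: reading the optimizer config of a wandb run via the public API.
# The "entity/project/run_id" path and key names below are hypothetical.
import wandb

api = wandb.Api()
run = api.run("entity/project/run_id")  # placeholder path to the OLMoE run

optimizer_cfg = run.config.get("optimizer", {})  # "the lr value under optimizer"
print("peak learning rate:", optimizer_cfg.get("learning_rate") or optimizer_cfg.get("lr"))
```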
Thanks for your reply! I have checked, and it seems that the peak learning rate of the upcycled OLMoE is 4e-4, which is the same as the peak learning rate of the dense model used for upcycling (OLMo-1B). However, this experimental setting differs from the sparse upcycling paper, which recommended using the minimum learning rate of the dense model as the peak learning rate of the upcycled MoE. The paper also noted that upcycling with a higher learning rate may cause instability. May I ask whether the OLMoE team has conducted upcycling experiments with a setting similar to the sparse upcycling paper? Thanks!
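For concreteness, here is a minimal sketch of the two learning-rate settings being compared. Only the 4e-4 peak comes from the discussion above; the schedule shape, warmup length, total steps, and the dense model's minimum learning rate are placeholder assumptions, not values from the OLMoE config.

```python
import math

def cosine_lr(step, total_steps, peak_lr, min_lr, warmup_steps=2000):
    """Cosine decay with linear warmup (illustrative shape, not OLMoE's exact schedule)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Setting used for the upcycled OLMoE per the discussion above: same peak as dense OLMo-1B.
olmoe_peak_lr = 4e-4

# Hypothetical minimum LR the dense run decayed to (placeholder value).
dense_min_lr = 4e-5

# Sparse-upcycling-paper-style setting: reuse the dense model's *minimum* LR
# as the *peak* LR of the upcycled MoE.
paper_style_peak_lr = dense_min_lr

for step in (0, 1000, 5000, 50000):
    print(step,
          cosine_lr(step, 120_000, olmoe_peak_lr, dense_min_lr),
          cosine_lr(step, 120_000, paper_style_peak_lr, paper_style_peak_lr * 0.1))
```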
Those comparisons may not be reliable, as they use a different optimizer (Adafactor) and a different LR schedule (not cosine). It depends on what you mean by a similar setting; the above are two key differences. Others include encoder-decoder vs. decoder-only models, expert-choice routing, the number of experts, etc.
Sorry for the lack of clarity! Rather, I am interested to know whether the OLMoE team has previously conducted upcycling experiments with lower learning rates (using the minimum learning rate of the dense model used for upcycling as the peak learning rate for training the upcycled MoE), and I was curious whether you observed a different outcome in training or performance when using lower learning rates.
We didn't ablate changing the learning rate during upcycling.
Hi, thanks for the great work and for sharing your wandb training logs! After analysing the plots, I have some questions regarding the upcycling experiment done for OLMoE and would greatly appreciate it if you could answer them in any capacity:
I observed that the training loss for the upcycled OLMoE increased over the first 5k steps (~20B tokens) and that the training loss did not recover (to the early loss value of 2.25 at step 300) until around 120k steps. May I ask what peak learning rate was used for training the upcycled OLMoE? And were any other experiments tried to mitigate this early loss divergence?
Thanks!
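For readers unfamiliar with the setup, below is a minimal sketch of what "upcycling" refers to in this thread: each expert of the MoE layer is initialized as a copy of the dense model's MLP, and only the router is freshly initialized. The module names and the token-choice top-k routing are illustrative assumptions, not the actual OLMoE implementation.

```python
# Sketch of sparse upcycling: turn one dense MLP into an MoE layer whose
# experts all start from the dense weights. Not the real OLMoE code.
import copy
import torch
import torch.nn as nn

class UpcycledMoELayer(nn.Module):
    def __init__(self, dense_mlp: nn.Module, d_model: int, n_experts: int, top_k: int = 2):
        super().__init__()
        # Every expert begins as an exact copy of the dense model's MLP.
        self.experts = nn.ModuleList(copy.deepcopy(dense_mlp) for _ in range(n_experts))
        # The router is the only randomly initialized component.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):  # x: [tokens, d_model]
        scores = torch.softmax(self.router(x), dim=-1)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_scores[mask, slot, None] * expert(x[mask])
        return out
```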