Early loss divergence for upcycling #15

Open
yazdayy opened this issue Oct 7, 2024 · 5 comments

@yazdayy commented Oct 7, 2024

Hi, thanks for the great work and for sharing your wandb training logs! After analysing the plots, I have some questions about the upcycling experiment done for OLMoE, and I would greatly appreciate it if you could answer them in any capacity:

I observed that the training loss for the upcycled OLMoE increased over the first 5k steps (~20B tokens) and did not recover to its early value (2.25 at step 300) until around 120k steps. May I ask what peak learning rate was used for training the upcycled OLMoE? And were any other experiments run to try to mitigate this early loss divergence?

[Screenshot: wandb training loss curve for the upcycled OLMoE]
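
For context, here is a rough sketch of the upcycling setup I am referring to: each expert of the MoE layer starts as a copy of the dense model's FFN, with a freshly initialized router on top. The module shapes and helper name below are made up for illustration and are not the actual OLMoE implementation.

```python
import copy
import torch.nn as nn

def upcycle_ffn(dense_ffn: nn.Sequential, num_experts: int = 8) -> nn.ModuleDict:
    """Sparse upcycling of one transformer block's FFN (sketch):
    every expert is an identical copy of the dense FFN,
    while the router/gate is initialized from scratch."""
    hidden_dim = dense_ffn[0].in_features  # assumes the FFN starts with a Linear layer
    experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
    router = nn.Linear(hidden_dim, num_experts, bias=False)  # new, randomly initialized gate
    return nn.ModuleDict({"router": router, "experts": experts})

# Toy dense FFN standing in for the dense model's MLP (real dimensions differ).
dense_ffn = nn.Sequential(nn.Linear(2048, 8192), nn.GELU(), nn.Linear(8192, 2048))
moe_layer = upcycle_ffn(dense_ffn, num_experts=8)
```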

Thanks!

@Muennighoff (Collaborator)

Thanks! I think you can see the peak learning rate by clicking on the model, then on Overview, and looking at the config parameters. It should be the lr value under optimizer.
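
For anyone who prefers to read it programmatically, something like the following should work with the wandb API. The entity/project/run path is a placeholder, and the exact config key may differ, so treat this as a sketch rather than the exact OLMoE config layout.

```python
import wandb

api = wandb.Api()
# Placeholder run path; substitute the actual entity/project/run id from the OLMoE wandb workspace.
run = api.run("entity/project/run_id")

# The peak learning rate lives in the run config under the optimizer settings;
# the exact key name ("lr" vs. "learning_rate") depends on how the config was logged.
optimizer_cfg = run.config.get("optimizer", {})
print(optimizer_cfg.get("lr", optimizer_cfg.get("learning_rate")))
```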

@yazdayy (Author) commented Oct 7, 2024

Thanks for your reply! I have checked and it seems that the peak learning rate of the upcycled OLMoE is 4e-4, which is the same as the peak learning rate of the dense model used for upcycling (OLMo-1B).

However, this experimental setting differs from the sparse upcycling paper, which recommended using the minimum learning rate of the dense model as the peak learning rate of the upcycled MoE:
[Screenshot: the sparse upcycling paper's learning-rate recommendation]
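
To make the recommendation concrete, here is a minimal sketch of what that would look like with a cosine schedule. All numbers are hypothetical and are not taken from the OLMoE or OLMo-1B configs.

```python
import math

# Hypothetical values for illustration only.
dense_peak_lr = 4e-4   # peak LR of the dense pretraining run
dense_min_lr = 4e-5    # final (minimum) LR at the end of the dense cosine decay

def cosine_lr(step, total_steps, peak_lr, min_lr, warmup_steps=2000):
    """Cosine decay from peak_lr down to min_lr, with linear warmup."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Paper-style setting: the upcycled MoE's *peak* LR equals the dense model's *minimum* LR,
# and the MoE then follows its own cosine decay from there.
moe_peak_lr = dense_min_lr
moe_min_lr = dense_min_lr / 10

for step in (0, 2_000, 50_000, 100_000):
    print(step, cosine_lr(step, 100_000, moe_peak_lr, moe_min_lr))
```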

The paper also noted that upcycling with a higher learning rate may cause instability:
[Screenshot: the paper's note on instability at higher learning rates]

May I ask whether the OLMoE team has conducted upcycling experiments with a setting similar to the sparse upcycling paper's?

Thanks!

@Muennighoff (Collaborator)

Those comparisons may not be reliable, as they use a different optimizer (Adafactor) and a different LR schedule (not cosine). It depends on what you mean by a similar setting; the above are two key differences. Others include encoder-decoder vs. decoder-only models, expert choice routing, number of experts, etc.

@yazdayy (Author) commented Oct 11, 2024

Sorry for the lack of clarity! I am mainly interested in whether the OLMoE team has conducted upcycling experiments with lower learning rates (using the minimum learning rate of the dense model used for upcycling as the peak learning rate for training the upcycled MoE), and whether you observed a different outcome in training or performance when using the lower learning rate.

@Muennighoff (Collaborator)

We didn't ablate changing the learning rate during upcycling.
