Early loss divergence for upcycling #15

Open
yazdayy opened this issue Oct 7, 2024 · 5 comments

@yazdayy commented Oct 7, 2024

Hi, thanks for the great work and for sharing your wandb training logs! After analysing the plots, I have some questions about the upcycling experiment done for OLMoE, and I would greatly appreciate it if you could answer them in any capacity:

I observed that the training loss for the upcycled OLMoE increased over the first 5k steps (~20B tokens) and did not recover to its early value (2.25 at step 300) until around 120k steps. May I ask what peak learning rate was used for training the upcycled OLMoE? And were any other experiments run to try to mitigate this early loss divergence?

[Screenshot: wandb training loss curve for the upcycled OLMoE]
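
For context, here is a rough sketch of the upcycling setup I am referring to: each expert of the MoE layer starts as a copy of the dense model's FFN, with a freshly initialized router on top. The module shapes and helper name below are made up for illustration and are not the actual OLMoE implementation.

```python
import copy
import torch.nn as nn

def upcycle_ffn(dense_ffn: nn.Sequential, num_experts: int = 8) -> nn.ModuleDict:
    """Sparse upcycling of one transformer block's FFN (sketch):
    every expert is an identical copy of the dense FFN,
    while the router/gate is initialized from scratch."""
    hidden_dim = dense_ffn[0].in_features  # assumes the FFN starts with a Linear layer
    experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
    router = nn.Linear(hidden_dim, num_experts, bias=False)  # new, randomly initialized gate
    return nn.ModuleDict({"router": router, "experts": experts})

# Toy dense FFN standing in for the dense model's MLP (real dimensions differ).
dense_ffn = nn.Sequential(nn.Linear(2048, 8192), nn.GELU(), nn.Linear(8192, 2048))
moe_layer = upcycle_ffn(dense_ffn, num_experts=8)
```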

Thanks!

@Muennighoff (Collaborator)

Thanks! I think you can see the peak learning rate by clicking on the model, then on Overview, and looking at the config parameters. It should be the lr value under optimizer.
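
For anyone who prefers to read it programmatically, something like the following should work with the wandb API. The entity/project/run path is a placeholder, and the exact config key may differ, so treat this as a sketch rather than the exact OLMoE config layout.

```python
import wandb

api = wandb.Api()
# Placeholder run path; substitute the actual entity/project/run id from the OLMoE wandb workspace.
run = api.run("entity/project/run_id")

# The peak learning rate lives in the run config under the optimizer settings;
# the exact key name ("lr" vs. "learning_rate") depends on how the config was logged.
optimizer_cfg = run.config.get("optimizer", {})
print(optimizer_cfg.get("lr", optimizer_cfg.get("learning_rate")))
```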

@yazdayy (Author) commented Oct 7, 2024

Thanks for your reply! I have checked and it seems that the peak learning rate of the upcycled OLMoE is 4e-4, which is the same as the peak learning rate of the dense model used for upcycling (OLMo-1B).

However, this experimental setting differs from the sparse upcycling paper, which recommended using the minimum learning rate of the dense model as the peak learning rate of the upcycled MoE:
[Screenshot: the sparse upcycling paper's learning-rate recommendation]
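
To make the recommendation concrete, here is a minimal sketch of what that would look like with a cosine schedule. All numbers are hypothetical and are not taken from the OLMoE or OLMo-1B configs.

```python
import math

# Hypothetical values for illustration only.
dense_peak_lr = 4e-4   # peak LR of the dense pretraining run
dense_min_lr = 4e-5    # final (minimum) LR at the end of the dense cosine decay

def cosine_lr(step, total_steps, peak_lr, min_lr, warmup_steps=2000):
    """Cosine decay from peak_lr down to min_lr, with linear warmup."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Paper-style setting: the upcycled MoE's *peak* LR equals the dense model's *minimum* LR,
# and the MoE then follows its own cosine decay from there.
moe_peak_lr = dense_min_lr
moe_min_lr = dense_min_lr / 10

for step in (0, 2_000, 50_000, 100_000):
    print(step, cosine_lr(step, 100_000, moe_peak_lr, moe_min_lr))
```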

The paper also noted that upcycling with a higher learning rate may cause instability:
[Screenshot: the paper's note on instability at higher learning rates]

May I ask whether the OLMoE team has conducted upcycling experiments with a setting similar to the sparse upcycling paper's?

Thanks!

@Muennighoff (Collaborator)

Those comparisons may not be reliable, as they use a different optimizer (Adafactor) and a different LR schedule (not cosine). It depends on what you mean by a similar setting; the above are two key differences. Others include encoder-decoder vs. decoder-only models, expert choice routing, number of experts, etc.

@yazdayy (Author) commented Oct 11, 2024

Sorry for the lack of clarity! I am mainly interested in whether the OLMoE team has conducted upcycling experiments with lower learning rates (using the minimum learning rate of the dense model used for upcycling as the peak learning rate for training the upcycled MoE), and whether you observed a different outcome in training or performance when using the lower learning rate.

@Muennighoff (Collaborator)

We didn't ablate changing the learning rate during upcycling.
