Implementing MoE Sparse Upcycling #9
Thanks for the comment! I've added more details to https://github.com/allenai/OLMoE/blob/main/README.md#other-design-choices; let me know if you still run into problems!
I have run a small demo with a small portion of data. The output path contains checkpoint directories such as step4000 (directory listings omitted here). Is there a script available that can convert the output checkpoint (stepxxx) into a Hugging Face (HF) format model?
I just added that as …
I tried running the model conversion to HF format, but I got "KeyError: transformer.blocks.0.q_norm.weight". I traced this error back and found that the checkpoint you provided (https://huggingface.co/allenai/OLMo-1B-0724-954000steps-unsharded) doesn't contain the parameters related to self-attention (q_norm, k_norm, v_norm, o_proj, etc.); it only includes the parameters related to the experts (FFN, such as ffn.experts.mlp.w1, etc.). Do I need to run another script to merge these parameters, or could you provide a checkpoint that contains all parameters? (Also, could you provide an MoE checkpoint upcycled at 2T tokens, as in Figure 8 of the paper?)
For the upcycling ablation we do not use QK Norm, so just deactivate that. You can take a look at this config: https://wandb.ai/ai2-llm/olmoe/runs/1w3srbb3/overview
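For anyone who hits the same KeyError: below is a minimal sketch of how one might check which attention-related keys the dense checkpoint actually contains before adapting the conversion. It assumes the unsharded checkpoint stores a flat state_dict in a model.pt file; the file name and key substrings are assumptions, not the script's own code.

```python
import torch

# Assumption: the unsharded OLMo checkpoint directory holds a flat state_dict
# saved as model.pt; adjust the path/file name to match your download.
state_dict = torch.load("OLMo-1B-0724-954000steps-unsharded/model.pt", map_location="cpu")

# OLMo 1B (0724) was trained without QK Norm, so q_norm/k_norm keys should be
# absent; any code that indexes them unconditionally will raise a KeyError.
qk_keys = [k for k in state_dict if "q_norm" in k or "k_norm" in k]
attn_keys = [k for k in state_dict if ".att" in k]

print(f"QK Norm keys found: {len(qk_keys)}")      # expected: 0 for this checkpoint
print("attention keys (sample):", attn_keys[:4])
```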
The configuration I used for the aforementioned demo is similar to this one, yet the output safetensors do not contain the parameters related to self-attention. Do you mean that the self-attention parameters (q_norm, k_norm, v_norm, o_proj, etc.) were kept frozen throughout the continued pretraining of the upcycling ablation? In other words, are these parameters identical to those of the dense model?
They were not used in the upcycling because OLMo 1B does not have q_norm, k_norm, or v_norm.
But the OLMoE model has q_norm, k_norm, and v_norm parameters; where did they come from? (Since OLMoE is upcycled from OLMo.)
OLMoE is not upcycled from OLMo, sorry for the confusion. Is it not clear from the paper (https://arxiv.org/abs/2409.02060)?
Sorry, I think I misunderstood this part. Neither the upcycled MoE nor the "training from scratch" MoE in Figure 8 has the same structure as the final released OLMoE version.
Yes, they have slightly different hyperparameters.
Thanks! And in the upcycling experiment (Figure 8), was any other data strategy applied to the 610 billion tokens (such as sampling, data mixing, etc.)? I noticed a new class (IterableDataset) was created to solve the problem of deterministic shuffling.
It is the same dataset as used for OLMo 1B, forwarded to start from the same batch where OLMo 1B finished (via …).
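A generic sketch of that idea, not OLMo's actual dataloader code: with a deterministic shuffle driven by a fixed seed, any run can recompute the same batch order and fast-forward to a given global batch index. The seed and batch counts below are illustrative placeholders.

```python
import numpy as np

def batch_order(num_batches: int, seed: int) -> np.ndarray:
    # Deterministic shuffle: the same seed always yields the same order,
    # so a resumed or upcycled run can recompute it and skip ahead.
    rng = np.random.default_rng(seed)
    return rng.permutation(num_batches)

order = batch_order(num_batches=1_000_000, seed=42)  # illustrative values
start_batch = 500_000  # e.g. the global batch where the dense run stopped
remaining = order[start_batch:]  # the upcycled run consumes batches from here on
```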
Hello OLMoE Authors:
I have read the updates on the sparse upcycling method in the README and tried to implement it. I want to reproduce the sparse upcycling conclusions in your paper, which load OLMo-1B (0724) at 2T tokens.
I downloaded the corresponding checkpoint from Hugging Face, but the HF version OLMo-1B-0724-hf (revision="step954000-tokens2000B") has two safetensors files (model-00001-of-00002.safetensors, model-00002-of-00002.safetensors), and it seems the safetensors.torch.load_file call used in sparsify_ckpt_unsharded.py can't load two safetensors files. So I downloaded OLMo-1B instead, but that version has no "tokens2000B" revision; only "step477000-tokens2001B" is available. Could you please tell me:
Thanks!
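On the two-shard point, here is a minimal sketch of how one might merge a sharded Hugging Face safetensors checkpoint into a single state dict before handing it to the conversion logic. This is a generic pattern rather than the script's own loading code, and the local directory path is a placeholder.

```python
from pathlib import Path
from safetensors.torch import load_file

# Placeholder: a local snapshot of allenai/OLMo-1B-0724-hf (two safetensors shards).
ckpt_dir = Path("OLMo-1B-0724-hf")

# load_file reads one shard at a time, and shards hold disjoint parameter sets,
# so loading each shard and merging the dicts recovers the full state dict.
shards = sorted(ckpt_dir.glob("model-*-of-*.safetensors"))
state_dict = {}
for shard in shards:
    state_dict.update(load_file(shard))

print(f"loaded {len(state_dict)} tensors from {len(shards)} shards")
```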
BTW, when I loaded OLMo-1B (revision="step477000-tokens2001B") with sparsify_ckpt_unsharded.py, the names in the state_dict look like "model.transformer.blocks.4.ff_proj.weight", so the block index sits at position 3 rather than 2, but lines 29 and 51 of the script use block_num = int(key.split(".")[2]).
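On that key-prefix mismatch, a small sketch of one way to parse the block index so it tolerates the extra "model." prefix; the helper name is made up for illustration and is not the script's own code.

```python
def block_num_from_key(key: str) -> int:
    """Return the transformer block index from an OLMo parameter name.

    Handles both "transformer.blocks.4.ff_proj.weight" and the prefixed
    form "model.transformer.blocks.4.ff_proj.weight".
    """
    parts = key.split(".")
    # Drop an optional leading "model." so the block index is always parts[2].
    if parts[0] == "model":
        parts = parts[1:]
    return int(parts[2])


assert block_num_from_key("transformer.blocks.4.ff_proj.weight") == 4
assert block_num_from_key("model.transformer.blocks.4.ff_proj.weight") == 4
```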