Small correction for YaRN-Mistral model #2
Thank you for the detailed explanation! YaRN is definitely great work and I drew a lot of inspiration from it! I also totally understand the timeline: Needle-in-a-Haystack is quite recent (and passkey retrieval may not be as informative, but it was the best evaluation available). I also believe that, given length-upsampled data, YaRN-Mistral would perform just as well or better. I'll update the paper to incorporate this information. Could you also mention this in your paper/GitHub so I can refer to it? Additionally, I wonder what the differences are between the 64K and 128K YaRN-Mistral models. Are they both finetuned on 16K data, but one extrapolates to 64K and the other to 128K? And what about YaRN-LLaMA 64K/128K? Thanks!
The Llama YaRN models were trained on 64k data, but with a higher YaRN scaling factor (similar to using a higher base in ABF), such that the final 128k model is able to extrapolate from 64k data to a 128k context size. Non-linearity in RoPE interpolation is definitely the key to unlocking extrapolation capabilities (train short, test long) for RoPE models; we just have to find the best one.
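For readers unfamiliar with what the scaling factor and the non-linear interpolation look like in practice, here is a minimal sketch of YaRN-style "NTK-by-parts" frequency scaling. The function name and the defaults (16k original context, scale factor 8, ramp cutoffs 1 and 32) are illustrative assumptions, not the exact settings used for these checkpoints; the official implementation is in the YaRN repository.

```python
# Minimal sketch of YaRN-style "NTK-by-parts" RoPE frequency scaling.
# Parameter names and defaults are illustrative, not the released configs.
import math
import torch

def yarn_rope_frequencies(
    dim: int,               # per-head dimension (must be even)
    base: float = 10000.0,  # RoPE base
    scale: float = 8.0,     # context extension factor s = L_new / L_orig (assumed)
    orig_ctx: int = 16384,  # original (pre-extension) context length (assumed)
    alpha: float = 1.0,     # ramp low cutoff, in rotations
    beta: float = 32.0,     # ramp high cutoff, in rotations
) -> torch.Tensor:
    """Return per-dimension inverse frequencies after YaRN interpolation."""
    # Standard RoPE inverse frequencies: theta_i = base^(-2i/dim)
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # How many full rotations each dimension completes over the original context
    rotations = orig_ctx * inv_freq / (2 * math.pi)
    # Ramp: 0 -> interpolate fully (low-frequency dims), 1 -> leave untouched (high-frequency dims)
    gamma = torch.clamp((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    # Blend plain position interpolation (divide by scale) with no interpolation;
    # this blend is the "non-linearity" mentioned above
    return inv_freq * (gamma + (1.0 - gamma) / scale)
```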
Hello! Author of YaRN here. First of all, thank you for this very comprehensive paper on data engineering challenges for long-context LLMs. It will certainly be very useful to the research community in the quest to train better and more robust long-context models!
However, there has been some confusion about how the YaRN Mistral 7B 128K model was trained (Fig. 1 of the paper): this model was trained on a 16k-context-length dataset without length upsampling (the dataset used is a derivative of what TogetherAI used to train their 32k model, but chunked to 16k instead). The Llama 2 7B 128K model is the one that was trained on PG19, chunked into a context of 64k (not 128k), which I think would be a more appropriate comparison; there are simply too many confounding variables with our Mistral YaRN models.
Also, the reason we were able to get away with training on such a small context (16k) is that YaRN exhibits the behaviour necessary for context length extrapolation even without finetuning (albeit not very well, and only for small extension scale ratios).
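To illustrate how such an extension can be applied without any finetuning, here is one common way implementations fold the scaled frequencies and YaRN's attention temperature (the paper's recommendation is roughly sqrt(1/t) = 0.1 ln(s) + 1) into the rotary cos/sin cache. This reuses `yarn_rope_frequencies` from the sketch earlier; names and sizes are illustrative assumptions.

```python
# Continuing the sketch above: build a cos/sin cache for the extended context
# without any finetuning. Names and sizes are illustrative.
import math
import torch

def yarn_cos_sin_cache(inv_freq: torch.Tensor, max_positions: int, scale: float):
    # Multiplying both cos and sin by mscale scales q.k by mscale^2,
    # matching the paper's recommended sqrt(1/t) = 0.1 * ln(s) + 1
    mscale = 0.1 * math.log(scale) + 1.0
    t = torch.arange(max_positions, dtype=torch.float32)
    freqs = torch.outer(t, inv_freq)          # (max_positions, dim/2)
    emb = torch.cat((freqs, freqs), dim=-1)   # (max_positions, dim)
    return emb.cos() * mscale, emb.sin() * mscale

# Example: extend an assumed 16k-trained model to 128k positions (16384 * 8)
inv_freq = yarn_rope_frequencies(dim=128, scale=8.0, orig_ctx=16384)
cos, sin = yarn_cos_sin_cache(inv_freq, max_positions=131072, scale=8.0)
```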
Unfortunately, the passkey evaluation that we used was much easier than the Needle-in-a-Haystack test (which didn't exist back then), so we originally did not notice any degradation of long-context capabilities when shortening the dataset from 128k to 64k and then to 16k (cheaper to train). With the newer Needle-in-a-Haystack tests, however, the degradation is apparent. We will certainly be trying out the new methods outlined in this paper for future finetunes!
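For context on why passkey retrieval is the easier test, here is a minimal sketch of how such a prompt is typically constructed: a single fact hidden in highly repetitive filler, which is much easier to spot than a needle buried in diverse natural text. The filler sentences, prompt wording, and sizing heuristic are illustrative, not the exact evaluation used for YaRN.

```python
# Minimal sketch of a passkey-retrieval prompt (illustrative wording and sizing).
import random

def build_passkey_prompt(target_tokens_approx: int, passkey: int) -> str:
    filler = ("The grass is green. The sky is blue. The sun is yellow. "
              "Here we go. There and back again. ")
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key. "
    # Roughly size the haystack by repeating filler sentences (token count is approximate)
    n_repeats = max(1, target_tokens_approx // 20)
    chunks = [filler] * n_repeats
    chunks.insert(random.randint(0, n_repeats), needle)  # hide the needle at a random depth
    return "".join(chunks) + "What is the pass key? The pass key is"

print(build_passkey_prompt(4000, passkey=68239)[:200])
```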