
Eager/SDPA attention has lower results compared to Flash Attention in SimCSE stage #144

Open
ThonyPan opened this issue Sep 9, 2024 · 2 comments

Comments


ThonyPan commented Sep 9, 2024

Hi @vaibhavad,

I tried to reproduce the SimCSE stage of the framework. When using flash attention, the results are as good as reported. However, when training with eager or SDPA attention, the results drop significantly. What might be the reason?
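For context, switching between these implementations is typically done through the `attn_implementation` argument when loading the model with Hugging Face transformers; a minimal sketch (the model name below is just a placeholder, not my exact config):

```python
import torch
from transformers import AutoModelForCausalLM

# Choose the attention backend: "flash_attention_2", "sdpa", or "eager".
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # placeholder base model, not from this issue
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
)
```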

In your code, you raise a warning that says "LLM2Vec models were trained with flash attention enabled. For optimal performance, please install the flash_attn package with `pip install flash-attn --no-build-isolation`." Does that mean that if I also train the MNTP stage with eager/SDPA attention, the performance would be on par with flash attention?

Thank you!

@vaibhavad
Collaborator

Hi @ThonyPan,

Unfortunately, we have not run experiments comparing different attention implementations, so I cannot comment on the performance differences. We chose flash attention because it is the fastest, and latency is crucial for both training and inference.

@TianBaoGe

Hi @ThonyPan,

Did you manage to reproduce the MNTP+SimCSE results? I have successfully reproduced the Sheared-LLaMA-1.3B SimCSE results, but my MNTP+SimCSE results are consistently lower than those reported in the paper. Could you share your training details?
