[Question]: Discrepancy in Pre-filling Time and Memory Consumption on Single A100 #84
Describe the issue
I came across the following statement in your paper:
"When serving LLaMA-3-8B on a single A100 machine, the model would keep users waiting for 6 minutes to finish the pre-filling stage given a prompt of 300K tokens, and this number increases to 30 minutes for a prompt of 1M tokens."
However, I am also running on a single A100 (80 GB), using Hugging Face's implementation of LLaMA with SDPA attention. With a 50K-token context, the pre-filling time is around 2.5 seconds, but with a 100K-token context I run into an out-of-memory error.
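For concreteness, here is a minimal sketch of this kind of measurement — the checkpoint ID and the synthetic random-token prompt are illustrative assumptions, not the exact script; pre-filling is timed as a single forward pass over the whole prompt:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="sdpa",  # SDPA attention, as described above
    device_map="cuda",
)

# Synthetic 50K-token prompt; real text would tokenize to a similar length.
input_ids = torch.randint(0, tokenizer.vocab_size, (1, 50_000), device="cuda")

torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    model(input_ids)  # pre-filling = one forward pass over the full prompt
torch.cuda.synchronize()
print(f"pre-fill time: {time.perf_counter() - start:.2f} s")
```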
Could you clarify why there is such a significant discrepancy between your results and mine? Is there something I might be missing or misunderstanding?
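On the out-of-memory error specifically, a back-of-envelope estimate — assuming fp16 weights, the public LLaMA-3-8B config (32 layers, 8 KV heads, head dim 128, ~128K vocabulary), and the float32 logits upcast that Hugging Face's LlamaForCausalLM applies in the versions I'm aware of — suggests the logits tensor alone could push a 100K-token pass past 80 GB:

```python
# Rough peak-memory estimate for a single 100K-token pre-fill pass with
# LLaMA-3-8B in fp16. Architecture numbers come from the public config;
# the float32 logits upcast is an assumption about the transformers version.
seq_len, vocab_size = 100_000, 128_256
n_layers, n_kv_heads, head_dim = 32, 8, 128

weights_gb = 8e9 * 2 / 1e9                                          # ~16 GB fp16 weights
kv_gb = n_layers * 2 * n_kv_heads * head_dim * seq_len * 2 / 1e9    # ~13 GB KV cache
logits_gb = seq_len * vocab_size * 4 / 1e9                          # ~51 GB fp32 logits

print(f"{weights_gb + kv_gb + logits_gb:.0f} GB before activations")  # ~80 GB
```

If that accounting holds, a 100K-token pre-fill would exhaust an 80 GB card regardless of the attention implementation.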
Thanks for your help!
Comments

Hi @lepangdan, Thanks for your question! […] Thanks again for raising these points, and please let me know if you have further questions!

Hi @iofu728,
default: […]
after adding augment: […]
Looking forward to your reply.

Hi @lepangdan, Thank you for your feedback. The results reported in the paper were obtained using […]