You can disable
Sorry, my bad. The slowdown of the 65B was caused by the chat script launching whisper.cpp (with the old ggml) in the background.
Hi,
Has anyone experienced a significant slowdown on LLaMA 65B since mmap was deployed? It does load super fast, but token generation has dropped to a few minutes per token on a 64GB M1 Max, while leaving 36GB of RAM totally untouched.
On the other hand, performance of 30B and below is fine.