Replies: 2 comments 2 replies
-
This is the source for the Mac M1 Air's reading speed limit:
-
This is due to the fact that the entire model needs to be read in order to use it. The 30B q4_0 model needs 15 GB of RAM, so the only real option is to upgrade to a device with 32 GB of RAM (16 GB might also work). As for the SSD speed: I think the reason you are not seeing a few GB/s of read throughput is that llama.cpp is not reading contiguous SSD areas, so it is much slower than what benchmarks show. Try looking up the M1's SSD read speed for non-contiguous data.
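The contiguous-vs-scattered point is easy to see with a small benchmark. This is a hypothetical sketch of my own, not llama.cpp code: it reads the same file either at sequential offsets or at random offsets and reports throughput, which is roughly the difference between a drive benchmark and scattered weight accesses.

```python
import os
import random
import time

def read_throughput(path, block=1 << 20, blocks=256, sequential=True):
    """Read `blocks` chunks of `block` bytes and return MB/s.

    Sequential mode reads contiguous offsets (what SSD benchmarks
    measure); random mode jumps around the file, which is closer to
    scattered weight accesses and is usually much slower on real drives.
    """
    size = os.path.getsize(path)
    if sequential:
        offsets = [i * block for i in range(blocks)]
    else:
        offsets = [random.randrange(0, max(size - block, 1))
                   for _ in range(blocks)]
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        total = 0
        for off in offsets:
            total += len(os.pread(fd, block, off))
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return total / 1e6 / elapsed
```

On a cached file both modes will look fast, so for a meaningful number the file should be larger than RAM (a multi-GB model file qualifies).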
-
This is very much related to the thread asking why LLaMA 30B uses only 5.8 GB of RAM. After seeing that thread, I got excited to see how the 30B LLaMA model would run on my poor Mac Air M1 with 8 GB of RAM. Well, it works, but excruciatingly slowly: it takes just under a minute to generate a single token. I then checked Activity Monitor and, you guessed it, it runs with about 3.9 GB of RAM while reading the disk at ~900 MB/s. I ran the same model on another computer; RAM usage was similarly low, but the response was much quicker (about 2 seconds per token) with a 2500 MB/s read speed. It seems that disk read speed correlates positively with the responsiveness of the 30B model. I think the disk read speed of the Mac Air M1 can reach about 2000 MB/s, which might make 30B actually viable on this 8 GB machine if the software can utilize the drive's maximum read speed.
I am not knowledgeable in this field. Is what I am proposing feasible?
(On Mac Air M1 8gb):
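The correlation above can be sanity-checked with rough arithmetic. This is my own hedged estimate: the ~20 GB file size and the assumption that every token forces a full re-read of the weights from disk are approximations, not measurements from llama.cpp, so treat the result as a lower bound on latency rather than a prediction.

```python
def seconds_per_token(model_bytes, read_mb_per_s):
    """Lower bound on per-token latency if all weights must be
    streamed from disk for each generated token."""
    return model_bytes / (read_mb_per_s * 1e6)

# 30B q4_0 on disk is roughly 20 GB (approximate figure).
MODEL_BYTES = 20e9

print(seconds_per_token(MODEL_BYTES, 900))   # at the M1 Air's observed ~900 MB/s
print(seconds_per_token(MODEL_BYTES, 2500))  # at the faster machine's 2500 MB/s
```

The faster machine's observed ~2 s/token beats this bound, which suggests a good chunk of the weights stays cached in RAM between tokens; the estimate only shows the direction of the effect, not the exact numbers.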