You can disable
Sorry, my bad. The slowdown of the 65B was caused by the chat script launching whisper.cpp (with the old ggml) in the background.
Hi,
Has anyone experienced a significant slowdown on LLaMA 65B since mmap was deployed? It does load super fast, but token generation has dropped to a few minutes per token on a 64GB M1 Max, while leaving 36GB of RAM totally untouched.
On the other hand, performance of 30B and below is fine.