Describe the bug
After launching a Qwen model with the command launch --model-engine llama.cpp --model-name qwen2-instruct --size-in-billions 0_5 --model-format ggufv2 --quantization q4_k_m --n_ctx 1024, testing it through the web UI fails on the very first message with the following error:
llama_model_loader: loaded meta data with 26 key-value pairs and 290 tensors from /opt/xinference/cache/qwen2-instruct-ggufv2-0_5b/qwen2-0_5b-instruct-q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = qwen2-0_5b-instruct
llama_model_loader: - kv 2: qwen2.block_count u32 = 24
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 896
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 4864
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 14
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 15
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 19: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - kv 22: quantize.imatrix.file str = ../Qwen2/gguf/qwen2-0_5b-imatrix/imat...
llama_model_loader: - kv 23: quantize.imatrix.dataset str = ../sft_2406.txt
llama_model_loader: - kv 24: quantize.imatrix.entries_count i32 = 168
llama_model_loader: - kv 25: quantize.imatrix.chunks_count i32 = 1937
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type q5_0: 132 tensors
llama_model_loader: - type q8_0: 13 tensors
llama_model_loader: - type q4_K: 12 tensors
llama_model_loader: - type q6_K: 12 tensors
llm_load_vocab: special tokens cache size = 293
llm_load_vocab: token to piece cache size = 0.9338 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 896
llm_load_print_meta: n_head = 14
llm_load_print_meta: n_head_kv = 2
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: n_embd_k_gqa = 128
llm_load_print_meta: n_embd_v_gqa = 128
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 4864
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 494.03 M
llm_load_print_meta: model size = 373.71 MiB (6.35 BPW)
llm_load_print_meta: general.name = qwen2-0_5b-instruct
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.13 MiB
llm_load_tensors: CPU buffer size = 511.65 MiB
.................................................
llama_new_context_with_model: n_ctx = 1024
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 12.00 MiB
llama_new_context_with_model: KV self size = 12.00 MiB, K (f16): 6.00 MiB, V (f16): 6.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.00 MiB
llama_new_context_with_model: CPU compute buffer size = 298.50 MiB
llama_new_context_with_model: graph nodes = 846
llama_new_context_with_model: graph splits = 1
AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 |
Model metadata: {'quantize.imatrix.entries_count': '168', 'quantize.imatrix.dataset': '../sft_2406.txt', 'quantize.imatrix.chunks_count': '1937', 'quantize.imatrix.file': '../Qwen2/gguf/qwen2-0_5b-imatrix/imatrix.dat', 'tokenizer.ggml.add_bos_token': 'false', 'tokenizer.ggml.bos_token_id': '151643', 'general.architecture': 'qwen2', 'qwen2.block_count': '24', 'qwen2.context_length': '32768', 'tokenizer.chat_template': "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}", 'qwen2.attention.head_count_kv': '2', 'tokenizer.ggml.padding_token_id': '151643', 'qwen2.embedding_length': '896', 'qwen2.attention.layer_norm_rms_epsilon': '0.000001', 'qwen2.attention.head_count': '14', 'tokenizer.ggml.eos_token_id': '151645', 'qwen2.rope.freq_base': '1000000.000000', 'general.file_type': '15', 'general.quantization_version': '2', 'qwen2.feed_forward_length': '4864', 'tokenizer.ggml.model': 'gpt2', 'general.name': 'qwen2-0_5b-instruct', 'tokenizer.ggml.pre': 'qwen2'}
Available chat formats from metadata: chat_template.default
Using gguf chat template: {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
You are a helpful assistant.<|im_end|>
' }}{% endif %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}
Using chat eos_token: <|im_end|>
Using chat bos_token: <|endoftext|>
2024-07-04 05:45:18,890 xinference.api.restful_api 1090728 ERROR Chat completion stream got an error: [address=0.0.0.0:45901, pid=1117508] NULL pointer access
Traceback (most recent call last):
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xinference/api/restful_api.py", line 1537, in stream_results
async for item in iterator:
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xoscar/api.py", line 340, in anext
return await self._actor_ref.xoscar_next(self._uid)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xoscar/backends/context.py", line 227, in send
return self._process_result_message(result)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xoscar/backends/pool.py", line 659, in send
result = await self._run_coro(message.message_id, coro)
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
return await coro
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xoscar/api.py", line 384, in on_receive
return await super().on_receive(message) # type: ignore
^^^^^^^^^^^^^^^^^
File "xoscar/core.pyx", line 558, in on_receive
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive
async with self._lock:
^^^^^^^^^^^^^^^^^
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive
with debug_async_timeout('actor_lock_timeout',
^^^^^^^^^^^^^^^^^
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.on_receive
result = await result
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xoscar/api.py", line 431, in xoscar_next
raise e
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xoscar/api.py", line 417, in xoscar_next
r = await asyncio.to_thread(_wrapper, gen)
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/asyncio/threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xoscar/api.py", line 402, in _wrapper
return next(_gen)
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xinference/core/model.py", line 301, in _to_json_generator
for v in gen:
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xinference/model/llm/utils.py", line 553, in _to_chat_completion_chunks
for i, chunk in enumerate(chunks):
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xinference/model/llm/ggml/llamacpp.py", line 214, in generator_wrapper
for index, _completion_chunk in enumerate(
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/llama_cpp/llama.py", line 1132, in _create_completion
for token in self.generate(
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/llama_cpp/llama.py", line 740, in generate
self.eval(tokens)
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/llama_cpp/llama.py", line 590, in eval
logits = np.ctypeslib.as_array(self._ctx.get_logits(), shape=(rows * cols, ))
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/numpy/ctypeslib.py", line 522, in as_array
obj = ctypes.cast(obj, p_arr_type).contents
^^^^^^^^^^^^^^^^^
ValueError: [address=0.0.0.0:45901, pid=1117508] NULL pointer access
Traceback (most recent call last):
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/gradio/queueing.py", line 527, in process_events
response = await route_utils.call_process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/gradio/route_utils.py", line 261, in call_process_api
output = await app.get_blocks().process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/gradio/blocks.py", line 1786, in process_api
result = await self.call_function(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/gradio/blocks.py", line 1350, in call_function
prediction = await utils.async_iteration(iterator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/gradio/utils.py", line 583, in async_iteration
return await iterator.__anext__()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/gradio/utils.py", line 709, in asyncgen_wrapper
response = await iterator.__anext__()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/gradio/chat_interface.py", line 545, in _stream_fn
first_response = await async_iteration(generator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/gradio/utils.py", line 583, in async_iteration
return await iterator.__anext__()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/gradio/utils.py", line 576, in anext
return await anyio.to_thread.run_sync(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/anyio/to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread
return await future
^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 859, in run
result = context.run(func, *args)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/gradio/utils.py", line 559, in run_sync_iterator_async
return next(iterator)
^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xinference/core/chat_interface.py", line 124, in generate_wrapper
for chunk in model.chat(
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xinference/client/common.py", line 51, in streaming_response_iterator
raise Exception(str(error))
Exception: [address=0.0.0.0:45901, pid=1117508] NULL pointer access
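The first traceback ends inside llama_cpp's Llama.eval, where the pointer returned by get_logits() is NULL. Below is a minimal isolation sketch, assuming llama-cpp-python 0.2.81's standard API and reusing the GGUF path from the log above, to check whether the same crash also happens without Xinference in the loop; the prompt is a placeholder, not taken from the report.

```python
# Isolation sketch (assumption, not verified here): load the same GGUF file directly
# with llama-cpp-python and stream one chat completion, to see whether
# eval()/get_logits() also fails outside Xinference. The prompt is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="/opt/xinference/cache/qwen2-instruct-ggufv2-0_5b/qwen2-0_5b-instruct-q4_k_m.gguf",
    n_ctx=1024,
)
for chunk in llm.create_chat_completion(
    messages=[{"role": "user", "content": "hello"}],  # placeholder first message
    stream=True,
):
    # each chunk follows the OpenAI streaming format; the first delta may have no content
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```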
Sending a second message then fails with:
2024-07-04 05:50:08,556 xinference.api.restful_api 1090728 ERROR Chat completion stream got an error: [address=0.0.0.0:45901, pid=1117508] Parallel generation is not supported by ggml.
Traceback (most recent call last):
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xinference/api/restful_api.py", line 1527, in stream_results
iterator = await model.chat(
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xoscar/backends/context.py", line 227, in send
return self._process_result_message(result)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xoscar/backends/pool.py", line 659, in send
result = await self._run_coro(message.message_id, coro)
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
return await coro
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xoscar/api.py", line 384, in on_receive
return await super().on_receive(message) # type: ignore
^^^^^^^^^^^^^^^^^
File "xoscar/core.pyx", line 558, in on_receive
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive
async with self._lock:
^^^^^^^^^^^^^^^^^
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive
with debug_async_timeout('actor_lock_timeout',
^^^^^^^^^^^^^^^^^
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.on_receive
result = await result
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xinference/core/utils.py", line 45, in wrapped
ret = await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xinference/core/model.py", line 87, in wrapped_func
ret = await fn(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xoscar/api.py", line 462, in _wrapper
r = await func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xinference/core/model.py", line 488, in chat
response = await self._call_wrapper(
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xinference/core/model.py", line 111, in _async_wrapper
return await fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xinference/core/model.py", line 380, in _call_wrapper
raise Exception("Parallel generation is not supported by ggml.")
^^^^^^^^^^^^^^^^^
Exception: [address=0.0.0.0:45901, pid=1117508] Parallel generation is not supported by ggml.
Traceback (most recent call last):
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/gradio/queueing.py", line 527, in process_events
response = await route_utils.call_process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/gradio/route_utils.py", line 261, in call_process_api
output = await app.get_blocks().process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/gradio/blocks.py", line 1786, in process_api
result = await self.call_function(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/gradio/blocks.py", line 1350, in call_function
prediction = await utils.async_iteration(iterator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/gradio/utils.py", line 583, in async_iteration
return await iterator.__anext__()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/gradio/utils.py", line 709, in asyncgen_wrapper
response = await iterator.__anext__()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/gradio/chat_interface.py", line 545, in _stream_fn
first_response = await async_iteration(generator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/gradio/utils.py", line 583, in async_iteration
return await iterator.__anext__()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/gradio/utils.py", line 576, in anext
return await anyio.to_thread.run_sync(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/anyio/to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread
return await future
^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 859, in run
result = context.run(func, *args)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/gradio/utils.py", line 559, in run_sync_iterator_async
return next(iterator)
^^^^^^^^^^^^^^
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xinference/core/chat_interface.py", line 124, in generate_wrapper
for chunk in model.chat(
File "/root/miniconda3/envs/xinference-311/lib/python3.11/site-packages/xinference/client/common.py", line 51, in streaming_response_iterator
raise Exception(str(error))
Exception: [address=0.0.0.0:45901, pid=1117508] Parallel generation is not supported by ggml.
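The second error presumably follows from the first: the crashed streaming request appears to leave its generator registered in the model actor, so the llama.cpp backend rejects the next request as parallel generation. A hedged recovery sketch, assuming the standard Xinference Python client API; the endpoint and model UID are placeholders, and the launch arguments mirror the original command:

```python
# Recovery sketch (assumption, not from the report): terminate and relaunch the model
# so the stuck generator from the failed first request is dropped.
# Endpoint and model UID are placeholders.
from xinference.client import Client

client = Client("http://127.0.0.1:9997")            # placeholder endpoint
client.terminate_model(model_uid="qwen2-instruct")  # placeholder model UID
client.launch_model(
    model_name="qwen2-instruct",
    model_engine="llama.cpp",
    model_format="ggufv2",
    model_size_in_billions="0_5",
    quantization="q4_k_m",
    n_ctx=1024,
)
```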
To Reproduce
To help us reproduce this bug, please provide the information below:
Python version: 3.11
Xinference version: 0.12.3
llama_cpp_python version: 0.2.81
OS version: Ubuntu 20.04, aarch64
Kernel: 5.4.0-125-generic