Code autocompletion with Qwen2.5 7B base and VLLM outputs garbage results. The prefix in the prompt_template is not being used. #3372
Labels
area:autocomplete
Relates to the auto complete feature
ide:vscode
Relates specifically to VS Code extension
kind:bug
Indicates an unexpected problem or unintended behavior
"needs-triage"
Before submitting your bug report
Relevant environment info
Description
Hello. When I set up a vLLM server and connected it to Continue, I started getting garbage autocompletion results.
First, I started the server with the usual command:
CUDA_VISIBLE_DEVICES=0 VLLM_ATTENTION_BACKEND='FLASHINFER' VLLM_USE_FLASHINFER_SAMPLER=1 python -m vllm.entrypoints.openai.api_server --host *** --port *** --model /data/models/Qwen2.5-Coder-7B --trust-remote-code --served-model-name Qwen2.5-Coder-7B --gpu_memory_utilization 0.3 --quantization fp8 --max-model-len 8192 --enable-prefix-caching --disable-log-stats
Here are the main reasons for the garbage results:
As you can see, the <|fim_prefix|> token required by the specified prompt template is completely missing from the beginning of the prompt.
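For reference, a FIM prompt the base model can actually complete has to start with that token. A quick sanity check against vLLM's raw /v1/completions endpoint looks roughly like this (host, port and the code snippet are placeholders; Qwen2.5-Coder's FIM layout is prefix, suffix, then <|fim_middle|>):

# Placeholder host/port; the prompt follows Qwen2.5-Coder's prefix/suffix/middle FIM layout.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen2.5-Coder-7B",
        "prompt": "<|fim_prefix|>def fib(n):\n    <|fim_suffix|>\n\nprint(fib(10))<|fim_middle|>",
        "max_tokens": 64,
        "temperature": 0
      }'

If this returns a sensible middle chunk while the editor still gets garbage, the problem is in how the prompt is assembled, not in the model or the server.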
Here's what I see in the server logs:
As you can see, it adds a number of chat tokens and system messages that simply do not apply to a base model.
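To illustrate the mismatch, the chat endpoint renders something of this general ChatML shape (an illustration of the structure, not a verbatim log; the exact system text depends on the default template that gets applied):

<|im_start|>system
...default system prompt...<|im_end|>
<|im_start|>user
...the code prefix from the editor...<|im_end|>
<|im_start|>assistant

The base model is not tuned to continue this kind of chat wrapper; for autocompletion it expects the raw FIM prompt shown earlier.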
For these reasons, code autocompletion breaks completely and produces garbage results. Here's an example:
Temporary Solution:
I slightly modified the model's chat_template so that it automatically inserts the <|fim_prefix|> token at the beginning and drops all the other unnecessary tokens. Here's how the vLLM server should be started (note that the quotes inside the template have to be escaped so it survives shell quoting):
CUDA_VISIBLE_DEVICES=0 VLLM_ATTENTION_BACKEND='FLASHINFER' VLLM_USE_FLASHINFER_SAMPLER=1 python -m vllm.entrypoints.openai.api_server --host *** --port *** --model /data/models/Qwen2.5-Coder-7B --trust-remote-code --served-model-name Qwen2.5-Coder-7B --gpu_memory_utilization 0.3 --quantization fp8 --max-model-len 8192 --enable-prefix-caching --disable-log-stats --chat-template "{%- for message in messages %}{%- if (message.role == 'user') or (message.role == 'system') or (message.role == 'assistant' and not message.tool_calls) %}{{- ('' if message.content.startswith('<|fim_prefix|>') else '<|fim_prefix|>') + message.content }}{%- elif message.role == 'assistant' %}{{- '<|fim_prefix|>' }}{%- if message.content %}{{- message.content }}{%- endif %}{%- for tool_call in message.tool_calls %}{%- if tool_call.function is defined %}{%- set tool_call = tool_call.function %}{%- endif %}{{- '<tool_call>{\"name\": \"' }}{{- tool_call.name }}{{- '\", \"arguments\": ' }}{{- tool_call.arguments | tojson }}{{- '}</tool_call>' }}{%- endfor %}{%- elif message.role == 'tool' %}{%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != 'tool') %}{{- '<|fim_prefix|>' }}{%- endif %}{{- '<tool_response>' }}{{- message.content }}{{- '</tool_response>' }}{%- endif %}{%- endfor %}"
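If the inline escaping gets too awkward, --chat-template also accepts a path to a Jinja file, so the same template can be written to a file verbatim and passed by name (the file path below is just an example):

# A quoted heredoc ('EOF') passes the template through with no shell escaping.
cat > /data/templates/qwen25_fim_prefix.jinja <<'EOF'
{%- for message in messages %}{%- if (message.role == 'user') or (message.role == 'system') or (message.role == 'assistant' and not message.tool_calls) %}{{- ('' if message.content.startswith('<|fim_prefix|>') else '<|fim_prefix|>') + message.content }}{%- elif message.role == 'assistant' %}{{- '<|fim_prefix|>' }}{%- if message.content %}{{- message.content }}{%- endif %}{%- for tool_call in message.tool_calls %}{%- if tool_call.function is defined %}{%- set tool_call = tool_call.function %}{%- endif %}{{- '<tool_call>{"name": "' }}{{- tool_call.name }}{{- '", "arguments": ' }}{{- tool_call.arguments | tojson }}{{- '}</tool_call>' }}{%- endfor %}{%- elif message.role == 'tool' %}{%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != 'tool') %}{{- '<|fim_prefix|>' }}{%- endif %}{{- '<tool_response>' }}{{- message.content }}{{- '</tool_response>' }}{%- endif %}{%- endfor %}
EOF
# ...then start the server with:  --chat-template /data/templates/qwen25_fim_prefix.jinja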
The main task of this chat template is simply to insert the missing token. Here are the results after the change:
After that, autocompletion started producing high-quality results:
Yes, this is a crude solution to the problem. Continue needs to fix this bug.
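For anyone who wants to double-check the workaround outside the IDE, a minimal request against the chat endpoint (placeholder host/port again) should now be rendered with the <|fim_prefix|> token and come back as a normal code continuation:

# With the modified chat template, the user message below is rendered as
# "<|fim_prefix|>def add(a, b):\n    return" and the base model simply continues the code.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen2.5-Coder-7B",
        "messages": [{"role": "user", "content": "def add(a, b):\n    return"}],
        "max_tokens": 32,
        "temperature": 0
      }'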
To reproduce
No response
Log output
No response