Code autocompletion with Qwen2.5 7B base and VLLM outputs garbage results. The prefix in the prompt_template is not being used. #3372
Labels
area:autocomplete
Relates to the auto complete feature
ide:vscode
Relates specifically to VS Code extension
kind:bug
Indicates an unexpected problem or unintended behavior
"needs-triage"
Before submitting your bug report
Relevant environment info
Description
Hello. When I set up a vLLM server and connected it to Continue, I started getting garbage autocompletion results.
First, I started the server with the usual command:
CUDA_VISIBLE_DEVICES=0 VLLM_ATTENTION_BACKEND='FLASHINFER' VLLM_USE_FLASHINFER_SAMPLER=1 python -m vllm.entrypoints.openai.api_server --host *** --port *** --model /data/models/Qwen2.5-Coder-7B --trust-remote-code --served-model-name Qwen2.5-Coder-7B --gpu_memory_utilization 0.3 --quantization fp8 --max-model-len 8192 --enable-prefix-caching --disable-log-stats
Here are the main reasons for the garbage results:
As you can see, the <|fim_prefix|> token required by the specified prompt template is completely missing from the beginning of the prompt.
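For reference, a FIM prompt the base model can actually complete has to start with that token. A quick sanity check against vLLM's raw /v1/completions endpoint looks roughly like this (host, port and the code snippet are placeholders; Qwen2.5-Coder's FIM layout is prefix, suffix, then <|fim_middle|>):

# Placeholder host/port; the prompt follows Qwen2.5-Coder's prefix/suffix/middle FIM layout.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen2.5-Coder-7B",
        "prompt": "<|fim_prefix|>def fib(n):\n    <|fim_suffix|>\n\nprint(fib(10))<|fim_middle|>",
        "max_tokens": 64,
        "temperature": 0
      }'

If this returns a sensible middle chunk while the editor still gets garbage, the problem is in how the prompt is assembled, not in the model or the server.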
Here's what I see in the server logs:
As you can see, it adds a number of chat tokens and system messages that simply do not apply to a base model.
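To illustrate the mismatch, the chat endpoint renders something of this general ChatML shape (an illustration of the structure, not a verbatim log; the exact system text depends on the default template that gets applied):

<|im_start|>system
...default system prompt...<|im_end|>
<|im_start|>user
...the code prefix from the editor...<|im_end|>
<|im_start|>assistant

The base model is not tuned to continue this kind of chat wrapper; for autocompletion it expects the raw FIM prompt shown earlier.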
For these reasons, code autocompletion breaks completely and produces garbage results. Here's an example:
Temporary Solution:
I slightly modified the model's chat_template so that it automatically inserts the <|fim_prefix|> token at the beginning and drops all the other unnecessary tokens. Here's how the vLLM server should be started (note that the quotes inside the template have to be escaped so it survives shell quoting):
CUDA_VISIBLE_DEVICES=0 VLLM_ATTENTION_BACKEND='FLASHINFER' VLLM_USE_FLASHINFER_SAMPLER=1 python -m vllm.entrypoints.openai.api_server --host *** --port *** --model /data/models/Qwen2.5-Coder-7B --trust-remote-code --served-model-name Qwen2.5-Coder-7B --gpu_memory_utilization 0.3 --quantization fp8 --max-model-len 8192 --enable-prefix-caching --disable-log-stats --chat-template "{%- for message in messages %}{%- if (message.role == 'user') or (message.role == 'system') or (message.role == 'assistant' and not message.tool_calls) %}{{- ('' if message.content.startswith('<|fim_prefix|>') else '<|fim_prefix|>') + message.content }}{%- elif message.role == 'assistant' %}{{- '<|fim_prefix|>' }}{%- if message.content %}{{- message.content }}{%- endif %}{%- for tool_call in message.tool_calls %}{%- if tool_call.function is defined %}{%- set tool_call = tool_call.function %}{%- endif %}{{- '<tool_call>{\"name\": \"' }}{{- tool_call.name }}{{- '\", \"arguments\": ' }}{{- tool_call.arguments | tojson }}{{- '}</tool_call>' }}{%- endfor %}{%- elif message.role == 'tool' %}{%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != 'tool') %}{{- '<|fim_prefix|>' }}{%- endif %}{{- '<tool_response>' }}{{- message.content }}{{- '</tool_response>' }}{%- endif %}{%- endfor %}"
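If the inline escaping gets too awkward, --chat-template also accepts a path to a Jinja file, so the same template can be written to a file verbatim and passed by name (the file path below is just an example):

# A quoted heredoc ('EOF') passes the template through with no shell escaping.
cat > /data/templates/qwen25_fim_prefix.jinja <<'EOF'
{%- for message in messages %}{%- if (message.role == 'user') or (message.role == 'system') or (message.role == 'assistant' and not message.tool_calls) %}{{- ('' if message.content.startswith('<|fim_prefix|>') else '<|fim_prefix|>') + message.content }}{%- elif message.role == 'assistant' %}{{- '<|fim_prefix|>' }}{%- if message.content %}{{- message.content }}{%- endif %}{%- for tool_call in message.tool_calls %}{%- if tool_call.function is defined %}{%- set tool_call = tool_call.function %}{%- endif %}{{- '<tool_call>{"name": "' }}{{- tool_call.name }}{{- '", "arguments": ' }}{{- tool_call.arguments | tojson }}{{- '}</tool_call>' }}{%- endfor %}{%- elif message.role == 'tool' %}{%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != 'tool') %}{{- '<|fim_prefix|>' }}{%- endif %}{{- '<tool_response>' }}{{- message.content }}{{- '</tool_response>' }}{%- endif %}{%- endfor %}
EOF
# ...then start the server with:  --chat-template /data/templates/qwen25_fim_prefix.jinja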
The main task of this chat template is simply to insert the missing token. Here are the results after the change:
After that, autocompletion started producing high-quality results:
Yes, this is a crude solution to the problem. Continue needs to fix this bug.
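For anyone who wants to double-check the workaround outside the IDE, a minimal request against the chat endpoint (placeholder host/port again) should now be rendered with the <|fim_prefix|> token and come back as a normal code continuation:

# With the modified chat template, the user message below is rendered as
# "<|fim_prefix|>def add(a, b):\n    return" and the base model simply continues the code.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen2.5-Coder-7B",
        "messages": [{"role": "user", "content": "def add(a, b):\n    return"}],
        "max_tokens": 32,
        "temperature": 0
      }'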
To reproduce
No response
Log output
No response