
Code autocompletion with Qwen2.5 7B base and VLLM outputs garbage results. The prefix in the prompt_template is not being used. #3372

Swipe4057 opened this issue Dec 14, 2024 · 4 comments
Assignees: sestinj
Labels: area:autocomplete, ide:vscode, kind:bug, needs-triage

Comments


Swipe4057 commented Dec 14, 2024


Relevant environment info

- OS: Windows
- Continue version: v0.8.61
- IDE version: VS Code 1.95.3
- Model: Qwen/Qwen2.5-Coder-7B
- config.json:
  // Tab autocomplete model
    "tabAutocompleteModel": {
      "title": "Qwen2.5-Coder-7B",
      "model": "Qwen2.5-Coder-7B",
      "contextLength": 1000,
      "provider": "openai",
      "apiBase": "",
      "apiKey": "",
      "useLegacyCompletionsEndpoint": false,

      // Generation settings
      "completionOptions": {
        "temperature": 0.2,
        "topP": 0.9,
        "maxTokens": 200,
        "stop": [
          "<|endoftext|>",
          "<|fim_prefix|>",
          "<|fim_middle|>",
          "<|fim_suffix|>",
          "<|fim_pad|>",
          "<|repo_name|>",
          "<|file_sep|>",
          "<|im_start|>",
          "<|im_end|>"
        ]
      },
      // HTTP request settings
      "requestOptions": {
        "extraBodyProperties": {
          "repetition_penalty": 1.05
        },
        "verifySsl": false,
        "headers": {
          "User-Agent": ""
        }
      }
    },

    // Autocomplete settings
    "tabAutocompleteOptions": {
      "disable": false,
      "useCopyBuffer": false,
      "maxPromptTokens": 1024,
      "disableInFiles": ["*.md"],
      "prefixPercentage": 0.5,
      "maxSuffixPercentage": 0.5,
      "multilineCompletions": "never",
      "debounceDelay": 500,
      "useFileSuffix": true,
      "useCache": true,
      "onlyMyCode": true,
      "template": "<|fim_prefix|>{{{prefix}}}<|fim_suffix|>{{{suffix}}}<|fim_middle|>"
    },

Description

Hello, when I set up the VLLM server and connected it to Continue, I started getting garbage autocompletion results.
First, I started the server with the usual command:

CUDA_VISIBLE_DEVICES=0 VLLM_ATTENTION_BACKEND='FLASHINFER' VLLM_USE_FLASHINFER_SAMPLER=1 python -m vllm.entrypoints.openai.api_server --host *** --port *** --model /data/models/Qwen2.5-Coder-7B --trust-remote-code --served-model-name Qwen2.5-Coder-7B --gpu_memory_utilization 0.3 --quantization fp8 --max-model-len 8192 --enable-prefix-caching --disable-log-stats

Here are the main reasons for the garbage results:

  1. The correctly configured template is not being applied. In the VS Code console, I see the following message showing the request being sent:

[Screenshot: photo_2024-12-14_14-30-49]

As you can see, the <|fim_prefix|> token, which is required by the specified template, is completely missing at the beginning.

  2. The VLLM server automatically uses the chat_template from the model's tokenizer_config.json, even though this is a Base model.
    Here's what I see in the server logs:

[Screenshot: photo_2024-12-14_14-30-11 (2)]

As you can see, it adds a lot of tokens and system messages that are simply not applicable to a Base model.
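For reference, this is roughly the kind of request the configured FIM template is supposed to produce. Sent straight to vLLM's /v1/completions endpoint, which applies no chat template, the prompt reaches the model verbatim (host, port and the code fragment here are placeholders, not taken from the screenshots):

curl http://<host>:<port>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen2.5-Coder-7B",
        "prompt": "<|fim_prefix|>def binary_search(arr, target):\n    <|fim_suffix|>\n    return -1<|fim_middle|>",
        "max_tokens": 200,
        "temperature": 0.2,
        "stop": ["<|endoftext|>", "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"]
      }'

Requests that go through /v1/chat/completions, by contrast, are first wrapped in the model's chat_template, which is exactly where the extra tokens in the log above come from.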
Due to the above reasons, code autocompletion completely breaks and leads to garbage results. Here's an example:

[Screenshot: photo_2024-12-14_14-29-54]

Temporary Solution:
I slightly modified the model's chat_template to automatically insert the <|fim_prefix|> token at the beginning and removed all other unnecessary tokens. Here's how the VLLM server should be started:

CUDA_VISIBLE_DEVICES=0 VLLM_ATTENTION_BACKEND='FLASHINFER' VLLM_USE_FLASHINFER_SAMPLER=1 python -m vllm.entrypoints.openai.api_server --host *** --port *** --model /data/models/Qwen2.5-Coder-7B --trust-remote-code --served-model-name Qwen2.5-Coder-7B --gpu_memory_utilization 0.3 --quantization fp8 --max-model-len 8192 --enable-prefix-caching --disable-log-stats --chat-template "{%- for message in messages %}{%- if (message.role == 'user') or (message.role == 'system') or (message.role == 'assistant' and not message.tool_calls) %}{{- ('' if message.content.startswith('<|fim_prefix|>') else '<|fim_prefix|>') + message.content }}{%- elif message.role == 'assistant' %}{{- '<|fim_prefix|>' }}{%- if message.content %}{{- message.content }}{%- endif %}{%- for tool_call in message.tool_calls %}{%- if tool_call.function is defined %}{%- set tool_call = tool_call.function %}{%- endif %}{{- '<tool_call>{\"name\": \"' }}{{- tool_call.name }}{{- '\", \"arguments\": ' }}{{- tool_call.arguments | tojson }}{{- '}</tool_call>' }}{%- endfor %}{%- elif message.role == 'tool' %}{%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != 'tool') %}{{- '<|fim_prefix|>' }}{%- endif %}{{- '<tool_response>' }}{{- message.content }}{{- '</tool_response>' }}{%- endif %}{%- endfor %}"
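The same template is easier to read, and avoids shell-quoting pitfalls, if it is saved to a file, since vLLM's --chat-template flag also accepts a file path. A sketch of that variant (the file name and location are arbitrary; the trailing "-" on the final endfor only strips the file's closing newline):

# Write the modified template to a file (wrapped over multiple lines for readability).
cat > /data/models/qwen_fim_template.jinja << 'EOF'
{%- for message in messages %}
  {%- if (message.role == 'user') or (message.role == 'system') or (message.role == 'assistant' and not message.tool_calls) %}
    {{- ('' if message.content.startswith('<|fim_prefix|>') else '<|fim_prefix|>') + message.content }}
  {%- elif message.role == 'assistant' %}
    {{- '<|fim_prefix|>' }}
    {%- if message.content %}{{- message.content }}{%- endif %}
    {%- for tool_call in message.tool_calls %}
      {%- if tool_call.function is defined %}{%- set tool_call = tool_call.function %}{%- endif %}
      {{- '<tool_call>{"name": "' }}{{- tool_call.name }}{{- '", "arguments": ' }}{{- tool_call.arguments | tojson }}{{- '}</tool_call>' }}
    {%- endfor %}
  {%- elif message.role == 'tool' %}
    {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != 'tool') %}{{- '<|fim_prefix|>' }}{%- endif %}
    {{- '<tool_response>' }}{{- message.content }}{{- '</tool_response>' }}
  {%- endif %}
{%- endfor -%}
EOF

# Same launch command as above, but pointing --chat-template at the file.
CUDA_VISIBLE_DEVICES=0 VLLM_ATTENTION_BACKEND='FLASHINFER' VLLM_USE_FLASHINFER_SAMPLER=1 python -m vllm.entrypoints.openai.api_server --host *** --port *** --model /data/models/Qwen2.5-Coder-7B --trust-remote-code --served-model-name Qwen2.5-Coder-7B --gpu_memory_utilization 0.3 --quantization fp8 --max-model-len 8192 --enable-prefix-caching --disable-log-stats --chat-template /data/models/qwen_fim_template.jinja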

The main task of this chat-template is to insert the missing token. Here are the results after this change:

[Screenshot: photo_2024-12-14_14-34-40]

After that, autocompletion started producing high-quality results:

[Screenshot: photo_2024-12-14_14-34-25]

Yes, this is a crude solution to the problem. Continue needs to fix this bug.

To reproduce

No response

Log output

No response

@sestinj sestinj self-assigned this Dec 14, 2024
@dosubot (bot) added the area:autocomplete, ide:vscode and kind:bug labels Dec 14, 2024
Swipe4057 (Author) commented:

I found more bugs: with certain combinations of prefixPercentage and maxSuffixPercentage, the <|fim_suffix|> token may not be sent at all. Also, the release and pre-release versions of the extension send the tokens differently; the pre-release version has errors as well, just different ones.

Swipe4057 (Author) commented Dec 16, 2024

@sestinj For Qwen, it is necessary to add another stop token <|cursor|> according to QwenLM/Qwen2.5-Coder#193. Alternatively, handle it in post-processing.
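
In the config above, that would just mean adding it to the existing stop list, e.g. (fragment of completionOptions only):

      "completionOptions": {
        "stop": [
          "<|endoftext|>",
          "<|fim_prefix|>",
          "<|fim_middle|>",
          "<|fim_suffix|>",
          "<|fim_pad|>",
          "<|repo_name|>",
          "<|file_sep|>",
          "<|im_start|>",
          "<|im_end|>",
          "<|cursor|>"
        ]
      }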

AnnoyingTechnology (Contributor) commented:

> @sestinj For Qwen, it is necessary to add another stop token <|cursor|> according to QwenLM/Qwen2.5-Coder#193. Alternatively, handle it in post-processing.

I think it must be handled in post-processing until Continue.dev is able to use it to move the cursor. Treating it as a stop token would produce partial completions.

Swipe4057 (Author) commented Dec 16, 2024

> > @sestinj For Qwen, it is necessary to add another stop token <|cursor|> according to QwenLM/Qwen2.5-Coder#193. Alternatively, handle it in post-processing.
>
> I think it must be handled in post-processing until Continue.dev is able to use it to move the cursor. Treating it as a stop token would produce partial completions.

I understand, but without moving the cursor it just looks like buggy autocompletion: the user has to delete the token, or they might simply reject the suggestion. So adding it to the stop tokens is a temporary solution; the user can accept the completion and simply keep writing code where they need it.
