
Qwen2.5 LoRA Extraction not working in vLLM & Aphrodite Engine #459

Open

Nero10578 opened this issue Nov 19, 2024 · 2 comments

Nero10578 commented Nov 19, 2024
Usually you can extract a LoRA with mergekit and then run it in vLLM or Aphrodite Engine just fine. This works for Llama and Mistral models, but it doesn't seem to work for Qwen2.5 models.

If I use a LoRA created by actually training with Axolotl, vLLM and Aphrodite Engine run Qwen LoRAs just fine.

The extraction itself also completes without errors; the resulting adapter just cannot be loaded.
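
(For reference, this is roughly how the extracted adapter gets loaded with vLLM's offline LoRA API; the model name, adapter name, and prompt below are placeholders, the adapter path is the one from the traceback.)

    # Minimal loading sketch -- placeholder model name/prompt, adapter path from the traceback.
    from vllm import LLM, SamplingParams
    from vllm.lora.request import LoRARequest

    llm = LLM(model="Qwen/Qwen2.5-Coder-7B-Instruct", enable_lora=True)

    outputs = llm.generate(
        ["Write a hello world in Python."],
        SamplingParams(max_tokens=64),
        lora_request=LoRARequest(
            "qwen-coder-lora",  # arbitrary adapter name
            1,                  # adapter id
            "/home/owen/loras/Qwen2.5-Coder-7B-Instruct-lora",
        ),
    )
    print(outputs[0].outputs[0].text)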

Error from Aphrodite Engine when trying to run the Qwen2.5-7B LoRA:

ValueError: While loading /home/owen/loras/Qwen2.5-Coder-7B-Instruct-lora, expected target modules in ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'] but received ['lm_head', 'lm_head', 'model.embed_tokens', 'model.embed_tokens']. Please verify that the loaded LoRA module is correct

Full traceback:

Future exception was never retrieved
future: <Future finished exception=RuntimeError('Loading lora /home/owen/loras/Qwen2.5-Coder-7B-Instruct-lora failed')>
Traceback (most recent call last):
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/lora/worker_manager.py", line 92, in _load_adapter
    lora = self._lora_model_cls.from_local_checkpoint(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/lora/models.py", line 221, in from_local_checkpoint
    raise ValueError(
ValueError: While loading /home/owen/loras/Qwen2.5-Coder-7B-Instruct-lora, expected target modules in ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'] but received ['lm_head', 'lm_head', 'model.embed_tokens', 'model.embed_tokens']. Please verify that the loaded LoRA module is correct

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/endpoints/openai/rpc/server.py", line 119, in generate
    async for request_output in results_generator:
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 917, in generate
    async for output in await self.add_request(
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 110, in generator
    raise result
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 51, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 784, in run_engine_loop
    result = task.result()
             ^^^^^^^^^^^^^
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 727, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 283, in step_async
    output = await self.model_executor.execute_model_async(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/executor/gpu_executor.py", line 163, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/miniconda3/envs/aphrodite/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/task_handler/worker_base.py", line 301, in execute_model
    output = self.model_runner.execute_model(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/miniconda3/envs/aphrodite/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 1494, in execute_model
    self.set_active_loras(model_input.lora_requests,
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 1140, in set_active_loras
    self.lora_manager.set_active_adapters(lora_requests, lora_mapping)
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/lora/worker_manager.py", line 135, in set_active_adapters
    set_active_adapters_worker(requests, mapping, self._apply_adapters,
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/adapter_commons/utils.py", line 52, in set_active_adapters_worker
    apply_adapters_func(requests)
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/lora/worker_manager.py", line 194, in _apply_adapters
    self.add_adapter(lora)
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/lora/worker_manager.py", line 203, in add_adapter
    lora = self._load_adapter(lora_request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/aphro-latest/aphrodite-engine/aphrodite/lora/worker_manager.py", line 105, in _load_adapter
    raise RuntimeError(f"Loading lora {lora_path} failed") from e
RuntimeError: Loading lora /home/owen/loras/Qwen2.5-Coder-7B-Instruct-lora failed
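
A quick way to see which modules the extraction actually targeted is to read the adapter config it wrote (the file name below is the standard PEFT adapter_config.json, path as in the traceback):

    # Print the modules the extracted adapter declares in its PEFT config.
    import json

    with open("/home/owen/loras/Qwen2.5-Coder-7B-Instruct-lora/adapter_config.json") as f:
        adapter_config = json.load(f)

    print(adapter_config.get("target_modules"))
    print(adapter_config.get("modules_to_save"))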
jukofyork (Contributor) commented Nov 22, 2024

ValueError: While loading /home/owen/loras/Qwen2.5-Coder-7B-Instruct-lora, expected target modules in ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'] but received ['lm_head', 'lm_head', 'model.embed_tokens', 'model.embed_tokens']. Please verify that the loaded LoRA module is correct

It doesn't like the input and output embeddings in the LoRA adapter.

They are valid to have in a LoRA, but it is a bit weird that each of them is listed twice?!
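
If you want to double-check what actually got written into the adapter, something like this should list the tensor names (the file name is the usual adapter_model.safetensors, adjust if yours differs):

    # List the tensors in the extracted adapter to see whether lm_head /
    # model.embed_tokens ended up in it, and whether anything is duplicated.
    from safetensors import safe_open

    path = "/home/owen/loras/Qwen2.5-Coder-7B-Instruct-lora/adapter_model.safetensors"
    with safe_open(path, framework="pt") as f:
        for key in sorted(f.keys()):
            print(key)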

Can you try commenting out these two module_details.append lines and replacing them with a pass, like so:

        if module == pretrained_model.get_input_embeddings():
            # if isinstance(module, torch.nn.Embedding):
            # skip recording the input embedding (model.embed_tokens)
            pass  # module_details.append(("embedding", name, module.weight.size()))
        elif module == pretrained_model.get_output_embeddings():
            # if isinstance(module, torch.nn.Embedding):
            # skip recording the output embedding (lm_head)
            pass  # module_details.append(("output", name, module.weight.size()))

and see if the LoRA it creates works OK?

Also, can you tell me what the peak VRAM use is with these commented out, to help with your other problem of high VRAM use? If it is just these two modules causing the problem, then I can easily add a command-line option to skip the input/output embeddings; but if it still uses a lot of VRAM, it must be something in the SVD function that upcasts some tensors to float32.
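
For context, the SVD step is basically a rank-r factorisation of the weight delta between the fine-tuned and base weights; here is a rough sketch of that step (my illustration, not the actual mergekit code) with the float32 upcast made explicit:

    # Rough sketch of the per-module SVD step (illustration, not mergekit's code).
    # The .float() upcast is the suspect: for lm_head / embed_tokens the delta is
    # a (vocab_size x hidden_size) matrix, which is huge for Qwen2.5's ~152k vocab.
    import torch

    def extract_lora_pair(w_base: torch.Tensor, w_ft: torch.Tensor, rank: int):
        delta = (w_ft - w_base).float()   # upcast to float32 for numerical stability
        u, s, v = torch.svd_lowrank(delta, q=rank)
        lora_b = u * s.sqrt()             # (out_features, rank)
        lora_a = (v * s.sqrt()).T         # (rank, in_features)
        return lora_a, lora_b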


The "doubling listing" in the exception, makes me think it could also be something to do with having tied input/output tensors, but I think only the very tiny qwen models use this.

You can tell if you look in the config.json file:

"tie_word_embeddings": false

or look in the model.safetensors.index.json file to see if both of these are listed:

"lm_head.weight": "model-00037-of-00037.safetensors",
"model.embed_tokens.weight": "model-00001-of-00037.safetensors",

Nero10578 (Author) commented


Will try this and get back to you. Thanks!
