
CUDA Error: No kernel image is available for execution on the device #2703

Open
2 of 4 tasks
shubhamgajbhiye1994 opened this issue Oct 28, 2024 · 0 comments
System Info

Hardware config:
GPU: Quadro P5000 (16 GB VRAM)
CUDA Version: 12.2
NVIDIA-SMI: 535.183.01
RAM: 32 GB

After executing the docker command:
docker run --gpus all \
  --shm-size 2g \
  -p 8080:80 \
  -v $PWD:/data \
  -e HF_TOKEN=keyt \
  ghcr.io/huggingface/text-generation-inference:2.3.1 \
  --model-id meta-llama/Llama-3.2-1B \
  --trust-remote-code

I get the error: CUDA Error: no kernel image is available for execution on the device /usr/src/flash-attention/csrc/layer_norm/ln_fwd_kernels.cuh 236 rank=0

--------------------log-------------------------------------------
2024-10-28T09:20:28.210454Z INFO hf_hub: Token file not found "/data/token"
2024-10-28T09:20:28.210628Z INFO text_generation_launcher: Model supports up to 131072 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using --max-batch-prefill-tokens=131122 --max-total-tokens=131072 --max-input-tokens=131071.
2024-10-28T09:20:29.540317Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
2024-10-28T09:20:29.540350Z INFO text_generation_launcher: Default max_input_tokens to 4095
2024-10-28T09:20:29.540359Z INFO text_generation_launcher: Default max_total_tokens to 4096
2024-10-28T09:20:29.540366Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 4145
2024-10-28T09:20:29.540374Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-10-28T09:20:29.540385Z WARN text_generation_launcher: trust_remote_code is set. Trusting that model meta-llama/Llama-3.2-1B do not contain malicious code.
2024-10-28T09:20:29.540598Z INFO download: text_generation_launcher: Starting check and download process for meta-llama/Llama-3.2-1B
2024-10-28T09:20:33.468493Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-10-28T09:20:34.166805Z INFO download: text_generation_launcher: Successfully downloaded weights for meta-llama/Llama-3.2-1B
2024-10-28T09:20:34.167174Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-10-28T09:20:37.275145Z INFO text_generation_launcher: Using prefix caching = True
2024-10-28T09:20:37.275202Z INFO text_generation_launcher: Using Attention = flashinfer
2024-10-28T09:20:41.969535Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-10-28T09:20:41.994397Z INFO shard-manager: text_generation_launcher: Shard ready in 7.809598311s rank=0
2024-10-28T09:20:42.074720Z INFO text_generation_launcher: Starting Webserver
2024-10-28T09:20:42.156666Z INFO text_generation_router_v3: backends/v3/src/lib.rs:90: Warming up model
2024-10-28T09:20:42.661551Z ERROR warmup{max_input_length=4095 max_prefill_tokens=4145 max_total_tokens=4096 max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: transport error
Error: Backend(Warmup(Generation("transport error")))
2024-10-28T09:20:42.695663Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

2024-10-28 09:20:35.725 | INFO | text_generation_server.utils.import_utils::75 - Detected system cuda
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
@custom_bwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
@custom_bwd
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.
return func(*args, **kwargs)
CUDA Error: no kernel image is available for execution on the device /usr/src/flash-attention/csrc/layer_norm/ln_fwd_kernels.cuh 236 rank=0
2024-10-28T09:20:42.741725Z ERROR text_generation_launcher: Shard 0 crashed
2024-10-28T09:20:42.741753Z INFO text_generation_launcher: Terminating webserver
2024-10-28T09:20:42.741783Z INFO text_generation_launcher: Waiting for webserver to gracefully shutdown
Error: ShardFailed
2024-10-28T09:20:42.741824Z INFO text_generation_launcher: webserver terminated
2024-10-28T09:20:42.741839Z INFO text_generation_launcher: Shutting down shards
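
The Quadro P5000 is a Pascal-generation GPU (compute capability 6.1, i.e. sm_61), so the error presumably means the prebuilt flash-attention kernels in the 2.3.1 image were not compiled for this architecture. A minimal diagnostic sketch to confirm the device's compute capability, assuming PyTorch is importable (as it is inside the TGI container):

# Diagnostic sketch: print each visible GPU's compute capability.
# The "no kernel image" error occurs when no compiled kernel matches this value.
import torch

if not torch.cuda.is_available():
    print("CUDA is not visible to PyTorch")
else:
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> sm_{major}{minor}")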

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Just run the docker command below:
docker run --gpus all \
  --shm-size 2g \
  -p 8080:80 \
  -v $PWD:/data \
  -e HF_TOKEN=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 \
  ghcr.io/huggingface/text-generation-inference:2.3.1 \
  --model-id meta-llama/Llama-3.2-1B \
  --trust-remote-code

On the following hardware:
GPU: Quadro P5000 (16 GB VRAM)
CUDA Version: 12.2
NVIDIA-SMI: 535.183.01
RAM: 32 GB

Expected behavior

It should start a server on localhost exposing the LLM model endpoint.
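
For reference, a minimal sketch of how the endpoint would be queried once the server is up, assuming TGI's standard /generate route on the mapped port 8080 (the prompt text is only an example):

# Usage sketch against the endpoint mapped by "-p 8080:80" above.
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 32}},
)
print(resp.json())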
