After executing the docker command:

docker run --gpus all \
  --shm-size 2g \
  -p 8080:80 \
  -v $PWD:/data \
  -e HF_TOKEN=keyt \
  ghcr.io/huggingface/text-generation-inference:2.3.1 \
  --model-id meta-llama/Llama-3.2-1B \
  --trust-remote-code

I get the following error: CUDA Error: no kernel image is available for execution on the device /usr/src/flash-attention/csrc/layer_norm/ln_fwd_kernels.cuh 236 rank=0
--------------------log-------------------------------------------
2024-10-28T09:20:28.210454Z INFO hf_hub: Token file not found "/data/token"
2024-10-28T09:20:28.210628Z INFO text_generation_launcher: Model supports up to 131072 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using --max-batch-prefill-tokens=131122 --max-total-tokens=131072 --max-input-tokens=131071.
2024-10-28T09:20:29.540317Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
2024-10-28T09:20:29.540350Z INFO text_generation_launcher: Default max_input_tokens to 4095
2024-10-28T09:20:29.540359Z INFO text_generation_launcher: Default max_total_tokens to 4096
2024-10-28T09:20:29.540366Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 4145
2024-10-28T09:20:29.540374Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-10-28T09:20:29.540385Z WARN text_generation_launcher: trust_remote_code is set. Trusting that model meta-llama/Llama-3.2-1B do not contain malicious code.
2024-10-28T09:20:29.540598Z INFO download: text_generation_launcher: Starting check and download process for meta-llama/Llama-3.2-1B
2024-10-28T09:20:33.468493Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-10-28T09:20:34.166805Z INFO download: text_generation_launcher: Successfully downloaded weights for meta-llama/Llama-3.2-1B
2024-10-28T09:20:34.167174Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-10-28T09:20:37.275145Z INFO text_generation_launcher: Using prefix caching = True
2024-10-28T09:20:37.275202Z INFO text_generation_launcher: Using Attention = flashinfer
2024-10-28T09:20:41.969535Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-10-28T09:20:41.994397Z INFO shard-manager: text_generation_launcher: Shard ready in 7.809598311s rank=0
2024-10-28T09:20:42.074720Z INFO text_generation_launcher: Starting Webserver
2024-10-28T09:20:42.156666Z INFO text_generation_router_v3: backends/v3/src/lib.rs:90: Warming up model
2024-10-28T09:20:42.661551Z ERROR warmup{max_input_length=4095 max_prefill_tokens=4145 max_total_tokens=4096 max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: transport error
Error: Backend(Warmup(Generation("transport error")))
2024-10-28T09:20:42.695663Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2024-10-28 09:20:35.725 | INFO | text_generation_server.utils.import_utils::75 - Detected system cuda
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
@custom_bwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
@custom_bwd
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.
return func(*args, **kwargs)
CUDA Error: no kernel image is available for execution on the device /usr/src/flash-attention/csrc/layer_norm/ln_fwd_kernels.cuh 236 rank=0
2024-10-28T09:20:42.741725Z ERROR text_generation_launcher: Shard 0 crashed
2024-10-28T09:20:42.741753Z INFO text_generation_launcher: Terminating webserver
2024-10-28T09:20:42.741783Z INFO text_generation_launcher: Waiting for webserver to gracefully shutdown
Error: ShardFailed
2024-10-28T09:20:42.741824Z INFO text_generation_launcher: webserver terminated
2024-10-28T09:20:42.741839Z INFO text_generation_launcher: Shutting down shards
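For reference, the context-length flags suggested in the INFO line near the top of the log are text-generation-launcher arguments; like --model-id, they would be appended after the image name, for example (a sketch only, not something that was tried here):

docker run --gpus all --shm-size 2g -p 8080:80 -v $PWD:/data -e HF_TOKEN=keyt \
  ghcr.io/huggingface/text-generation-inference:2.3.1 \
  --model-id meta-llama/Llama-3.2-1B \
  --max-input-tokens=131071 --max-total-tokens=131072 --max-batch-prefill-tokens=131122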
Information
- Docker
- The CLI directly

Tasks
- An officially supported command
- My own modifications
Reproduction
Just run the docker command below:
docker run --gpus all \
  --shm-size 2g \
  -p 8080:80 \
  -v $PWD:/data \
  -e HF_TOKEN=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 \
  ghcr.io/huggingface/text-generation-inference:2.3.1 \
  --model-id meta-llama/Llama-3.2-1B \
  --trust-remote-code
Hardware config:
GPU: Quadro P5000 (16 GB VRAM)
CUDA Version: 12.2
NVIDIA-SMI: 535.183.01
RAM: 32 GB
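For reference, "no kernel image is available for execution on the device" generally means the CUDA kernels shipped in the image were not compiled for this GPU's architecture; the Quadro P5000 is a Pascal-generation card (compute capability 6.1). A quick way to double-check what the driver and PyTorch report (a sketch; assumes a driver recent enough to support the compute_cap query field and an environment with PyTorch installed):

# Compute capability as reported by the NVIDIA driver
nvidia-smi --query-gpu=name,compute_cap --format=csv
# Compute capability as seen by PyTorch (run inside any environment with torch + CUDA)
python3 -c "import torch; print(torch.cuda.get_device_capability(0))"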
Expected behavior
It should start a localhost endpoint serving the LLM model.
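If the container came up successfully, the endpoint would be exercised roughly like this (a sketch against the 8080:80 port mapping above; the prompt is just an example):

curl 127.0.0.1:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 20}}'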