System Info
TGI version: latest; single NVIDIA GeForce RTX 3090
Information
- Docker
- The CLI directly

Tasks
- An officially supported command
- My own modifications
Reproduction
The first loading method (loading the Llama 3 8B model from the Hugging Face Hub):
model=meta-llama/Meta-Llama-3-8B-Instruct
volume=/home/data/Project/model # share a volume with the Docker container to avoid downloading weights every run
sudo docker run -it --name tgi_llama3_8B --restart=unless-stopped --shm-size 48g -p 3002:80 --runtime "nvidia" --gpus '"device=1"' -v $volume:/data \
-e HF_TOKEN=$token \
-e HF_ENDPOINT="https://hf-mirror.com" \
-e HF_HUB_ENABLE_HF_TRANSFER=False \
-e USE_FLASH_ATTENTION=False \
-e HF_HUB_OFFLINE=1 \
ghcr.chenby.cn/huggingface/text-generation-inference:latest \
--model-id $model
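
Once that container is up, a quick request such as the one below confirms the endpoint responds (a minimal sketch; the prompt and max_new_tokens are arbitrary, and port 3002 matches the -p mapping above):

# Smoke test against the running container.
curl 127.0.0.1:3002/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 64}}'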
The second loading method (loading the Llama 3 8B model from a local directory):
model=/data/ans_model/meta-llama/Meta-Llama-3-8B-Instruct
volume=/home/data/Project/model # share a volume with the Docker container to avoid downloading weights every run
sudo docker run -it --name tgi_llama3_8B --restart=unless-stopped --shm-size 48g -p 3002:80 --runtime "nvidia" --gpus '"device=1"' -v $volume:/data \
-e HF_TOKEN=$token \
-e HF_ENDPOINT="https://hf-mirror.com" \
-e HF_HUB_ENABLE_HF_TRANSFER=False \
-e USE_FLASH_ATTENTION=False \
-e HF_HUB_OFFLINE=1 \
ghcr.chenby.cn/huggingface/text-generation-inference:latest \
--model-id $model
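
Since both containers should be serving the same weights, one thing worth ruling out (my assumption, not something verified above) is that the two paths resolve to different files, e.g. a different revision or weight format. A quick comparison from the host, using the paths from the two commands:

# List what each load path actually contains. The hub-cache layout below assumes
# TGI's default HUGGINGFACE_HUB_CACHE=/data inside the container, which maps to
# $volume on the host.
ls -l $volume/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/*/
ls -l $volume/ans_model/meta-llama/Meta-Llama-3-8B-Instruct/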
Expected behavior
Inference with the Llama 3 8B model loaded from the Hugging Face Hub is much faster than with the same model loaded from the local directory. I don't know why this happens; how can I fix it?
Faster (loaded from the Hugging Face Hub): [screenshot of generation speed]
Slower (loaded from the local directory): [screenshot of generation speed]
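
For a rough like-for-like comparison (an illustrative sketch, not the benchmark behind the screenshots above), the same fixed-size generation can be timed against each container:

# Time one generation; run the identical command against both containers and
# compare wall-clock time. Note the model may stop early at an EOS token, so
# the two runs may not emit exactly max_new_tokens tokens each.
time curl -s 127.0.0.1:3002/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 256}}' \
    > /dev/null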