
The same model, but different loading methods result in very different inference speeds? #2757

Open
hjs2027864933 opened this issue Nov 19, 2024 · 1 comment


@hjs2027864933

System Info

TGI version: latest; single NVIDIA GeForce RTX 3090

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

The first loading method (loading the Llama 3 8B model from the Hugging Face Hub):

model=meta-llama/Meta-Llama-3-8B-Instruct
volume=/home/data/Project/model # share a volume with the Docker container to avoid downloading weights every run
sudo docker run -it --name tgi_llama3_8B --restart=unless-stopped --shm-size 48g -p 3002:80 --runtime "nvidia" --gpus '"device=1"' -v $volume:/data \
    -e HF_TOKEN=$token \
    -e HF_ENDPOINT="https://hf-mirror.com" \
    -e HF_HUB_ENABLE_HF_TRANSFER=False \
    -e USE_FLASH_ATTENTION=False \
    -e HF_HUB_OFFLINE=1 \
    ghcr.chenby.cn/huggingface/text-generation-inference:latest \
    --model-id $model 

The second loading method (loading the Llama 3 8B model from a local directory):

model=/data/ans_model/meta-llama/Meta-Llama-3-8B-Instruct
volume=/home/data/Project/model # share a volume with the Docker container to avoid downloading weights every run
sudo docker run -it --name tgi_llama3_8B --restart=unless-stopped --shm-size 48g -p 3002:80 --runtime "nvidia" --gpus '"device=1"' -v $volume:/data \
    -e HF_TOKEN=$token \
    -e HF_ENDPOINT="https://hf-mirror.com" \
    -e HF_HUB_ENABLE_HF_TRANSFER=False \
    -e USE_FLASH_ATTENTION=False \
    -e HF_HUB_OFFLINE=1 \
    ghcr.chenby.cn/huggingface/text-generation-inference:latest \
    --model-id $model 
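
To narrow down where the difference comes from, it may help to confirm what each container actually loaded. A minimal sketch, assuming the container name and the 3002:80 port mapping from the commands above; TGI's /info endpoint reports the resolved model id, dtype, and device, and the startup logs usually mention weight conversion or the attention backend in use:

# Inspect the startup logs for hints about weight format, dtype, or attention backend
sudo docker logs tgi_llama3_8B 2>&1 | grep -iE "dtype|safetensors|flash|convert"

# Ask the running server what it actually loaded (model id, dtype, device, version)
curl -s http://localhost:3002/info | python3 -m json.tool

If the two runs report different dtypes or attention backends, that would be a likely source of the speed gap.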

Expected behavior

The Llama 3 8B model loaded from the Hugging Face Hub runs inference much faster than the same model loaded from the local directory. I don't know why this happens; how can I fix it?
Faster (loaded from the Hugging Face Hub): [screenshots "fast", "fast_2"]
Slower (loaded from the local directory): [screenshots "small", "small_2"]
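
For an apples-to-apples comparison, timing the same request against each container with a fixed number of new tokens gives a rough tokens-per-second figure. A sketch, assuming the server is reachable on the mapped port (3002 here), greedy decoding, and that max_new_tokens is always reached; the prompt and token count are placeholders:

PORT=3002
N=128
START=$(date +%s.%N)
curl -s http://localhost:${PORT}/generate \
    -H 'Content-Type: application/json' \
    -d "{\"inputs\": \"Explain the difference between a CPU and a GPU.\", \"parameters\": {\"max_new_tokens\": ${N}, \"do_sample\": false}}" \
    > /dev/null
END=$(date +%s.%N)
echo "approx tokens/sec: $(echo "${N} / (${END} - ${START})" | bc -l)"

Running this against both containers with the same prompt makes the slowdown easy to quantify.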

@hjs2027864933
Author

Looking forward to your reply, thank you.
