
The same model, but different loading methods result in very different inference speeds? #2757

Open
hjs2027864933 opened this issue Nov 19, 2024 · 1 comment


@hjs2027864933

System Info

TGI version: latest; single NVIDIA GeForce RTX 3090

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

The first loading method (loading the Llama 3 8B model from the Hugging Face Hub):

model=meta-llama/Meta-Llama-3-8B-Instruct
volume=/home/data/Project/model # share a volume with the Docker container to avoid downloading weights every run
sudo docker run -it --name tgi_llama3_8B --restart=unless-stopped --shm-size 48g -p 3002:80 --runtime "nvidia" --gpus '"device=1"' -v $volume:/data \
    -e HF_TOKEN=$token \
    -e HF_ENDPOINT="https://hf-mirror.com" \
    -e HF_HUB_ENABLE_HF_TRANSFER=False \
    -e USE_FLASH_ATTENTION=False \
    -e HF_HUB_OFFLINE=1 \
    ghcr.chenby.cn/huggingface/text-generation-inference:latest \
    --model-id $model 

The second loading method (loading the Llama 3 8B model from a local directory):

model=/data/ans_model/meta-llama/Meta-Llama-3-8B-Instruct
volume=/home/data/Project/model # share a volume with the Docker container to avoid downloading weights every run
sudo docker run -it --name tgi_llama3_8B --restart=unless-stopped --shm-size 48g -p 3002:80 --runtime "nvidia" --gpus '"device=1"' -v $volume:/data \
    -e HF_TOKEN=$token \
    -e HF_ENDPOINT="https://hf-mirror.com" \
    -e HF_HUB_ENABLE_HF_TRANSFER=False \
    -e USE_FLASH_ATTENTION=False \
    -e HF_HUB_OFFLINE=1 \
    ghcr.chenby.cn/huggingface/text-generation-inference:latest \
    --model-id $model 
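
To narrow down where the difference comes from, it may help to confirm what each container actually loaded. A minimal sketch, assuming the container name and the 3002:80 port mapping from the commands above; TGI's /info endpoint reports the resolved model id, dtype, and device, and the startup logs usually mention weight conversion or the attention backend in use:

# Inspect the startup logs for hints about weight format, dtype, or attention backend
sudo docker logs tgi_llama3_8B 2>&1 | grep -iE "dtype|safetensors|flash|convert"

# Ask the running server what it actually loaded (model id, dtype, device, version)
curl -s http://localhost:3002/info | python3 -m json.tool

If the two runs report different dtypes or attention backends, that would be a likely source of the speed gap.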

Expected behavior

The Llama 3 8B model loaded from the Hugging Face Hub runs inference much faster than the same model loaded from the local directory. I don't know why this happens; how can I fix it?
Faster (loaded from the Hugging Face Hub): [screenshots "fast", "fast_2"]
Slower (loaded from the local directory): [screenshots "small", "small_2"]
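
For an apples-to-apples comparison, timing the same request against each container with a fixed number of new tokens gives a rough tokens-per-second figure. A sketch, assuming the server is reachable on the mapped port (3002 here), greedy decoding, and that max_new_tokens is always reached; the prompt and token count are placeholders:

PORT=3002
N=128
START=$(date +%s.%N)
curl -s http://localhost:${PORT}/generate \
    -H 'Content-Type: application/json' \
    -d "{\"inputs\": \"Explain the difference between a CPU and a GPU.\", \"parameters\": {\"max_new_tokens\": ${N}, \"do_sample\": false}}" \
    > /dev/null
END=$(date +%s.%N)
echo "approx tokens/sec: $(echo "${N} / (${END} - ${START})" | bc -l)"

Running this against both containers with the same prompt makes the slowdown easy to quantify.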

@hjs2027864933
Author

Looking forward to your reply, thank you.
