This package provides containers for both ExLlama and ExLlamaV2:

* `exllama` container uses the https://github.com/jllllll/exllama fork of https://github.com/turboderp/exllama (installed under `/opt/exllama`)
* `exllama:v2` container uses https://github.com/turboderp/exllamav2 (installed under `/opt/exllamav2`)
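For example, either container can be started interactively with the `run.sh`/`autotag` helpers used throughout this repo (a minimal sketch; the benchmark invocation below follows the same pattern):

```bash
# open an interactive shell in the ExLlama container
./run.sh $(./autotag exllama)

# or in the ExLlamaV2 container
./run.sh $(./autotag exllama:v2)
```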
Both loaders are also supported in the oobabooga text-generation-webui container.
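As a sketch of what that looks like, assuming the upstream `server.py` entrypoint and its `--loader` flag (this README does not document the container's exact entrypoint, so treat the path and flags as assumptions):

```bash
# hypothetical invocation selecting the exllamav2 loader in text-generation-webui;
# the actual entrypoint and flags inside the container may differ
./run.sh $(./autotag text-generation-webui) /bin/bash -c \
  'cd /opt/text-generation-webui && python3 server.py --listen --loader exllamav2'
```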
Substitute the GPTQ model from the HuggingFace Hub that you want to run (see exllama-compatible models):
```bash
./run.sh --workdir=/opt/exllama $(./autotag exllama) /bin/bash -c \
  'python3 test_benchmark_inference.py --perf --validate -d $(huggingface-downloader TheBloke/Llama-2-7B-GPTQ)'
```
If the model repository is private or requires authentication, add `--env HUGGINGFACE_TOKEN=<YOUR-ACCESS-TOKEN>` to the command.
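Putting it together, running the same benchmark against a gated model just adds the token flag to the `run.sh` command:

```bash
# the token is passed into the container so huggingface-downloader can authenticate
./run.sh --env HUGGINGFACE_TOKEN=<YOUR-ACCESS-TOKEN> --workdir=/opt/exllama $(./autotag exllama) /bin/bash -c \
  'python3 test_benchmark_inference.py --perf --validate -d $(huggingface-downloader TheBloke/Llama-2-7B-GPTQ)'
```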
| Model | Memory (MB) |
|---|---|
| `TheBloke/Llama-2-7B-GPTQ` | 5,200 |
| `TheBloke/Llama-2-13B-GPTQ` | 9,135 |
| `TheBloke/LLaMA-30b-GPTQ` | 20,206 |
| `TheBloke/Llama-2-70B-GPTQ` | 35,462 |
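For the `exllama:v2` container, upstream exllamav2 ships its own test script; as a sketch, assuming its `test_inference.py` arguments from the upstream repository (`-m` for the model directory, `-p` for a prompt), a quick smoke test might look like:

```bash
# hypothetical: test_inference.py and its -m/-p flags come from the upstream
# exllamav2 repo, not from this README, so verify them against /opt/exllamav2
./run.sh --workdir=/opt/exllamav2 $(./autotag exllama:v2) /bin/bash -c \
  'python3 test_inference.py -m $(huggingface-downloader TheBloke/Llama-2-7B-GPTQ) -p "Once upon a time,"'
```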