This package provides containers for both ExLlama and ExLlamaV2:

* `exllama` container uses the https://github.com/jllllll/exllama fork of https://github.com/turboderp/exllama (installed under `/opt/exllama`)
* `exllama:v2` container uses https://github.com/turboderp/exllamav2 (installed under `/opt/exllamav2`)
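For example, either container can be started interactively with the `run.sh`/`autotag` helpers used throughout this repo (a minimal sketch; the benchmark invocation below follows the same pattern):

```bash
# open an interactive shell in the ExLlama container
./run.sh $(./autotag exllama)

# or in the ExLlamaV2 container
./run.sh $(./autotag exllama:v2)
```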
Both loaders are also supported in the oobabooga text-generation-webui container.
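As a sketch of what that looks like, assuming the upstream `server.py` entrypoint and its `--loader` flag (this README does not document the container's exact entrypoint, so treat the path and flags as assumptions):

```bash
# hypothetical invocation selecting the exllamav2 loader in text-generation-webui;
# the actual entrypoint and flags inside the container may differ
./run.sh $(./autotag text-generation-webui) /bin/bash -c \
  'cd /opt/text-generation-webui && python3 server.py --listen --loader exllamav2'
```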
Substitute the GPTQ model from the HuggingFace Hub that you want to run (see exllama-compatible models):
```bash
./run.sh --workdir=/opt/exllama $(./autotag exllama) /bin/bash -c \
  'python3 test_benchmark_inference.py --perf --validate -d $(huggingface-downloader TheBloke/Llama-2-7B-GPTQ)'
```
If the model repository is private or requires authentication, add `--env HUGGINGFACE_TOKEN=<YOUR-ACCESS-TOKEN>` to the command.
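Putting it together, running the same benchmark against a gated model just adds the token flag to the `run.sh` command:

```bash
# the token is passed into the container so huggingface-downloader can authenticate
./run.sh --env HUGGINGFACE_TOKEN=<YOUR-ACCESS-TOKEN> --workdir=/opt/exllama $(./autotag exllama) /bin/bash -c \
  'python3 test_benchmark_inference.py --perf --validate -d $(huggingface-downloader TheBloke/Llama-2-7B-GPTQ)'
```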
| Model | Memory (MB) |
|---|---|
| `TheBloke/Llama-2-7B-GPTQ` | 5,200 |
| `TheBloke/Llama-2-13B-GPTQ` | 9,135 |
| `TheBloke/LLaMA-30b-GPTQ` | 20,206 |
| `TheBloke/Llama-2-70B-GPTQ` | 35,462 |
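For the `exllama:v2` container, upstream exllamav2 ships its own test script; as a sketch, assuming its `test_inference.py` arguments from the upstream repository (`-m` for the model directory, `-p` for a prompt), a quick smoke test might look like:

```bash
# hypothetical: test_inference.py and its -m/-p flags come from the upstream
# exllamav2 repo, not from this README, so verify them against /opt/exllamav2
./run.sh --workdir=/opt/exllamav2 $(./autotag exllama:v2) /bin/bash -c \
  'python3 test_inference.py -m $(huggingface-downloader TheBloke/Llama-2-7B-GPTQ) -p "Once upon a time,"'
```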