The HuggingFace Transformers library supports a wide variety of NLP and vision models with a convenient API, and is used by many of the other LLM packages. It's compatible with a large number of models on HuggingFace Hub.
Note: If you wish to use Transformers' integrated bitsandbytes quantization (`load_in_8bit`/`load_in_4bit`) or AutoGPTQ quantization, run these containers instead, which include those respective libraries installed on top of Transformers:

- `auto_gptq` (depends on Transformers)
- `bitsandbytes` (depends on Transformers)
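For example (a minimal sketch, assuming the same `run.sh`/`autotag` workflow used in the benchmark commands below), the `bitsandbytes` container can be launched in place of the base `transformers` container:

```bash
# Launch the bitsandbytes container, which layers the bitsandbytes library
# on top of Transformers so that 8-bit/4-bit quantization is available.
# (assumes 'bitsandbytes' resolves as a container name for autotag, per the list above)
./run.sh $(./autotag bitsandbytes)
```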
Substitute the text-generation model that you want to run (it should be a CausalLM model like GPT, Llama, etc.):
```bash
./run.sh $(./autotag transformers) \
  huggingface-benchmark.py --model=gpt2
```
If the model repository is private or requires authentication, add `--env HUGGINGFACE_TOKEN=<YOUR-ACCESS-TOKEN>` to the command.
By default, the performance is measured for generating 128 new output tokens (this can be set with `--tokens=N`). The prompt can be changed with `--prompt='your prompt here'`.
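For example, the output length and prompt options can be combined in a single benchmark run (the values below are purely illustrative):

```bash
# Benchmark GPT-2, generating 256 new tokens from a custom prompt
# (the prompt text and token count are arbitrary example values)
./run.sh $(./autotag transformers) \
  huggingface-benchmark.py --model=gpt2 \
    --tokens=256 --prompt='Once upon a time,'
```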
Use the `--precision` argument to enable quantization (options are: `fp32`, `fp16`, `fp4`, `int8`; the default is `fp16`).
If you're using `fp4` or `int8`, run the `bitsandbytes` container as noted above so that the bitsandbytes package is installed to perform the quantization. It's expected that 4-bit/8-bit quantization will be slower through Transformers than FP16 (while consuming less memory) - see here for more info.
Other libraries like `exllama`, `awq`, and `AutoGPTQ` have custom CUDA kernels and more efficient quantized performance.
- First request access from https://ai.meta.com/llama/
- Then create a HuggingFace account, and request access to one of the Llama2 models there like https://huggingface.co/meta-llama/Llama-2-7b-hf (doing this will get you access to all the Llama2 models)
- Get a User Access Token from https://huggingface.co/settings/tokens
```bash
./run.sh --env HUGGINGFACE_TOKEN=<YOUR-ACCESS-TOKEN> $(./autotag transformers) \
  huggingface-benchmark.py --model=meta-llama/Llama-2-7b-hf
```