
The HuggingFace Transformers library supports a wide variety of NLP and vision models through a convenient API, and is used by many of the other LLM packages. A large number of compatible models are available on HuggingFace Hub.

Note

If you wish to use Transformers' integrated bitsandbytes quantization (load_in_8bit/load_in_4bit) or AutoGPTQ quantization, run the bitsandbytes or AutoGPTQ containers instead, which include those respective libraries installed on top of Transformers.

Text Generation Benchmark

Substitute the text-generation model that you want to run (it should be a CausalLM model like GPT, Llama, etc.):

./run.sh $(./autotag transformers) \
   huggingface-benchmark.py --model=gpt2

If the model repository is private or requires authentication, add --env HUGGINGFACE_TOKEN=<YOUR-ACCESS-TOKEN>

By default, the performance is measured for generating 128 new output tokens (this can be set with --tokens=N)

The prompt can be changed with --prompt='your prompt here'
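For example, a longer benchmark run with a custom prompt might look like this (the prompt text and token count here are just illustrative values):

./run.sh $(./autotag transformers) \
   huggingface-benchmark.py --model=gpt2 --tokens=256 --prompt='Once upon a time,'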

Precision / Quantization

Use the --precision argument to enable quantization (options are: fp32, fp16, fp4, int8; the default is fp16)

If you're using fp4 or int8, run the bitsandbytes container as noted above, so that the bitsandbytes package is installed to perform the quantization. 4-bit/8-bit quantization is expected to be slower through Transformers than FP16 (while consuming less memory) - see here for more info.
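For instance, an int8 benchmark run might look like the sketch below, assuming the quantization-enabled container is tagged bitsandbytes as referenced in the note above:

./run.sh $(./autotag bitsandbytes) \
   huggingface-benchmark.py --model=gpt2 --precision=int8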

Other libraries like exllama, awq, and AutoGPTQ have custom CUDA kernels and more efficient quantized performance.

Llama2

./run.sh --env HUGGINGFACE_TOKEN=<YOUR-ACCESS-TOKEN> $(./autotag transformers) \
   huggingface-benchmark.py --model=meta-llama/Llama-2-7b-hf