diff --git a/docs/data_science_tools/python_snippets.md b/docs/data_science_tools/python_snippets.md
index f13428c..583347e 100644
--- a/docs/data_science_tools/python_snippets.md
+++ b/docs/data_science_tools/python_snippets.md
@@ -499,6 +499,20 @@ def send_message_to_slack(message):
 send_message_to_slack("test")
 ```
+## Colab Snippets
+
+- [Google Colab](https://colab.research.google.com/) is the go-to place for many data scientists and machine learning engineers who are looking to perform quick analysis or training for free. Below are some snippets that can be useful in Colab.
+
+- If you are getting `NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968` or a similar error when trying to run `!pip install` or similar CLI commands in Google Colab, you can fix it by running the following snippet before running `!pip install`. Note that this might break some imports, so make sure to import all the required packages before running it.
+
+```python linenums="1"
+import locale
+locale.getpreferredencoding = lambda: "UTF-8"
+
+# now run the install command
+# !pip install ...
+```
+
diff --git a/docs/imgs/ml_modelcompression_quant_awq.png b/docs/imgs/ml_modelcompression_quant_awq.png
new file mode 100644
index 0000000..ae6c6fa
Binary files /dev/null and b/docs/imgs/ml_modelcompression_quant_awq.png differ
diff --git a/docs/imgs/ml_modelcompression_quant_awq2.png b/docs/imgs/ml_modelcompression_quant_awq2.png
new file mode 100644
index 0000000..b5f89de
Binary files /dev/null and b/docs/imgs/ml_modelcompression_quant_awq2.png differ
diff --git a/docs/imgs/ml_quantization_thebloke_llama.png b/docs/imgs/ml_quantization_thebloke_llama.png
new file mode 100644
index 0000000..0261cc5
Binary files /dev/null and b/docs/imgs/ml_quantization_thebloke_llama.png differ
diff --git a/docs/machine_learning/ML_snippets.md b/docs/machine_learning/ML_snippets.md
index 14f1293..8a3f83d 100644
--- a/docs/machine_learning/ML_snippets.md
+++ b/docs/machine_learning/ML_snippets.md
@@ -276,6 +276,10 @@ torch.cuda.get_device_name(0)
 ## Output: 'GeForce MX110'
 ```
+## Monitor GPU usage
+
+- If you want to continuously monitor the GPU usage, you can use the `watch -n 2 nvidia-smi --id=0` command. This will refresh the `nvidia-smi` output every 2 seconds.
+
 ## HuggingFace Tokenizer
 - Tokenizer is a pre-processing step that converts the text into a sequence of tokens. [HuggingFace tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer) is a wrapper around the [tokenizers library](https://github.com/huggingface/tokenizers), that contains multiple base algorithms for fast tokenization.
@@ -309,6 +313,74 @@ vocabulary = tokenizer.get_vocab()
 # vocabulary['hello'] returns 7592
 ```
+## Explore Model
+
+- You can use Keras's `summary` method *(or simply print a PyTorch model)* to check the model's architecture. This will show the layers, their output shapes and the number of parameters in each layer.
+
+=== "Keras"
+    ``` python linenums="1"
+    # import
+    from keras.models import Sequential
+    from keras.layers import Dense, Conv2D, MaxPooling2D, Flatten
+
+    # create a model
+    model = Sequential()
+    model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
+    model.add(MaxPooling2D((2, 2)))
+    model.add(Conv2D(64, (3, 3), activation='relu'))
+    model.add(MaxPooling2D((2, 2)))
+    model.add(Conv2D(64, (3, 3), activation='relu'))
+    model.add(Flatten())
+    model.add(Dense(64, activation='relu'))
+    model.add(Dense(10, activation='softmax'))
+
+    # print the model summary
+    model.summary()
+    ```
+
+=== "PyTorch"
+    ``` python linenums="1"
+    # import
+    import torch
+    import torch.nn as nn
+    import torch.nn.functional as F
+
+    # create a model
+    class Net(nn.Module):
+        def __init__(self):
+            super(Net, self).__init__()
+            self.conv1 = nn.Conv2d(1, 32, 3, 1)
+            self.conv2 = nn.Conv2d(32, 64, 3, 1)
+            self.conv3 = nn.Conv2d(64, 64, 3, 1)
+            self.fc1 = nn.Linear(576, 64)  # 64 channels * 3 * 3 spatial size for a 28x28 input
+            self.fc2 = nn.Linear(64, 10)
+
+        def forward(self, x):
+            x = F.relu(self.conv1(x))
+            x = F.max_pool2d(x, 2, 2)
+            x = F.relu(self.conv2(x))
+            x = F.max_pool2d(x, 2, 2)
+            x = F.relu(self.conv3(x))
+            x = x.view(-1, 576)
+            x = F.relu(self.fc1(x))
+            x = self.fc2(x)
+            return F.log_softmax(x, dim=1)
+
+    # create an instance of the model
+    model = Net()
+    # print the model summary
+    print(model)
+    ```
+
+- To check the named parameters of the model and their dtypes, you can use the following code,
+
+=== "PyTorch"
+    ``` python linenums="1"
+    print(f"Total number of named params: {len(list(model.named_parameters()))}")
+    print("They are - ")
+    for name, param in model.named_parameters():
+        print(name, param.dtype)
+    ```
+
+
+Now, let's look into some of the popular quantization methods and their practical details.
+
+### AWQ
+
+Activation-aware Weight Quantization (AWQ) [3], introduced in Oct 2023, is a PTQ-type, weight-only quantization method based on the fact that not all weights are equally important for the model's performance. With this in mind, AWQ tries to identify those salient weights using the activation distribution, where weights with larger activation magnitudes are deemed crucial for model performance. On further analysis, it was found that if just a minor fraction (~1%) of these salient weights is left unquantized (in FP16), the degradation in model performance is insignificant. While this is a crucial observation, it is also important to note that partial quantization of weights leads to mixed-precision data types, which are not efficiently handled in many hardware architectures. To circumvent these complexities, AWQ introduces a novel per-channel scaling technique that scales the weights *(multiply the weight by a scale $s$ and inverse-scale the activation, i.e. multiply the activation by $1/s$)* before quantization, where $s$ is usually greater than 1 and is determined by a grid search. This minor trick optimizes the quantization process, removes the need for mixed-precision data types, and keeps the performance consistent with the 1% FP16-weights approach. A small numerical sketch of this scaling trick is included after the figure below.
+
+<figure markdown>
+    ![](../imgs/ml_modelcompression_quant_awq.png)
+    <figcaption>Source: [3]</figcaption>
+</figure>
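To build some intuition for why this scaling trick helps, below is a small, self-contained numerical sketch *(a toy simulation, not the actual AWQ implementation, which searches per-channel scales on calibration data)*. It quantizes one weight group with a shared scale and shows that scaling a salient weight up by $s$ before quantization *(and folding $1/s$ into its activation)* reduces that weight's effective quantization error.

```python linenums="1"
# Toy illustration of AWQ-style per-channel scaling (NOT the real AWQ implementation)
import numpy as np

def quantize_group(w, n_bits=3):
    """Symmetric round-to-nearest quantization with one scale shared by the whole group."""
    q_max = 2 ** (n_bits - 1) - 1
    delta = np.abs(w).max() / q_max
    return np.round(w / delta) * delta

rng = np.random.default_rng(0)
s = 2.0   # per-channel scale (AWQ finds this via grid search on calibration data)
j = 7     # index of the "salient" channel (in AWQ it is picked via activation magnitude)

err_plain, err_awq = [], []
for _ in range(1000):
    w = rng.normal(0, 1, 64)                  # one weight group sharing a quantization scale
    # plain quantization
    w_q = quantize_group(w)
    # AWQ-style: scale the salient weight up before quantization ...
    w_s = w.copy()
    w_s[j] *= s
    w_sq = quantize_group(w_s)
    # ... and divide it back, since the matching activation is multiplied by 1/s at runtime
    w_sq[j] /= s
    err_plain.append(abs(w_q[j] - w[j]))
    err_awq.append(abs(w_sq[j] - w[j]))

print(f"avg reconstruction error of the salient weight (plain) : {np.mean(err_plain):.4f}")
print(f"avg reconstruction error of the salient weight (scaled): {np.mean(err_awq):.4f}")
```

Because the group's quantization step is dominated by the other weights, scaling only the salient channel shrinks its rounding error by roughly a factor of $s$ while leaving the rest of the group essentially unchanged.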
+
+Empirical evidence demonstrates AWQ's superiority over existing quantization techniques, achieving remarkable speedups and facilitating the deployment of large models in constrained hardware environments. Notably, AWQ has enabled the efficient deployment of massive LLMs, such as the Llama-2-13B model, on a single GPU with limited memory (~8GB), and has achieved significant performance gains across diverse LLMs with varying parameter sizes.
+
+<figure markdown>
+    ![](../imgs/ml_modelcompression_quant_awq2.png)
+    <figcaption>Better Perplexity score of AWQ on LLaMA-1 and 2 models in comparison with other quantization techniques. Source: [3]</figcaption>
+</figure>
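Since the comparison above is reported in terms of perplexity, here is a rough sketch of how one could measure it for a quantized (or FP16) causal LM on a short piece of text. This is illustrative only and assumes a `model` and `tokenizer` loaded as in the snippet that follows; proper evaluations average the loss over a full corpus with a sliding window.

```python linenums="1"
# rough perplexity estimate for a causal LM on a short text (illustration only)
import torch

def perplexity(model, tokenizer, text):
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # passing labels=input_ids makes the model return the mean next-token cross-entropy
        loss = model(input_ids=enc["input_ids"], labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

# usage (after loading a model and tokenizer as shown below):
# print(perplexity(model, tokenizer, "The quick brown fox jumps over the lazy dog."))
```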
+
+Running inference on an AWQ model can be done using the `transformers` library. Below is an example of how to use an AWQ model for inference.
+
+```python linenums="1"
+# install
+# !pip install autoawq
+
+# import
+import time
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+# AWQ quantized model
+model_id = "TheBloke/Llama-2-7B-Chat-AWQ"
+
+# load model
+model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0")
+# load tokenizer
+tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)
+# tokenize the prompt
+tokens = tokenizer(
+    "Tell me a joke",
+    return_tensors='pt'
+).input_ids.cuda()
+
+# generating output
+start_time = time.time()
+generation_output = model.generate(
+    tokens,
+    temperature=0.7,
+    max_new_tokens=50
+)
+end_time = time.time()
+# the output
+print("Output: ", tokenizer.decode(generation_output[0]))
+# calc and print the speed
+# Calculate the number of tokens generated
+num_tokens = len(generation_output[0]) - tokens.shape[1]
+# Calculate the tokens per second
+tokens_per_second = num_tokens / (end_time - start_time)
+print("Tokens per second:", tokens_per_second)
+```
+
+Further, if you want to quantize a model and save it, you can use the below code ([*Code Source*](https://github.com/casper-hansen/AutoAWQ/blob/main/examples/quantize.py)).
+
+```python linenums="1"
+# imports
+from awq import AutoAWQForCausalLM
+from transformers import AutoTokenizer
+
+# model and quantization path
+model_path = 'mistralai/Mistral-7B-Instruct-v0.2'
+quant_path = 'mistral-instruct-v0.2-awq'
+# set quantization config
+quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
+
+# Load model
+model = AutoAWQForCausalLM.from_pretrained(
+    model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
+)
+# Load tokenizer
+tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+
+# Quantize
+model.quantize(tokenizer, quant_config=quant_config)
+
+# Save quantized model
+model.save_quantized(quant_path)
+tokenizer.save_pretrained(quant_path)
+```
+
+We can modify the config based on our requirements, but the important settings are `zero_point` *(for zero point quantization)*, `q_group_size` *(for group size)*, `w_bit` *(for weight bit)* and `version` *(for the version of AWQ)*. The quantized model will be saved at `quant_path`. As part of AWQ quantization, calibration is needed to identify the salient weights, and this is done on a small set of training data so that the model does not lose its generalization ability. The above code does it with the default dataset *(512 samples of [mit-han-lab/pile-val-backup](https://huggingface.co/datasets/mit-han-lab/pile-val-backup))*, but if you want to use your own dataset, you can refer to [this code](https://github.com/casper-hansen/AutoAWQ/pull/27/commits/69d31edcd87318bb4dc1bcfff0c832df135e3208); a hedged sketch of this is also included after the notes below.
+
+!!! Note
+    AWQ models come in multiple flavors and you can choose the version best suited for your need. As per the [AutoAWQ Github Repo](https://github.com/casper-hansen/AutoAWQ) these are,
+
+    - **GEMM**: Much faster than FP16 at batch sizes below 8 *(good with large contexts)*.
+    - **GEMV**: 20% faster than GEMM, only batch size 1 *(not good for large context)*.
+
+!!! Note
+    You need a GPU with [compute capability](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities) >=7 to run AWQ models as they are optimized for GPU inference. You can check your GPU's compute capability [here](https://developer.nvidia.com/cuda-gpus#compute).
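If you want to calibrate on your own text instead of the default dataset, a minimal sketch is shown below. Note that the `calib_data` argument is an assumption based on recent AutoAWQ examples *(where it accepts a list of text samples)*; verify the exact signature against the linked code and the AutoAWQ version you have installed.

```python linenums="1"
# Sketch: AWQ quantization with custom calibration text
# NOTE: the `calib_data` argument is assumed from recent AutoAWQ examples -- verify for your version
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'mistralai/Mistral-7B-Instruct-v0.2'
quant_path = 'mistral-instruct-v0.2-awq-custom'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_cache=False)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# a few hundred representative, in-domain text samples work well for calibration
calib_samples = [
    "Example sentence from my target domain ...",
    "Another representative text sample ...",
]

# quantize using the custom calibration data and save
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_samples)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```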
+
+### GPTQ
+
+GPTQ [5], introduced in March 2023, is a PTQ-type, one-shot weight quantization method designed to efficiently and accurately compress GPT models, even of bigger sizes such as GPT-3 with 175 billion parameters. GPTQ achieves this by utilizing approximate second-order information to reduce the models' weight bitwidth to 3 or 4 bits, with minimal loss in accuracy compared to the uncompressed model. This method significantly improves upon previous one-shot quantization approaches, doubling the compression gains while maintaining accuracy. As a result, it enables the execution of a 175 billion-parameter model on a single GPU for the first time, facilitating generative inference tasks. Additionally, GPTQ demonstrates reasonable accuracy even when weights are quantized to 2 bits or to a ternary level. Experimental results reveal that GPTQ can accelerate end-to-end inference by approximately 3.25 times on high-end GPUs (like NVIDIA A100) and by 4.5 times on more cost-effective GPUs (like NVIDIA A6000).
+
+Here is how you can run inference on a GPTQ model.
+
+```python linenums="1"
+# Install
+!pip install auto-gptq optimum
+
+# Imports
+import time
+from transformers import AutoTokenizer
+from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
+
+# Model and tokenizer
+model_id = "TheBloke/Llama-2-7b-Chat-GPTQ"
+# load quantized model
+model = AutoGPTQForCausalLM.from_quantized(model_id, device="cuda:0")
+# load tokenizer
+tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
+
+# Inference
+tokens = tokenizer("Tell me a joke.\nJoke:", return_tensors="pt").to(model.device)
+
+# Calculate the time taken to generate the output
+start_time = time.time()
+generation_output = model.generate(
+    **tokens,
+    temperature=0.001,
+    max_new_tokens=50
+)
+end_time = time.time()
+
+print("Output: ", tokenizer.decode(generation_output[0]))
+# Calculate the number of tokens generated
+num_tokens = len(generation_output[0]) - tokens['input_ids'].shape[1]
+# Calculate the tokens per second
+tokens_per_second = num_tokens / (end_time - start_time)
+print("Tokens per second:", tokens_per_second)
+
+# or you can also use pipeline
+# from transformers import TextGenerationPipeline
+# pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
+# print(pipeline("auto-gptq is")[0]["generated_text"])
+
+```
+
+Quantizing a model using GPTQ can be done using the below code.
+
+```python linenums="1"
+# install
+!pip install auto-gptq optimum
+
+# imports
+from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
+from transformers import AutoTokenizer
+
+# model and quantization path
+pretrained_model_name = "facebook/opt-125m"
+quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
+model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_name, quantize_config)
+tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
+
+# example for quantization calibration (here we have only provided one, in reality provide multiple)
+examples = [
+    tokenizer(
+        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
+    )
+]
+
+# quantize
+model.quantize(examples)
+
+# save quantized model
+quantized_model_dir = "opt-125m-4bit-128g"
+model.save_quantized(quantized_model_dir)
+```
+
+By default, the saved file type is `.bin`; you can also set `use_safetensors=True` to save a `.safetensors` model file. The format of the model file base name saved using this method is `gptq_model-{bits}bit-{group_size}g`. The pretrained model's config and the quantize config will also be saved with the file names `config.json` and `quantize_config.json`, respectively. [(Refer)](https://github.com/AutoGPTQ/AutoGPTQ/blob/main/docs/tutorial/01-Quick-Start.md)
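Once saved, the quantized model can be loaded back from the local directory with the same `from_quantized` API used above. A minimal sketch (reusing the `quantized_model_dir` from the previous snippet) is shown below; note that the tokenizer was not saved alongside the quantized weights above, so it is reloaded from the base checkpoint.

```python linenums="1"
# load the locally saved GPTQ model back for inference
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

quantized_model_dir = "opt-125m-4bit-128g"  # directory created by save_quantized above

# load the quantized weights from disk onto the GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")
# the tokenizer was not saved with the quantized model, so load it from the base checkpoint
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

tokens = tokenizer("auto-gptq is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**tokens, max_new_tokens=20)[0]))
```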
+
+### BitsAndBytes
+
+BitsAndBytes [7] is a Python package to perform ZSQ on models, converting them to 8-bit or 4-bit representations. To load a model in 4-bit quantization with the `transformers` library, you simply set the `load_in_4bit=True` flag and specify a `device_map="auto"` when using the `from_pretrained` method. This process automatically infers an optimal device map, facilitating efficient model loading. For example, loading a model can be done as follows:
+
+``` python linenums="1"
+# import
+from transformers import BitsAndBytesConfig, AutoModelForCausalLM
+# load model
+model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", \
+            load_in_4bit=True, device_map="auto")
+```
+
+!!! Note
+    It's important not to manually assign the device after loading the model with a device map to avoid potential issues.
+
+Quantized models automatically cast submodules to `float16`, but this can be modified *(e.g., keeping layer norms in `float32`)* by specifying `torch_dtype` in the `from_pretrained` method. For those interested in exploring beyond the basics, various 4-bit quantization types are available, such as NF4 *(normalized float 4)* or pure FP4, with NF4 generally recommended for its performance benefits. Additional features like double quantization via `bnb_4bit_use_double_quant` can save extra bits per parameter *(by enabling a second round of quantization to further compress the model)*, and the computation precision (`bnb_4bit_compute_dtype`) can be adjusted to balance between speed and resource usage.
+
+Advanced configurations, such as NF4 quantization with double quantization and an altered compute dtype for faster training, are facilitated through the `BitsAndBytesConfig` class. For instance, configuring a model with NF4 quantization and bfloat16 compute dtype can be achieved as follows:
+
+```python linenums="1"
+# import
+import torch
+from transformers import BitsAndBytesConfig, AutoModelForCausalLM
+# model to load (reusing the same checkpoint as above for illustration)
+model_id = "facebook/opt-350m"
+# define config
+nf4_config = BitsAndBytesConfig(load_in_4bit=True,
+                                bnb_4bit_quant_type="nf4",
+                                bnb_4bit_use_double_quant=True,
+                                bnb_4bit_compute_dtype=torch.bfloat16)
+# load model
+model_nf4 = AutoModelForCausalLM.from_pretrained(model_id,
+                                quantization_config=nf4_config)
+```
+
+A quick way to verify the memory savings from 4-bit loading is sketched at the end of this section, after the hints below.
+
+!!! Hint
+    Use double quant only if you have problems with memory, use NF4 for higher precision, and use a 16-bit dtype for faster finetuning.
+    [Refer](https://huggingface.co/blog/4bit-transformers-bitsandbytes)
+
+!!! Hint
+    If your hardware supports it, `bf16` is the optimal compute dtype. The default is `float32` for backward compatibility and numerical stability. `float16` often leads to numerical instabilities, but `bfloat16` provides the best of both worlds: numerical stability equivalent to `float32`, combined with the memory footprint and significant computation speedup of a 16-bit data type. Therefore, be sure to check if your hardware supports `bf16` and configure it using the `bnb_4bit_compute_dtype` parameter in `BitsAndBytesConfig`.
+    [Refer](https://huggingface.co/docs/bitsandbytes/main/en/integrations)
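To sanity-check how much memory 4-bit loading actually saves, you can compare the model's reported memory footprint in its default dtype versus the 4-bit version. Below is a minimal sketch *(reusing the `facebook/opt-350m` checkpoint from the snippets above purely for illustration; `get_memory_footprint()` returns the size of the model's parameters and buffers in bytes)*.

```python linenums="1"
# compare the memory footprint of default vs 4-bit loading
from transformers import AutoModelForCausalLM

model_id = "facebook/opt-350m"

# load in the default dtype (float32 for this checkpoint)
model_full = AutoModelForCausalLM.from_pretrained(model_id)
# load in 4-bit (requires bitsandbytes and a CUDA GPU)
model_4bit = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")

# get_memory_footprint() reports bytes used by parameters and buffers
print(f"default: {model_full.get_memory_footprint() / 1e6:.1f} MB")
print(f"4-bit  : {model_4bit.get_memory_footprint() / 1e6:.1f} MB")
```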
+
+### GGML/GGUF
+
+GGUF [9] *(the older version was called GGML)* is a file format developed by [Georgi Gerganov](https://github.com/ggerganov) specifically for the rapid loading and saving of models, along with a user-friendly approach to model reading. The format is designed to be a single-file deployment format *(as it contains all the necessary information for model loading)*, is compatible with memory-mapped files (mmap), and is extensible *(new information can be added without breaking compatibility with older models)*. It was initially developed for [whisper.cpp](https://github.com/ggerganov/whisper.cpp) and later extended into [llama.cpp](https://github.com/ggerganov/llama.cpp). This made it the go-to format for transformer models *(which are the backbone of all LLMs today)*, and it is especially suited for anyone who wants to run models locally, on edge devices, or on devices with limited GPU memory.
+
+That said, with the advent of quantization for model compression, especially for running LLMs on low-memory devices, several quantization techniques were added into the `llama.cpp` package. Below is a table that summarizes the quantization techniques with more helpful practical details.
+
+| ID | Quantization Type | Size | Perplexity Increase @ 7B | Quality Loss Level | Recommendation |
+|----|-------------------|-------|-------------------------|----------------------------|------------------------|
+| 2 | Q4_0 | 3.50G | +0.2499 | Very high | Legacy, use Q3_K_M |
+| 3 | Q4_1 | 3.90G | +0.1846 | Substantial | Legacy, use Q3_K_L |
+| 8 | Q5_0 | 4.30G | +0.0796 | Balanced | Legacy, use Q4_K_M |
+| 9 | Q5_1 | 4.70G | +0.0415 | Low | Legacy, use Q5_K_M |
+| 10 | Q2_K | 2.67G | +0.8698 | Extreme | Not recommended |
+| 11 | Q3_K_S | 2.75G | +0.5505 | Very high | |
+| 12 | Q3_K_M or Q3_K | 3.06G | +0.2437 | Very high | |
+| 13 | Q3_K_L | 3.35G | +0.1803 | Substantial | |
+| 14 | Q4_K_S | 3.56G | +0.1149 | Significant | |
+| 15 | Q4_K_M or Q4_K | 3.80G | +0.0535 | Balanced | *Recommended* |
+| 16 | Q5_K_S | 4.33G | +0.0353 | Low | *Recommended* |
+| 17 | Q5_K_M or Q5_K | 4.45G | +0.0142 | Very low | *Recommended* |
+| 18 | Q6_K | 5.15G | +0.0044 | Extremely low | |
+| 7 | Q8_0 | 6.70G | +0.0004 | Extremely low | Not recommended |
+| 1 | F16 | 13.00G| | Virtually no | Not recommended |
+| 0 | F32 | 26.00G| | Lossless | Not recommended |
+
+!!! Hint
+    While selecting which quantization version to use, it is important to consider the trade-off between model size and quality. It also depends on the specific use case and the available resources. That said, if you are looking for something small and fast that does not compromise a lot on quality, `Q5_K_M` is a good choice.
+
+Furthermore, it's good to know the two types of quantization supported in `llama.cpp` - "type-0", where weights `w` are obtained from quants `q` using $w = d * q$, where `d` is the block scale, and "type-1", where weights are given by $w = d * q + m$, where `m` is the block minimum. The naming convention of quantized models is `Q{bits}_K_{type}` or `Q{bits}_{type}`, where `bits` is the number of bits, `type` is the type of quantization, and the presence of `K` denotes that the new k-quant technique is used. The `S`, `M`, and `L` in the `type` denote the size variant of the quantization mix, where `S` is the smallest, `M` is medium, and `L` is the largest.
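To make the two dequantization formulas concrete, here is a tiny worked example with made-up block values *(a single scalar `d` and `m` per block, purely for illustration; real k-quants work on super-blocks and also quantize the scales and minimums, as described in the PR excerpt below)*.

```python linenums="1"
# toy dequantization for the two llama.cpp quantization families (made-up numbers)

# "type-0": symmetric, w = d * q, with signed integer quants
q0 = [-3, -1, 0, 2]                    # stored low-bit quants for one block
d0 = 0.05                              # block scale
w_type0 = [d0 * q for q in q0]         # [-0.15, -0.05, 0.0, 0.1]

# "type-1": asymmetric, w = d * q + m, with unsigned quants and a block minimum m
q1 = [0, 2, 5, 7]
d1 = 0.05
m1 = -0.15                             # block minimum (w = m when q = 0)
w_type1 = [d1 * q + m1 for q in q1]    # [-0.15, -0.05, 0.1, 0.2]

print(w_type0)
print(w_type1)
```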
This [PR comment](https://github.com/ggerganov/llama.cpp/pull/1684) provides further details on the underlying techniques as follows,
+
+``` markdown
+The following new quantization types are added to ggml:
+
+GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weight. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw)
+GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This end up using 3.4375 bpw.
+GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
+GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K resulting in 5.5 bpw
+GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw
+GGML_TYPE_Q8_K - "type-0" 8-bit quantization. Only used for quantizing intermediate results. The difference to the existing Q8_0 is that the block size is 256. All 2-6 bit dot products are implemented for this quantization type.
+
+
+This is exposed via llama.cpp quantization types that define various "quantization mixes" as follows:
+
+LLAMA_FTYPE_MOSTLY_Q2_K - uses GGML_TYPE_Q4_K for the attention.vw and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors.
+LLAMA_FTYPE_MOSTLY_Q3_K_S - uses GGML_TYPE_Q3_K for all tensors
+LLAMA_FTYPE_MOSTLY_Q3_K_M - uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K
+LLAMA_FTYPE_MOSTLY_Q3_K_L - uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K
+LLAMA_FTYPE_MOSTLY_Q4_K_S - uses GGML_TYPE_Q4_K for all tensors
+LLAMA_FTYPE_MOSTLY_Q4_K_M - uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K
+LLAMA_FTYPE_MOSTLY_Q5_K_S - uses GGML_TYPE_Q5_K for all tensors
+LLAMA_FTYPE_MOSTLY_Q5_K_M - uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K
+LLAMA_FTYPE_MOSTLY_Q6_K- uses 6-bit quantization (GGML_TYPE_Q8_K) for all tensors
+```
+
+
+Fortunately, it is common practice to quantize all variants before open-sourcing them, as is evident from any of the GGUF models uploaded to [TheBloke's collection](https://huggingface.co/TheBloke) on HuggingFace.
+
+<figure markdown>
+    ![](../imgs/ml_quantization_thebloke_llama.png)
+    <figcaption>LLaMa-2 GGUF version by TheBloke on HuggingFace contains all GGUF quantization versions. [Source](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main)</figcaption>
+</figure>
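If you only need one of these quantized files locally *(e.g., to use it directly with `llama.cpp`)*, it can be fetched with the `huggingface_hub` library. Below is a minimal sketch; the repo and file name are taken from the figure above purely as an example, so double-check the exact file name (including casing) on the model page.

```python linenums="1"
# download a single GGUF file from the HuggingFace Hub
# !pip install huggingface_hub
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf",  # pick the quantization version you need
)
print(local_path)
```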
+
+The model can be loaded using the `ctransformers` library, and additional details like which quantization version to load can be specified. Below is an example of how to load a model with the `Q4_K_M` quantization version.
+
+
+```python linenums="1"
+## Install
+# Base ctransformers with no GPU acceleration
+pip install ctransformers>=0.2.24
+# Or with CUDA GPU acceleration
+pip install ctransformers[cuda]>=0.2.24
+# Or with ROCm GPU acceleration
+CT_HIPBLAS=1 pip install ctransformers>=0.2.24 --no-binary ctransformers
+# Or with Metal GPU acceleration for macOS systems
+CT_METAL=1 pip install ctransformers>=0.2.24 --no-binary ctransformers
+
+## Import
+from ctransformers import AutoModelForCausalLM
+
+## Load the model
+# Set gpu_layers to the number of layers to offload to GPU.
+# Set to 0 if no GPU acceleration is available on your system.
+llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7b-Chat-GGUF",
+        model_file="llama-2-7b-chat.q4_K_M.gguf",
+        model_type="llama", gpu_layers=50)
+
+## Run inference
+print(llm("AI is going to"))
+```
+
+Quantizing a model into the GGUF format yourself can also be done very easily using the `llama.cpp` library. Below is an example [9].
+
+```python linenums="1"
+# Install llama.cpp
+!git clone https://github.com/ggerganov/llama.cpp
+!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
+!pip install -r llama.cpp/requirements.txt
+
+# Manual - Download the model to quantize (`.bin` format)
+
+# Convert to fp16 (as by default it is f32)
+!python llama.cpp/convert.py "pytorch_model-00001-of-00001.bin" --outtype f16 --outfile "pytorch_model.fp16.bin"
+
+# quantize
+!./llama.cpp/quantize "pytorch_model.fp16.bin" "pytorch_model.q5_k_m.gguf" "q5_k_m"
+```
+
 ## References
-[1] [A Survey of Quantization Methods for Efficient Neural Network Inference](https://arxiv.org/abs/2103.13630)
\ No newline at end of file
+[1] [A Survey of Quantization Methods for Efficient Neural Network Inference](https://arxiv.org/abs/2103.13630)
+
+[2] Maarten Grootendorst's Blog - [Which Quantization Method is Right for You? (GPTQ vs. GGUF vs. AWQ)](https://www.maartengrootendorst.com/blog/quantization/)
+
+[3] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration - [Paper](https://arxiv.org/abs/2306.00978) | [Official Code](https://github.com/mit-han-lab/llm-awq)
+
+[4] AutoAWQ Github Repo - [Link](https://github.com/casper-hansen/AutoAWQ)
+
+[5] GPTQ - [Paper](https://arxiv.org/abs/2210.17323) | [Official Code](https://github.com/IST-DASLab/gptq)
+
+[6] AutoGPTQ Github Repo - [Link](https://github.com/AutoGPTQ/AutoGPTQ)
+
+[7] BitsAndBytes - [Official Doc](https://huggingface.co/docs/bitsandbytes/main/en/index) | [Support for 4-bit and QLora Blog](https://huggingface.co/blog/4bit-transformers-bitsandbytes) | [HuggingFace Integration Blog](https://huggingface.co/blog/hf-bitsandbytes-integration)
+
+[8] LLM.int8() - [Blog](https://huggingface.co/blog/hf-bitsandbytes-integration)
+
+[9] GGUF/GGML - [Official Docs](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) | [Blog - Quantize Llama_2 models using GGML](https://mlabonne.github.io/blog/posts/Quantize_Llama_2_models_using_ggml.html) | [K Quants](https://github.com/ggerganov/llama.cpp/pull/1684)
\ No newline at end of file