diff --git a/docs/data_science_tools/python_snippets.md b/docs/data_science_tools/python_snippets.md
index f13428c..583347e 100644
--- a/docs/data_science_tools/python_snippets.md
+++ b/docs/data_science_tools/python_snippets.md
@@ -499,6 +499,20 @@ def send_message_to_slack(message):
send_message_to_slack("test")
```
+## Colab Snippets
+
+- [Google Colab](https://colab.research.google.com/) is the go-to place for many data scientists and machine learning engineers who are looking to perform quick analysis or training for free. Below are some snippets that can be useful in Colab.
+
+- If you get `NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968` or a similar error when running `!pip install` or other CLI commands in Google Colab, you can fix it by running the snippet below before the `!pip install` call. Note that this can break some imports, so make sure to import all required packages before running it.
+
+```python linenums="1"
+import locale
+locale.getpreferredencoding = lambda: "UTF-8"
+
+# now import
+# !import ...
+```
+
diff --git a/docs/imgs/ml_modelcompression_quant_awq.png b/docs/imgs/ml_modelcompression_quant_awq.png
new file mode 100644
index 0000000..ae6c6fa
Binary files /dev/null and b/docs/imgs/ml_modelcompression_quant_awq.png differ
diff --git a/docs/imgs/ml_modelcompression_quant_awq2.png b/docs/imgs/ml_modelcompression_quant_awq2.png
new file mode 100644
index 0000000..b5f89de
Binary files /dev/null and b/docs/imgs/ml_modelcompression_quant_awq2.png differ
diff --git a/docs/imgs/ml_quantization_thebloke_llama.png b/docs/imgs/ml_quantization_thebloke_llama.png
new file mode 100644
index 0000000..0261cc5
Binary files /dev/null and b/docs/imgs/ml_quantization_thebloke_llama.png differ
diff --git a/docs/machine_learning/ML_snippets.md b/docs/machine_learning/ML_snippets.md
index 14f1293..8a3f83d 100644
--- a/docs/machine_learning/ML_snippets.md
+++ b/docs/machine_learning/ML_snippets.md
@@ -276,6 +276,10 @@ torch.cuda.get_device_name(0)
## Output: 'GeForce MX110'
```
+## Monitor GPU usage
+
+- If you want to continuously monitor GPU usage, you can use the `watch -n 2 nvidia-smi --id=0` command. This refreshes the `nvidia-smi` output for GPU 0 every 2 seconds.
+
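+- If you prefer to check usage from inside a Python script, below is a minimal sketch using PyTorch's built-in CUDA memory queries *(assumes PyTorch is installed and a CUDA device is visible)*.
+
+```python linenums="1"
+# minimal sketch: query GPU memory programmatically (assumes a CUDA device is available)
+import torch
+
+device = torch.device("cuda:0")
+
+# memory currently allocated/reserved by PyTorch vs. total memory on the card
+allocated_gb = torch.cuda.memory_allocated(device) / 1024**3
+reserved_gb = torch.cuda.memory_reserved(device) / 1024**3
+total_gb = torch.cuda.get_device_properties(device).total_memory / 1024**3
+
+print(f"Allocated: {allocated_gb:.2f} GB | Reserved: {reserved_gb:.2f} GB | Total: {total_gb:.2f} GB")
+```
+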
## HuggingFace Tokenizer
- Tokenizer is a pre-processing step that converts the text into a sequence of tokens. [HuggingFace tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer) is a wrapper around the [tokenizers library](https://github.com/huggingface/tokenizers), that contains multiple base algorithms for fast tokenization.
@@ -309,6 +313,74 @@ vocabulary = tokenizer.get_vocab()
# vocabulary['hello'] returns 7592
```
+## Explore Model
+
+- To inspect a model's architecture, you can use the `summary` method in Keras, which shows the layers, their output shapes and the number of parameters in each layer. In PyTorch, printing the model object gives a similar layer-by-layer view.
+
+=== "Keras"
+ ``` python linenums="1"
+ # import
+ from keras.models import Sequential
+ from keras.layers import Dense, Conv2D, MaxPooling2D, Flatten
+
+ # create a model
+ model = Sequential()
+ model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
+ model.add(MaxPooling2D((2, 2)))
+ model.add(Conv2D(64, (3, 3), activation='relu'))
+ model.add(MaxPooling2D((2, 2)))
+ model.add(Conv2D(64, (3, 3), activation='relu'))
+ model.add(Flatten())
+ model.add(Dense(64, activation='relu'))
+    model.add(Dense(10, activation='softmax'))
+
+ # print the model summary
+ model.summary()
+ ```
+
+=== "PyTorch"
+ ``` python linenums="1"
+    # import
+    import torch
+    import torch.nn as nn
+    import torch.nn.functional as F
+
+ # create a model
+
+ class Net(nn.Module):
+ def __init__(self):
+ super(Net, self).__init__()
+ self.conv1 = nn.Conv2d(1, 32, 3, 1)
+ self.conv2 = nn.Conv2d(32, 64, 3, 1)
+ self.conv3 = nn.Conv2d(64, 64, 3, 1)
+ self.fc1 = nn.Linear(1024, 64)
+ self.fc2 = nn.Linear(64, 10)
+
+ def forward(self, x):
+ x = F.relu(self.conv1(x))
+ x = F.max_pool2d(x, 2, 2)
+ x = F.relu(self.conv2(x))
+ x = F.max_pool2d(x, 2, 2)
+ x = F.relu(self.conv3(x))
+ x = x.view(-1, 1024)
+ x = F.relu(self.fc1(x))
+ x = self.fc2(x)
+ return F.log_softmax(x, dim=1)
+
+ # create an instance of the model
+ model = Net()
+ # print the model summary
+ print(model)
+ ```
+
+- To check the named parameters of the model and their dtypes, you can use the following code,
+
+=== "PyTorch"
+ ``` python linenums="1"
+ print(f"Total number of names params: {len(list(model.named_parameters()))}")
+ print("They are - ")
+ for name, param in model.named_parameters():
+ print(name, param.dtype)
+ ```
+
+
+
+Now, let's look into some of the popular quantization methods and their practical details.
+
+### AWQ
+
+Activation-aware Weight Quantization (AWQ) [3], introduced in June 2023, is a PTQ-type, weight-only quantization method built on the observation that not all weights are equally important for a model's performance. AWQ identifies these salient weights using the activation distribution: weights that see larger activation magnitudes are deemed more important. The authors found that keeping just a small fraction (~1%) of weights unquantized (in FP16) results in almost no loss in model performance. While this is a crucial observation, partial quantization leads to mixed-precision data types, which many hardware architectures do not handle efficiently. To avoid this complexity, AWQ introduces a per-channel scaling technique applied before quantization: multiply the weights by a scale $s$ *(usually greater than 1, found via grid search)* and inverse-scale the activations, i.e. multiply them by $1/s$. This simple trick keeps the layer output mathematically unchanged, removes the need for mixed-precision data types, and matches the performance of keeping 1% of the weights in FP16.
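+
+To build intuition for the scaling trick, below is a toy NumPy sketch *(illustrative only, not the actual AWQ implementation)*: scaling a few "salient" input channels' weights up by $s$ and their activations down by $1/s$ leaves the full-precision output unchanged, and lets you compare the quantization error with and without the scaling.
+
+```python linenums="1"
+# toy sketch of AWQ-style per-channel scaling (illustrative only)
+import numpy as np
+
+def fake_quant(w, bits=4):
+    # symmetric round-to-nearest quantization with one scale per output row (toy version)
+    qmax = 2 ** (bits - 1) - 1
+    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
+    return np.round(w / scale).clip(-qmax - 1, qmax) * scale
+
+rng = np.random.default_rng(0)
+W = rng.normal(size=(16, 64))   # weight matrix
+x = rng.normal(size=64)         # input activations
+
+s = np.ones(64)
+s[:4] = 2.0                     # scale up a few "salient" input channels (AWQ finds s via grid search)
+
+y_fp  = W @ x                           # reference full-precision output
+y_rtn = fake_quant(W) @ x               # plain round-to-nearest quantization
+y_awq = fake_quant(W * s) @ (x / s)     # AWQ-style: scale weights up, activations down
+
+print("RTN error :", np.abs(y_fp - y_rtn).mean())
+print("AWQ error :", np.abs(y_fp - y_awq).mean())
+```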
+
+
+
+Empirical evidence demonstrates AWQ's superiority over existing quantization techniques, achieving remarkable speedups and facilitating the deployment of large models on constrained hardware environments. Notably, AWQ has enabled the efficient deployment of massive LLMs, such as the Llama-2-13B model, on single GPU platforms with limited memory (~8GB), and has achieved significant performance gains across diverse LLMs with varying parameter sizes.
+
+
+
+Running inference on an AWQ model can be done using the `transformers` library. Below is an example of how to use an AWQ-quantized model for inference.
+
+```python linenums="1"
+# install
+# !pip install autoawq
+
+# import
+import time
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+# AWQ quantized model
+model_id = "TheBloke/Llama-2-7B-Chat-AWQ"
+
+# load model
+model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0")
+# load tokenizer
+tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)
+# tokenize the prompt
+tokens = tokenizer(
+ "Tell me a joke",
+ return_tensors='pt'
+).input_ids.cuda()
+
+# generating output
+start_time = time.time()
+generation_output = model.generate(
+ tokens,
+ temperature=0.7,
+ max_new_tokens=50
+)
+end_time = time.time()
+# the output
+print("Output: ", tokenizer.decode(generation_output[0]))
+# calc and print the speed
+# Calculate the number of tokens generated
+num_tokens = len(generation_output[0]) - tokens.shape[1]
+# Calculate the tokens per second
+tokens_per_second = num_tokens / (end_time - start_time)
+print("Tokens per second:", tokens_per_second)
+```
+
+Further, if you want to quantize a model and save it, you can use the below code ([*Code Source*](https://github.com/casper-hansen/AutoAWQ/blob/main/examples/quantize.py)).
+
+```python linenums="1"
+# imports
+from awq import AutoAWQForCausalLM
+from transformers import AutoTokenizer
+
+# model and quantization path
+model_path = 'mistralai/Mistral-7B-Instruct-v0.2'
+quant_path = 'mistral-instruct-v0.2-awq'
+# set quantization config
+quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
+
+# Load model
+model = AutoAWQForCausalLM.from_pretrained(
+ model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
+)
+# Load tokenizer
+tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+
+# Quantize
+model.quantize(tokenizer, quant_config=quant_config)
+
+# Save quantized model
+model.save_quantized(quant_path)
+tokenizer.save_pretrained(quant_path)
+```
+
+We can modify the config based on our requirements, but the important settings are `zero_point` *(whether to use zero-point quantization)*, `q_group_size` *(the group size)*, `w_bit` *(the weight bit width)* and `version` *(the AWQ kernel version)*. The quantized model will be saved at `quant_path`. As part of AWQ quantization, calibration is needed to identify the salient weights; this is done on a small set of data so that the model does not lose its generalization ability. The above code uses the default dataset *(512 samples of [mit-han-lab/pile-val-backup](https://huggingface.co/datasets/mit-han-lab/pile-val-backup))*, but if you want to use your own dataset, you can refer to [this code](https://github.com/casper-hansen/AutoAWQ/pull/27/commits/69d31edcd87318bb4dc1bcfff0c832df135e3208).
+
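+As a rough sketch *(the exact argument name may differ across AutoAWQ versions, so verify against the version you have installed)*, passing your own calibration texts could look like the following, where `my_texts` is a hypothetical list of raw strings from your domain:
+
+```python linenums="1"
+# hedged sketch: custom calibration data for AWQ
+# (the `calib_data` argument is an assumption -- check your AutoAWQ version)
+my_texts = [
+    "A domain-specific sentence used for calibration.",
+    "Another representative sample from the target domain.",
+    # ... ideally a few hundred samples
+]
+
+model.quantize(tokenizer, quant_config=quant_config, calib_data=my_texts)
+```
+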
+!!! Note
+ AWQ models come in multiple flavors and you can choose the version best suited for your need. As per [AutoAWQ Github Repo](https://github.com/casper-hansen/AutoAWQ) these are,
+
+ - **GEMM**: Much faster than FP16 at batch sizes below 8 *(good with large contexts)*.
+ - **GEMV**: 20% faster than GEMM, only batch size 1 *(not good for large context)*.
+
+!!! Note
+    You need a GPU with [compute capability](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities) >= 7 to run AWQ models, as they are optimized for GPU inference. You can check your GPU's compute capability [here](https://developer.nvidia.com/cuda-gpus#compute).
+
+### GPTQ
+
+GPTQ [5], introduced in October 2022, is a PTQ-type, one-shot weight quantization method designed to efficiently and accurately compress GPT-style models, even very large ones such as the 175-billion-parameter GPT-3. GPTQ uses approximate second-order information to reduce the weight bitwidth to 3 or 4 bits with minimal loss in accuracy compared to the uncompressed model. This significantly improves upon previous one-shot quantization approaches, roughly doubling the compression gains while maintaining accuracy, and it enabled the execution of a 175-billion-parameter model on a single GPU for generative inference for the first time. GPTQ also demonstrates reasonable accuracy even when weights are quantized to 2 bits or to a ternary level. Experimental results show that GPTQ can accelerate end-to-end inference by approximately 3.25x on high-end GPUs (like the NVIDIA A100) and by 4.5x on more cost-effective GPUs (like the NVIDIA A6000).
+
+Here is how you can run inference on a GPTQ model.
+
+```python linenums="1"
+# Install
+!pip install auto-gptq optimum
+
+# Imports
+import time
+from transformers import AutoTokenizer
+from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
+
+# Model and tokenizer
+model_id = "TheBloke/Llama-2-7b-Chat-GPTQ"
+# load quantized model
+model = AutoGPTQForCausalLM.from_quantized(model_id, device="cuda:0")
+# load tokenizer
+tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
+
+# Inference
+tokens = tokenizer("Tell me a joke.\nJoke:", return_tensors="pt").to(model.device)
+
+# Calculate the time taken to generate the output
+start_time = time.time()
+generation_output = model.generate(
+ **tokens,
+ temperature=0.001,
+ max_new_tokens=50
+)
+end_time = time.time()
+
+print("Output: ", tokenizer.decode(generation_output[0]))
+# Calculate the number of tokens generated
+num_tokens = len(generation_output[0]) - tokens['input_ids'].shape[1]
+# Calculate the tokens per second
+tokens_per_second = num_tokens / (end_time - start_time)
+print("Tokens per second:", tokens_per_second)
+
+# or you can also use a pipeline
+# from transformers import TextGenerationPipeline
+# pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
+# print(pipeline("auto-gptq is")[0]["generated_text"])
+```
+
+Quantizing a model with GPTQ can be done using the code below.
+
+```python linenums="1"
+# install
+!pip install auto-gptq optimum
+
+# imports
+from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
+from transformers import AutoTokenizer
+
+# model and quantization path
+pretrained_model_name = "facebook/opt-125m"
+quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
+model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_name, quantize_config)
+tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
+
+# example for quantization calibration (here we have only provided one, in reality provide multiple)
+examples = [
+ tokenizer(
+ "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
+ )
+]
+
+# quantize
+model.quantize(examples)
+
+# save quantized model
+quantized_model_dir = "opt-125m-4bit-128g"
+model.save_quantized(quantized_model_dir)
+```
+
+By default, the saved file type is `.bin`, but you can also set `use_safetensors=True` to save a `.safetensors` model file. The base name of the model file saved this way follows the format `gptq_model-{bits}bit-{group_size}g`. The pretrained model's config and the quantize config are also saved as `config.json` and `quantize_config.json`, respectively. [(Refer)](https://github.com/AutoGPTQ/AutoGPTQ/blob/main/docs/tutorial/01-Quick-Start.md)
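+
+For example, a minimal sketch of saving in safetensors format *(reusing the `model` and `quantized_model_dir` from the snippet above)*:
+
+```python linenums="1"
+# save the quantized model in safetensors format instead of the default .bin
+model.save_quantized(quantized_model_dir, use_safetensors=True)
+```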
+
+### BitsAndBytes
+
+BitsAndBytes [7] is a Python package that performs ZSQ to convert models to 8-bit or 4-bit representations. To load a model in 4-bit quantization with the `transformers` library, simply set the `load_in_4bit=True` flag and specify `device_map="auto"` in the `from_pretrained` method. This automatically infers an optimal device map, facilitating efficient model loading. For example, loading a model can be done as follows:
+
+``` python linenums="1"
+# import
+from transformers import BitsAndBytesConfig, AutoModelForCausalLM
+# load model
+model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", \
+ load_in_4bit=True, device_map="auto")
+```
+
+!!! Note
+ It's important not to manually assign the device after loading the model with a device map to avoid potential issues.
+
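+The same flag-based approach also supports 8-bit loading (LLM.int8() [8]). A minimal sketch, assuming the same example model as above:
+
+```python linenums="1"
+# import
+from transformers import BitsAndBytesConfig, AutoModelForCausalLM
+# 8-bit config (LLM.int8())
+int8_config = BitsAndBytesConfig(load_in_8bit=True)
+# load model in 8-bit
+model_8bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m",
+                                                  quantization_config=int8_config,
+                                                  device_map="auto")
+```
+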
+Quantized models automatically cast submodules to `float16`, but this can be modified *(e.g., keeping layer norms in float32)* by specifying `torch_dtype` in the `from_pretrained` method. For those interested in exploring beyond the basics, various 4bit quantization types are available, such as NF4 *(normalized float 4)* or pure FP4, with NF4 generally recommended for its performance benefits. Additional features like double quantization via `bnb_4bit_use_double_quant` can save extra bits per parameter *(by enabling a second round of quantization to further compress the model)*, and the computation precision (`bnb_4bit_compute_dtype`) can be adjusted to balance between speed and resource usage.
+
+Advanced configurations, such as NF4 quantization with double quantization and altered compute dtype for faster training, are facilitated through the `BitsAndBytesConfig` class. For instance, configuring a model with NF4 quantization and bfloat16 compute dtype can be achieved as follows:
+
+```python linenums="1"
+# import
+import torch
+from transformers import BitsAndBytesConfig, AutoModelForCausalLM
+# define config
+nf4_config = BitsAndBytesConfig(load_in_4bit=True,
+ bnb_4bit_quant_type="nf4",
+ bnb_4bit_use_double_quant=True,
+ bnb_4bit_compute_dtype=torch.bfloat16)
+# load model (reusing the example model id from above)
+model_id = "facebook/opt-350m"
+model_nf4 = AutoModelForCausalLM.from_pretrained(model_id,
+                                quantization_config=nf4_config)
+```
+
+!!! Hint
+ Use double quant only if you have problems with memory, use NF4 for higher precision, and use a 16-bit dtype for faster finetuning.
+ [Refer](https://huggingface.co/blog/4bit-transformers-bitsandbytes)
+
+!!! Hint
+ If your hardware supports it, `bf16` is the optimal compute dtype. The default is `float32` for backward compatibility and numerical stability. `float16` often leads to numerical instabilities, but `bfloat16` provides the benefits of both worlds: numerical stability equivalent to `float32`, but combined with the memory footprint and significant computation speedup of a 16-bit data type. Therefore, be sure to check if your hardware supports `bf16` and configure it using the `bnb_4bit_compute_dtype` parameter in `BitsAndBytesConfig`.
+ [Refer](https://huggingface.co/docs/bitsandbytes/main/en/integrations)
+
+### GGML/GGUF
+
+GGUF [9] *(the older version was called GGML)* is a file format developed by [Georgi Gerganov](https://github.com/ggerganov) for the rapid loading and saving of models, with a focus on being easy to read. The format is designed as a single-file deployment format *(it contains all the information necessary to load the model)*, is compatible with memory-mapped files (mmap), and is extensible *(new features can be added without breaking compatibility with older models)*. It was initially developed for [whisper.cpp](https://github.com/ggerganov/whisper.cpp) and later extended into [llama.cpp](https://github.com/ggerganov/llama.cpp). This made it the go-to format for transformer models *(the backbone of all LLMs today)*, and it is especially suited for anyone who wants to run models locally, on edge devices, or on machines with limited GPU memory.
+
+With the advent of quantization for model compression, especially for running LLMs on low-memory devices, several quantization techniques were added to the `llama.cpp` package. Below is a table that summarizes these quantization types along with helpful practical details.
+
+| ID | Quantization Type | Size | Perplexity Increase @ 7B | Quality Loss Level | Recommendation |
+|----|-------------------|-------|-------------------------|----------------------------|------------------------|
+| 2 | Q4_0 | 3.50G | +0.2499 | Very high | Legacy, use Q3_K_M |
+| 3 | Q4_1 | 3.90G | +0.1846 | Substantial | Legacy, use Q3_K_L |
+| 8 | Q5_0 | 4.30G | +0.0796 | Balanced | Legacy, use Q4_K_M |
+| 9 | Q5_1 | 4.70G | +0.0415 | Low | Legacy, use Q5_K_M |
+| 10 | Q2_K | 2.67G | +0.8698 | Extreme | Not recommended |
+| 11 | Q3_K_S | 2.75G | +0.5505 | Very high | |
+| 12 | Q3_K_M or Q3_K | 3.06G | +0.2437 | Very high | |
+| 13 | Q3_K_L | 3.35G | +0.1803 | Substantial | |
+| 14 | Q4_K_S | 3.56G | +0.1149 | Significant | |
+| 15 | Q4_K_M or Q4_K | 3.80G | +0.0535 | Balanced | *Recommended* |
+| 16 | Q5_K_S | 4.33G | +0.0353 | Low | *Recommended* |
+| 17 | Q5_K_M or Q5_K | 4.45G | +0.0142 | Very low | *Recommended* |
+| 18 | Q6_K | 5.15G | +0.0044 | Extremely low | |
+| 7 | Q8_0 | 6.70G | +0.0004 | Extremely low | Not recommended |
+| 1 | F16 | 13.00G| | Virtually no | Not recommended |
+| 0 | F32 | 26.00G| | Lossless | Not recommended |
+
+!!! Hint
+ While selecting which quantization version to use, it is important to consider the trade-off between model size and quality. It also depends on the specific use case and the available resources. That said, if you are looking for something small and fast that does not compromise a lot on quality, `Q5_K_M` is a good choice.
+
+Furthermore, it is good to know the two types of quantization supported in `llama.cpp`: "type-0", where weights `w` are reconstructed from quants `q` as $w = d * q$ with `d` being the block scale, and "type-1", where weights are given by $w = d * q + m$ with `m` being the block minimum. The naming convention for quantized models is `Q{bits}_K_{type}` or `Q{bits}_{type}`, where `bits` is the number of bits, `type` indicates the quantization variant, and the presence of `K` denotes that the newer k-quant technique is used. The `S`, `M`, and `L` suffixes indicate the relative size of the resulting model, from smallest to largest. This [PR comment](https://github.com/ggerganov/llama.cpp/pull/1684) provides further details on the underlying techniques:
+
+``` markdown
+The following new quantization types are added to ggml:
+
+GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weight. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw)
+GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This end up using 3.4375 bpw.
+GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
+GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K resulting in 5.5 bpw
+GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw
+GGML_TYPE_Q8_K - "type-0" 8-bit quantization. Only used for quantizing intermediate results. The difference to the existing Q8_0 is that the block size is 256. All 2-6 bit dot products are implemented for this quantization type.
+
+
+This is exposed via llama.cpp quantization types that define various "quantization mixes" as follows:
+
+LLAMA_FTYPE_MOSTLY_Q2_K - uses GGML_TYPE_Q4_K for the attention.vw and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors.
+LLAMA_FTYPE_MOSTLY_Q3_K_S - uses GGML_TYPE_Q3_K for all tensors
+LLAMA_FTYPE_MOSTLY_Q3_K_M - uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K
+LLAMA_FTYPE_MOSTLY_Q3_K_L - uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K
+LLAMA_FTYPE_MOSTLY_Q4_K_S - uses GGML_TYPE_Q4_K for all tensors
+LLAMA_FTYPE_MOSTLY_Q4_K_M - uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K
+LLAMA_FTYPE_MOSTLY_Q5_K_S - uses GGML_TYPE_Q5_K for all tensors
+LLAMA_FTYPE_MOSTLY_Q5_K_M - uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K
+LLAMA_FTYPE_MOSTLY_Q6_K- uses 6-bit quantization (GGML_TYPE_Q8_K) for all tensors
+```
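+
+To make the "type-0" and "type-1" formulas above concrete, here is a small illustrative NumPy sketch of block-wise quantization and dequantization *(a simplified toy, not the actual `llama.cpp` kernels)*:
+
+```python linenums="1"
+# toy block-wise quantization/dequantization: "type-0" (w = d * q) vs "type-1" (w = d * q + m)
+import numpy as np
+
+block = np.array([0.12, 0.55, 0.31, 0.90, 0.47, 0.05, 0.73, 0.28])  # one block of FP32 weights
+
+# --- type-0: block scale only ---
+d = block.max() / 15                     # toy 4-bit scale (quants in 0..15 for this positive block)
+q = np.round(block / d).astype(int)      # integer quants
+w0 = d * q                               # reconstructed weights
+
+# --- type-1: block scale + block minimum ---
+m = block.min()                          # block minimum
+d1 = (block.max() - m) / 15              # scale covers the range above the minimum
+q1 = np.round((block - m) / d1).astype(int)
+w1 = d1 * q1 + m                         # reconstructed weights
+
+print("type-0 max reconstruction error:", np.abs(block - w0).max())
+print("type-1 max reconstruction error:", np.abs(block - w1).max())
+```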
+
+
+Fortunately, it is common practice to release all quantized variants when open-sourcing a model, as is evident from the GGUF models uploaded in [TheBloke's collection](https://huggingface.co/TheBloke) on HuggingFace.
+
+
+
+A GGUF model can be loaded using the `ctransformers` library, and additional details such as which quantization version to load can be specified. Below is an example of how to load a model with the `Q4_K_M` quantization version.
+
+
+```python linenums="1"
+## Install (run one of the below in a shell)
+# Base ctransformers with no GPU acceleration
+# pip install ctransformers>=0.2.24
+# Or with CUDA GPU acceleration
+# pip install ctransformers[cuda]>=0.2.24
+# Or with ROCm GPU acceleration
+# CT_HIPBLAS=1 pip install ctransformers>=0.2.24 --no-binary ctransformers
+# Or with Metal GPU acceleration for macOS systems
+# CT_METAL=1 pip install ctransformers>=0.2.24 --no-binary ctransformers
+
+## Import
+from ctransformers import AutoModelForCausalLM
+
+## Load the model
+# Set gpu_layers to the number of layers to offload to GPU.
+# Set to 0 if no GPU acceleration is available on your system.
+llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7b-Chat-GGUF",
+ model_file="llama-2-7b-chat.q4_K_M.gguf",
+ model_type="llama", gpu_layers=50)
+
+## Run inference
+print(llm("AI is going to"))
+```
+
+Quantizing your own model to GGUF can be done quite easily using the `llama.cpp` repository. Below is an example [9]:
+
+```python linenums="1"
+# Install llama.cpp
+!git clone https://github.com/ggerganov/llama.cpp
+!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
+!pip install -r llama.cpp/requirements.txt
+
+# Manual - Download the model to quantize (`.bin` format)
+
+# Convert to fp16 (as by default it is f32)
+!python llama.cpp/convert.py "pytorch_model-00001-of-00001.bin" --outtype f16 --outfile "pytorch_model.fp16.bin"
+
+# quantize
+!./llama.cpp/quantize "pytorch_model.fp16.bin" "pytorch_model.q5_k_m.gguf" "q5_k_m"
+```
+
## References
-[1] [A Survey of Quantization Methods for Efficient Neural Network Inference](https://arxiv.org/abs/2103.13630)
\ No newline at end of file
+[1] [A Survey of Quantization Methods for Efficient Neural Network Inference](https://arxiv.org/abs/2103.13630)
+
+[2] Maarten Grootendorst's Blog - [Which Quantization Method is Right for You? (GPTQ vs. GGUF vs. AWQ)](https://www.maartengrootendorst.com/blog/quantization/)
+
+[3] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration - [Paper](https://arxiv.org/abs/2306.00978) | [Official Code](https://github.com/mit-han-lab/llm-awq)
+
+[4] AutoAWQ Github Repo - [Link](https://github.com/casper-hansen/AutoAWQ)
+
+[5] GPTQ - [Paper](https://arxiv.org/abs/2210.17323) | [Official Code](https://github.com/IST-DASLab/gptq)
+
+[6] AutoGPTQ Github Repo - [Link](https://github.com/AutoGPTQ/AutoGPTQ)
+
+[7] BitsAndBytes - [Official Doc](https://huggingface.co/docs/bitsandbytes/main/en/index) | [Support for 4-bit and QLora Blog](https://huggingface.co/blog/4bit-transformers-bitsandbytes) | [HuggingFace Integration Blog](https://huggingface.co/blog/hf-bitsandbytes-integration)
+
+[8] LLM.int8() - [Blog](https://huggingface.co/blog/hf-bitsandbytes-integration)
+
+[9] GGUF/GGML - [Official Docs](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) | [Blog - Quantize Llama_2 models using GGML](https://mlabonne.github.io/blog/posts/Quantize_Llama_2_models_using_ggml.html) | [K Quants](https://github.com/ggerganov/llama.cpp/pull/1684)
\ No newline at end of file