
Making table format consistent
profvjreddi committed Nov 24, 2023
1 parent ed27595 commit 5ba5ccc
Showing 4 changed files with 27 additions and 27 deletions.
12 changes: 6 additions & 6 deletions efficient_ai.qmd
@@ -101,12 +101,12 @@ Efficient numerics is not just about reducing the bit-width of numbers but under

| Precision | Pros | Cons |
|------------|-----------------------------------------------------------|--------------------------------------------------|
| **FP32** (Floating Point 32-bit) | - Standard precision used in most deep learning frameworks.<br> - High accuracy due to ample representational capacity.<br> - Well-suited for training. | - High memory usage.<br> - Slower inference times compared to quantized models.<br> - Higher energy consumption. |
| **FP16** (Floating Point 16-bit) | - Reduces memory usage compared to FP32.<br> - Speeds up computations on hardware that supports FP16.<br> - Often used in mixed-precision training to balance speed and accuracy. | - Lower representational capacity compared to FP32.<br> - Risk of numerical instability in some models or layers. |
| **INT8** (8-bit Integer) | - Significantly reduced memory footprint compared to floating-point representations.<br> - Faster inference if hardware supports INT8 computations.<br> - Suitable for many post-training quantization scenarios. | - Quantization can lead to some accuracy loss.<br> - Requires careful calibration during quantization to minimize accuracy degradation. |
| **INT4** (4-bit Integer) | - Even lower memory usage than INT8.<br> - Further speed-up potential for inference. | - Higher risk of accuracy loss compared to INT8.<br> - Calibration during quantization becomes more critical. |
| **Binary** | - Minimal memory footprint (only 1 bit per parameter).<br> - Extremely fast inference due to bitwise operations.<br> - Power efficient. | - Significant accuracy drop for many tasks.<br> - Complex training dynamics due to extreme quantization. |
| **Ternary** | - Low memory usage but slightly more than binary.<br> - Offers a middle ground between representation and efficiency. | - Accuracy might still be lower than higher precision models.<br> - Training dynamics can be complex. |
| **FP32** (Floating Point 32-bit) | Standard precision used in most deep learning frameworks<br> High accuracy due to ample representational capacity<br> Well-suited for training | High memory usage<br> Slower inference times compared to quantized models<br> Higher energy consumption |
| **FP16** (Floating Point 16-bit) | Reduces memory usage compared to FP32<br> Speeds up computations on hardware that supports FP16<br> Often used in mixed-precision training to balance speed and accuracy | Lower representational capacity compared to FP32<br> Risk of numerical instability in some models or layers |
| **INT8** (8-bit Integer) | Significantly reduced memory footprint compared to floating-point representations<br> Faster inference if hardware supports INT8 computations<br> Suitable for many post-training quantization scenarios | Quantization can lead to some accuracy loss<br> Requires careful calibration during quantization to minimize accuracy degradation |
| **INT4** (4-bit Integer) | Even lower memory usage than INT8<br> Further speed-up potential for inference | Higher risk of accuracy loss compared to INT8<br> Calibration during quantization becomes more critical |
| **Binary** | Minimal memory footprint (only 1 bit per parameter)<br> Extremely fast inference due to bitwise operations<br> Power efficient | Significant accuracy drop for many tasks<br> Complex training dynamics due to extreme quantization |
| **Ternary** | Low memory usage but slightly more than binary<br> Offers a middle ground between representation and efficiency | Accuracy might still be lower than higher precision models<br> Training dynamics can be complex |
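
To make these trade-offs concrete, here is a minimal sketch of symmetric post-training INT8 quantization in NumPy (the code and the 4x4 example tensor are illustrative additions, not part of the original files); the roughly 4x memory saving and the small round-trip error mirror the INT8 row above.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: FP32 weights -> INT8 values plus a scale factor."""
    scale = np.max(np.abs(weights)) / 127.0  # largest magnitude maps to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
max_err = np.abs(w - dequantize(q, scale)).max()
print(f"FP32: {w.nbytes} bytes, INT8: {q.nbytes} bytes, max round-trip error: {max_err:.5f}")
```

Real deployments typically add per-channel scales and a calibration pass over representative data to keep this rounding error from degrading accuracy.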

### Efficiency Benefits

8 changes: 4 additions & 4 deletions hw_acceleration.qmd
@@ -444,10 +444,10 @@ While suitable for intermittent inference, sustaining near-peak throughput for t

| Accelerator | Description | Key Advantages | Key Disadvantages |
|-|-|-|-|
| ASICs | Custom ICs designed for target workload like AI inference | - Maximizes perf/watt <br>- Optimized for tensor ops<br>- Low latency on-chip memory | - Fixed architecture lacks flexibility<br>- High NRE cost<br>- Long design cycles |
| FPGAs | Reconfigurable fabric with programmable logic and routing | - Flexible architecture<br>- Low latency memory access | - Lower perf/watt than ASICs<br>- Complex programming |
| GPUs | Originally for graphics, now used for neural network acceleration | - High throughput<br>- Parallel scalability<br>- Software ecosystem with CUDA | - Not as power efficient as ASICs <br>- Require high memory bandwidth |
| CPUs | General purpose processors | - Programmability<br>- Ubiquitous availability | - Lower performance for AI workloads |
| ASICs | Custom ICs designed for target workload like AI inference | Maximizes perf/watt. <br> Optimized for tensor ops<br> Low latency on-chip memory | Fixed architecture lacks flexibility<br> High NRE cost<br> Long design cycles |
| FPGAs | Reconfigurable fabric with programmable logic and routing | Flexible architecture<br> Low latency memory access | Lower perf/watt than ASICs<br> Complex programming |
| GPUs | Originally for graphics, now used for neural network acceleration | High throughput<br> Parallel scalability<br> Software ecosystem with CUDA | Not as power efficient as ASICs. <br> Require high memory bandwidth |
| CPUs | General purpose processors | Programmability<br> Ubiquitous availability | Lower performance for AI workloads |

In general, CPUs provide a readily available baseline, GPUs deliver broadly accessible acceleration, FPGAs offer programmability, and ASICs maximize efficiency for fixed functions. The optimal choice depends on the scale, cost, flexibility and other requirements of the target application.
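
As a rough illustration of the CPU-versus-GPU gap summarized here, the sketch below (an illustrative addition, not part of the original text; it assumes PyTorch is installed and a CUDA device may or may not be present) times a dense matrix multiply on each available backend.

```python
import time
import torch

def time_matmul(device: str, n: int = 2048, repeats: int = 10) -> float:
    """Average wall-clock time for an n x n matrix multiply on the given device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    torch.matmul(a, b)  # warm-up so lazy initialization does not skew the measurement
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    return (time.perf_counter() - start) / repeats

print(f"CPU: {time_matmul('cpu'):.4f} s per matmul")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda'):.4f} s per matmul")
```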

8 changes: 4 additions & 4 deletions ondevice_learning.qmd
@@ -508,7 +508,7 @@ By understanding the potential risks and implementing these defenses, we can hel
| **Method** | Insert malicious examples into training data, often with incorrect labels | Add carefully crafted noise to input data |
| **Example** | Adding images of cats labeled as dogs to a dataset used for training an image classification model | Adding a small amount of noise to an image in a way that causes a face recognition system to misidentify a person |
| **Potential Effects** | Model learns incorrect patterns and makes incorrect predictions | Immediate and potentially dangerous incorrect predictions |
| **Applications Affected** | Any ML model | Autonomous vehicles, security systems, etc. |
| **Applications Affected** | Any ML model | Autonomous vehicles, security systems, etc |
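
A minimal sketch of the adversarial-example column, using the Fast Gradient Sign Method on a hypothetical toy classifier (the model, seed, and epsilon value are illustrative assumptions, not part of the original text):

```python
import torch
import torch.nn.functional as F

# Toy linear classifier standing in for a deployed model (hypothetical, illustration only).
torch.manual_seed(0)
model = torch.nn.Linear(10, 2)
x = torch.randn(1, 10, requires_grad=True)
label = torch.tensor([0])

# Fast Gradient Sign Method: one gradient step on the *input* in the loss-increasing direction.
loss = F.cross_entropy(model(x), label)
loss.backward()
epsilon = 0.25  # perturbation budget; kept small so the change is hard to notice on real inputs
x_adv = (x + epsilon * x.grad.sign()).detach()

print("clean prediction:      ", model(x).argmax(dim=1).item())
print("adversarial prediction:", model(x_adv).argmax(dim=1).item())
# On a trained network such perturbations reliably flip predictions; this toy model may or may not.
```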

### Model Inversion

@@ -626,9 +626,9 @@ Here is a table summarizing the key similarities and differences between the Tin

| Framework | Similarities | Differences |
|-|-|-|
| Tiny Training Engine | - On-device training <br>- Optimize memory & computation <br>- Leverage pruning, sparsity, etc. | - Traces forward & backward graphs <br>- Prunes frozen weights <br>- Interleaves backprop & gradients <br>- Code generation|
| TinyTL | - On-device training <br>- Optimize memory & computation <br>- Leverage freezing, sparsity, etc. | - Freezes most weights <br>- Only adapts biases <br>- Uses residual model |
| TinyTrain | - On-device training <br>- Optimize memory & computation <br>- Leverage sparsity, etc. | - Meta-training in pretraining <br>- Task-adaptive sparse updating <br>- Selective layer updating |
| Tiny Training Engine | On-device training <br> Optimize memory & computation <br> Leverage pruning, sparsity, etc | Traces forward & backward graphs <br> Prunes frozen weights <br> Interleaves backprop & gradients <br> Code generation|
| TinyTL | On-device training <br> Optimize memory & computation <br> Leverage freezing, sparsity, etc | Freezes most weights <br> Only adapts biases <br> Uses residual model |
| TinyTrain | On-device training <br> Optimize memory & computation <br> Leverage sparsity, etc | Meta-training in pretraining <br> Task-adaptive sparse updating <br> Selective layer updating |
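
The TinyTL row ("freezes most weights, only adapts biases") can be sketched in a few lines of PyTorch; the backbone below is a hypothetical stand-in, not TinyTL's actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in backbone; TinyTL itself targets real vision models.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

# Freeze every weight tensor and re-enable gradients only for bias terms,
# mimicking the "freeze weights, adapt biases" memory-saving idea in the table above.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith("bias")

print("trainable:", [n for n, p in model.named_parameters() if p.requires_grad])

# Only the biases (a tiny fraction of the parameters) receive gradient updates.
optimizer = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=1e-2)
```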

## Conclusion

26 changes: 13 additions & 13 deletions optimizations.qmd
@@ -128,13 +128,13 @@ The following compact table provides a concise comparison between structured and

| **Aspect** | **Structured Pruning** | **Unstructured Pruning** |
|------------------------------|------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|
| **Definition** | Pruning entire structures (e.g., neurons, channels, layers) within the network. | Pruning individual weights or neurons, resulting in sparse matrices or non-regular network structures. |
| **Model Regularity** | Maintains a regular, structured network architecture. | Results in irregular, sparse network architectures. |
| **Compression Level** | May offer limited model compression compared to unstructured pruning. | Can achieve higher model compression due to fine-grained pruning. |
| **Computational Efficiency** | Typically more computationally efficient due to maintaining regular structures. | Can be computationally inefficient due to sparse weight matrices, unless specialized hardware/software is used. |
| **Hardware Compatibility** | Generally better compatible with various hardware due to regular structures. | May require hardware that efficiently handles sparse computations to realize benefits. |
| **Implementation Complexity**| Often simpler to implement and manage due to maintaining network structure. | Can be complex to manage and compute due to sparse representations. |
| **Fine-Tuning Complexity** | May require less complex fine-tuning strategies post-pruning. | Might necessitate more complex retraining or fine-tuning strategies post-pruning. |
| **Definition** | Pruning entire structures (e.g., neurons, channels, layers) within the network | Pruning individual weights or neurons, resulting in sparse matrices or non-regular network structures |
| **Model Regularity** | Maintains a regular, structured network architecture | Results in irregular, sparse network architectures |
| **Compression Level** | May offer limited model compression compared to unstructured pruning | Can achieve higher model compression due to fine-grained pruning |
| **Computational Efficiency** | Typically more computationally efficient due to maintaining regular structures | Can be computationally inefficient due to sparse weight matrices, unless specialized hardware/software is used |
| **Hardware Compatibility** | Generally better compatible with various hardware due to regular structures | May require hardware that efficiently handles sparse computations to realize benefits |
| **Implementation Complexity**| Often simpler to implement and manage due to maintaining network structure | Can be complex to manage and compute due to sparse representations |
| **Fine-Tuning Complexity** | May require less complex fine-tuning strategies post-pruning | Might necessitate more complex retraining or fine-tuning strategies post-pruning |

![A visualization showing the differences and examples between unstructured and structured pruning. Observe that unstructured pruning can lead to models that no longer obey high-level structural guarantees of their original unpruned counterparts: the left network is no longer a fully connected network after pruning. Structured pruning, on the other hand, maintains those invariants: in the middle, the fully connected network is pruned in a way that the pruned network is still fully connected; likewise, the CNN maintains its convolutional structure, albeit with fewer filters (@qi_efficient_2021).](images/modeloptimization_pruning_comparison.png)
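
For reference, both pruning styles can be sketched with PyTorch's built-in pruning utilities (the layer shapes and pruning amounts below are illustrative assumptions, not part of the original text):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Unstructured: zero the 30% of individual weights with the smallest L1 magnitude.
conv_u = nn.Conv2d(16, 32, kernel_size=3)
prune.l1_unstructured(conv_u, name="weight", amount=0.3)

# Structured: remove 25% of whole output channels (dim=0), ranked by their L2 norm,
# so the surviving weight tensor stays dense and hardware-friendly.
conv_s = nn.Conv2d(16, 32, kernel_size=3)
prune.ln_structured(conv_s, name="weight", amount=0.25, n=2, dim=0)

sparsity = float((conv_u.weight == 0).sum()) / conv_u.weight.numel()
print(f"unstructured sparsity: {sparsity:.2f}")
```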

Expand Down Expand Up @@ -318,12 +318,12 @@ Precision, delineating the exactness with which a number is represented, bifurca

| **Precision** | **Pros** | **Cons** |
|---------------------------------------|------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------|
| **FP32** (Floating Point 32-bit) | Standard precision used in most deep learning frameworks.<br> High accuracy due to ample representational capacity.<br> Well-suited for training. | High memory usage.<br> Slower inference times compared to quantized models.<br> Higher energy consumption. |
| **FP16** (Floating Point 16-bit) | Reduces memory usage compared to FP32.<br> Speeds up computations on hardware that supports FP16.<br> Often used in mixed-precision training to balance speed and accuracy. | Lower representational capacity compared to FP32.<br> Risk of numerical instability in some models or layers. |
| **INT8** (8-bit Integer) | Significantly reduced memory footprint compared to floating-point representations.<br> Faster inference if hardware supports INT8 computations.<br> Suitable for many post-training quantization scenarios. | Quantization can lead to some accuracy loss.<br> Requires careful calibration during quantization to minimize accuracy degradation. |
| **INT4** (4-bit Integer) | Even lower memory usage than INT8.<br> Further speed-up potential for inference. | Higher risk of accuracy loss compared to INT8.<br> Calibration during quantization becomes more critical. |
| **Binary** | Minimal memory footprint (only 1 bit per parameter).<br> Extremely fast inference due to bitwise operations.<br> Power efficient. | Significant accuracy drop for many tasks.<br> Complex training dynamics due to extreme quantization. |
| **Ternary** | Low memory usage but slightly more than binary.<br> Offers a middle ground between representation and efficiency. | Accuracy might still be lower than higher precision models.<br> Training dynamics can be complex. |
| **FP32** (Floating Point 32-bit) | Standard precision used in most deep learning frameworks<br> High accuracy due to ample representational capacity<br> Well-suited for training | High memory usage<br> Slower inference times compared to quantized models<br> Higher energy consumption. |
| **FP16** (Floating Point 16-bit) | Reduces memory usage compared to FP32<br> Speeds up computations on hardware that supports FP16<br> Often used in mixed-precision training to balance speed and accuracy | Lower representational capacity compared to FP32<br> Risk of numerical instability in some models or layers. |
| **INT8** (8-bit Integer) | Significantly reduced memory footprint compared to floating-point representations<br> Faster inference if hardware supports INT8 computations<br> Suitable for many post-training quantization scenarios | Quantization can lead to some accuracy loss<br> Requires careful calibration during quantization to minimize accuracy degradation |
| **INT4** (4-bit Integer) | Even lower memory usage than INT8<br> Further speed-up potential for inference | Higher risk of accuracy loss compared to INT8<br> Calibration during quantization becomes more critical. |
| **Binary** | Minimal memory footprint (only 1 bit per parameter)<br> Extremely fast inference due to bitwise operations<br> Power efficient | Significant accuracy drop for many tasks<br> Complex training dynamics due to extreme quantization. |
| **Ternary** | Low memory usage but slightly more than binary<br> Offers a middle ground between representation and efficiency | Accuracy might still be lower than higher precision models<br> Training dynamics can be complex. |
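
A quick back-of-the-envelope reading of the memory column, assuming a hypothetical 10M-parameter model:

```python
# Storage at each precision for a hypothetical 10M-parameter model (illustrative numbers).
params = 10_000_000
bits_per_param = {
    "FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4,
    "Binary": 1,
    "Ternary": 2,  # ~1.6 bits of information per value; 2 bits is a common practical encoding
}

for precision, bits in bits_per_param.items():
    print(f"{precision:>7}: {params * bits / 8 / 1e6:8.2f} MB")
```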


#### Numeric Encoding and Storage
