From 9ae33f445b84ba9c7c8eebbcc2d28b8aaafbf3c9 Mon Sep 17 00:00:00 2001
From: srivatsankrishnan <91.srivatsan@gmail.com>
Date: Sat, 9 Dec 2023 07:41:32 -0800
Subject: [PATCH 1/4] adding references and fixes wrt CPUs and GPUs

---
 hw_acceleration.qmd | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/hw_acceleration.qmd b/hw_acceleration.qmd
index e751bddc..62589dc5 100644
--- a/hw_acceleration.qmd
+++ b/hw_acceleration.qmd
@@ -359,7 +359,7 @@ The statement "GPUs are less efficient than ASICs" could spark intense debate wi

 Typically, GPUs are perceived as less efficient than ASICs because the latter are custom-built for specific tasks and thus can operate more efficiently by design. GPUs, with their general-purpose architecture, are inherently more versatile and programmable, catering to a broad spectrum of computational tasks beyond ML/AI.

-However, modern GPUs, however, have evolved to include specialized hardware support for essential AI operations, such as generalized matrix multiplication (GEMM) and other matrix operations, which are critical for running ML models effectively. These enhancements have significantly improved the efficiency of GPUs for AI tasks, to the point where they can rival the performance of ASICs for certain applications.
+However, modern GPUs have evolved to include specialized hardware support for essential AI operations, such as generalized matrix multiplication (GEMM) and other matrix operations, along with native support for quantization and pruning, all of which are critical for running ML models effectively. These enhancements have significantly improved the efficiency of GPUs for AI tasks, to the point where they can rival the performance of ASICs for certain applications.

 Consequently, some might argue that contemporary GPUs represent a convergence of sorts, incorporating specialized, ASIC-like capabilities within a flexible, general-purpose processing framework. This adaptability has blurred the lines between the two types of hardware, with GPUs offering a strong balance of specialization and programmability that is well-suited to the dynamic needs of ML/AI research and development.

@@ -433,7 +433,7 @@ CPUs lack the specialized architectures for massively parallel processing that G

 ##### Not Optimized for Data Parallelism

-The architectures of CPUs are not specifically optimized for data parallel workloads inherent to AI [@Sze2017-ak]. They allocate substantial silicon area to instruction decoding, speculative execution, caching, and flow control that provide little benefit for the array operations used in neural networks ([AI Inference Acceleration on CPUs](https://www.intel.com/content/www/us/en/developer/articles/technical/ai-inference-acceleration-on-intel-cpus.html#gs.0w9qn2)).
+The architectures of CPUs are not specifically optimized for data parallel workloads inherent to AI [@Sze2017-ak]. They allocate substantial silicon area to instruction decoding, speculative execution, caching, and flow control that provide little benefit for the array operations used in neural networks ([AI Inference Acceleration on CPUs](https://www.intel.com/content/www/us/en/developer/articles/technical/ai-inference-acceleration-on-intel-cpus.html#gs.0w9qn2)). However, modern CPUs come with vector instructions like [AVX-512](https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/what-is-intel-avx-512.html) specifically to accelerate certain key operations like matrix multiplication.

 GPU streaming multiprocessors, for example, devote most transistors to floating point units instead of complex branch prediction logic. This specialization allows much higher utilization for ML math.

@@ -506,8 +506,9 @@ The key goal is tailoring the hardware capabilities to match the algorithms and

 The software stack can be optimized to better leverage the underlying hardware capabilities:

-* **Model Parallelism:** Parallelize matrix computations like convolution or attention layers to maximize throughput on vector engines.
+* **Parallelism:** Parallelize matrix computations like convolution or attention layers to maximize throughput on vector engines.
 * **Memory Optimization:** Tune data layouts to improve cache locality based on hardware profiling. This maximizes reuse and minimizes expensive DRAM access.
+* **Compression:** Leverage sparsity in the models to reduce storage requirements and save computation through zero-skipping operations.
 * **Custom Operations:** Incorporate specialized ops like low precision INT4 or bfloat16 into models to capitalize on dedicated hardware support.
 * **Dataflow Mapping:** Explicitly map model stages to computational units to optimize data movement on hardware.

@@ -1005,4 +1006,4 @@ We also explored the role of software in actively enabling and optimizing AI acc

 But there is so much more to come! Exciting frontiers like analog computing, optical neural networks, and quantum machine learning represent active research directions that could unlock orders of magnitude improvements in efficiency, speed, and scale compared to present paradigms.

-In the end, specialized hardware acceleration remains indispensable for unlocking the performance and efficiency necessary to fulfill the promise of artificial intelligence from cloud to edge. We hope this chapter actively provided useful background and insights into the rapid innovation occurring in this domain.
+In the end, specialized hardware acceleration remains indispensable for unlocking the performance and efficiency necessary to fulfill the promise of artificial intelligence from cloud to edge. We hope this chapter actively provided useful background and insights into the rapid innovation occurring in this domain.
\ No newline at end of file

From b1e8d6b5214df8b3fb33911fa6f093293f4022c6 Mon Sep 17 00:00:00 2001
From: srivatsankrishnan <91.srivatsan@gmail.com>
Date: Sat, 9 Dec 2023 07:46:12 -0800
Subject: [PATCH 2/4] minor fix on CPUs

---
 hw_acceleration.qmd | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw_acceleration.qmd b/hw_acceleration.qmd
index 62589dc5..e11bfbe3 100644
--- a/hw_acceleration.qmd
+++ b/hw_acceleration.qmd
@@ -433,7 +433,7 @@ CPUs lack the specialized architectures for massively parallel processing that G

 ##### Not Optimized for Data Parallelism

-The architectures of CPUs are not specifically optimized for data parallel workloads inherent to AI [@Sze2017-ak]. They allocate substantial silicon area to instruction decoding, speculative execution, caching, and flow control that provide little benefit for the array operations used in neural networks ([AI Inference Acceleration on CPUs](https://www.intel.com/content/www/us/en/developer/articles/technical/ai-inference-acceleration-on-intel-cpus.html#gs.0w9qn2)). However, modern CPUs come with vector instructions like [AVX-512](https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/what-is-intel-avx-512.html) specifically to accelerate certain key operations like matrix multiplication.
+The architectures of CPUs are not specifically optimized for data parallel workloads inherent to AI [@Sze2017-ak]. They allocate substantial silicon area to instruction decoding, speculative execution, caching, and flow control that provide little benefit for the array operations used in neural networks ([AI Inference Acceleration on CPUs](https://www.intel.com/content/www/us/en/developer/articles/technical/ai-inference-acceleration-on-intel-cpus.html#gs.0w9qn2)). However, modern CPUs are equipped with vector instructions like [AVX-512](https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/what-is-intel-avx-512.html) specifically to accelerate certain key operations like matrix multiplication.

 GPU streaming multiprocessors, for example, devote most transistors to floating point units instead of complex branch prediction logic. This specialization allows much higher utilization for ML math.

@@ -1006,4 +1006,4 @@ We also explored the role of software in actively enabling and optimizing AI acc

 But there is so much more to come! Exciting frontiers like analog computing, optical neural networks, and quantum machine learning represent active research directions that could unlock orders of magnitude improvements in efficiency, speed, and scale compared to present paradigms.

-In the end, specialized hardware acceleration remains indispensable for unlocking the performance and efficiency necessary to fulfill the promise of artificial intelligence from cloud to edge. We hope this chapter actively provided useful background and insights into the rapid innovation occurring in this domain.
\ No newline at end of file
+In the end, specialized hardware acceleration remains indispensable for unlocking the performance and efficiency necessary to fulfill the promise of artificial intelligence from cloud to edge. We hope this chapter actively provided useful background and insights into the rapid innovation occurring in this domain.

From 4c1a3ae631f38d14cd0c1d30737cc07e16f88ff6 Mon Sep 17 00:00:00 2001
From: srivatsankrishnan <91.srivatsan@gmail.com>
Date: Sat, 9 Dec 2023 08:01:53 -0800
Subject: [PATCH 3/4] adding references for 2:4 pruning

---
 hw_acceleration.qmd |  4 ++--
 references.bib      | 21 +++++++++++++++++++++
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/hw_acceleration.qmd b/hw_acceleration.qmd
index e11bfbe3..e47b7451 100644
--- a/hw_acceleration.qmd
+++ b/hw_acceleration.qmd
@@ -526,11 +526,11 @@ guide software optimizations, while algorithmic advances inform hardware
 specialization. This mutual enhancement provides multiplicative
 efficiency gains compared to isolated efforts.

-#### Algorithm-Hardare Co-exploration
+#### Algorithm-Hardware Co-exploration

 Jointly exploring innovations in neural network architectures along with custom hardware design is a powerful co-design technique. This allows finding ideal pairings tailored to each other's strengths [@sze2017efficient].

-For instance, the shift to mobile architectures like MobileNets [@howard_mobilenets_2017] was guided by edge device constraints like model size and latency. The quantization [@jacob2018quantization] and pruning techniques [@gale2019state] that unlocked these efficient models became possible thanks to hardware accelerators with native low-precision integer support.
+For instance, the shift to mobile architectures like MobileNets [@howard_mobilenets_2017] was guided by edge device constraints like model size and latency. The quantization [@jacob2018quantization] and pruning techniques [@gale2019state] that unlocked these efficient models became possible thanks to hardware accelerators with native low-precision integer support and pruning support[@mishrapruning].

 Attention-based models have thrived on massively parallel GPUs and ASICs where their computation maps well spatially, as opposed to RNN architectures reliant on sequential processing. Co-evolution of algorithms and hardware unlocked new capabilities.

diff --git a/references.bib b/references.bib
index 345ef071..edc35982 100644
--- a/references.bib
+++ b/references.bib
@@ -4669,3 +4669,24 @@ @article{zhuang2020comprehensive
 number = 1,
 pages = {43--76}
 }
+
+@article{mishrapruning,
+  author = {Asit K. Mishra and
+            Jorge Albericio Latorre and
+            Jeff Pool and
+            Darko Stosic and
+            Dusan Stosic and
+            Ganesh Venkatesh and
+            Chong Yu and
+            Paulius Micikevicius},
+  title = {Accelerating Sparse Deep Neural Networks},
+  journal = {CoRR},
+  volume = {abs/2104.08378},
+  year = {2021},
+  url = {https://arxiv.org/abs/2104.08378},
+  eprinttype = {arXiv},
+  eprint = {2104.08378},
+  timestamp = {Mon, 26 Apr 2021 17:25:10 +0200},
+  biburl = {https://dblp.org/rec/journals/corr/abs-2104-08378.bib},
+  bibsource = {dblp computer science bibliography, https://dblp.org}
+}

From 7b03f432f794b3789e56dcb89a9150b57f97484a Mon Sep 17 00:00:00 2001
From: Vijay Janapa Reddi
Date: Sat, 9 Dec 2023 14:41:07 -0500
Subject: [PATCH 4/4] Minor spacing fix before reference

---
 hw_acceleration.qmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw_acceleration.qmd b/hw_acceleration.qmd
index e47b7451..302284e8 100644
--- a/hw_acceleration.qmd
+++ b/hw_acceleration.qmd
@@ -530,7 +530,7 @@ efficiency gains compared to isolated efforts.

 Jointly exploring innovations in neural network architectures along with custom hardware design is a powerful co-design technique. This allows finding ideal pairings tailored to each other's strengths [@sze2017efficient].

-For instance, the shift to mobile architectures like MobileNets [@howard_mobilenets_2017] was guided by edge device constraints like model size and latency. The quantization [@jacob2018quantization] and pruning techniques [@gale2019state] that unlocked these efficient models became possible thanks to hardware accelerators with native low-precision integer support and pruning support[@mishrapruning].
+For instance, the shift to mobile architectures like MobileNets [@howard_mobilenets_2017] was guided by edge device constraints like model size and latency. The quantization [@jacob2018quantization] and pruning techniques [@gale2019state] that unlocked these efficient models became possible thanks to hardware accelerators with native low-precision integer support and pruning support [@mishrapruning].

 Attention-based models have thrived on massively parallel GPUs and ASICs where their computation maps well spatially, as opposed to RNN architectures reliant on sequential processing. Co-evolution of algorithms and hardware unlocked new capabilities.
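
As a brief illustration of the techniques these patches reference, the sketch below shows the 2:4 structured sparsity pattern described in [@mishrapruning] together with the zero-skipping idea from the new **Compression** bullet. It is not part of any patch: a minimal NumPy example in which the `prune_2_of_4` helper, the shapes, and the random weights are all hypothetical, pruning the two smallest-magnitude weights in every group of four and then skipping the resulting zeros in a matrix-vector product.

```python
# Illustrative sketch only (not part of the patches): 2:4 structured sparsity
# via magnitude pruning, plus explicit zero-skipping, in plain NumPy.
import numpy as np

def prune_2_of_4(w):
    """Zero the 2 smallest-magnitude weights in every group of 4 along the last axis."""
    w = w.copy()
    groups = w.reshape(-1, 4)
    # indices of the two smallest |weight| entries in each group of four
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(w.shape)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16)).astype(np.float32)  # dense layer weights
x = rng.standard_normal(16).astype(np.float32)       # input activations

W_24 = prune_2_of_4(W)                                # two zeros per group of four
print("sparsity:", float(np.mean(W_24 == 0)))         # ~0.5

# Zero-skipping: only the nonzero weights (and their indices) need to be
# stored and multiplied; hardware with 2:4 sparse support does this natively.
y_full = W_24 @ x
y_skip = np.array([row[row != 0] @ x[row != 0] for row in W_24])
print("results match:", np.allclose(y_full, y_skip))
```

The 50% structured pattern is what sparse tensor cores exploit: because exactly two of every four weights are zero, the nonzero values and a small index map suffice, roughly halving weight storage and the multiply-accumulate work for that layer.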