Merge branch 'pruning-support-gpus' into main
profvjreddi committed Dec 9, 2023
2 parents b5459ec + 7b03f43 commit 0a5dc41
Showing 2 changed files with 27 additions and 5 deletions.
11 changes: 6 additions & 5 deletions hw_acceleration.qmd
@@ -359,7 +359,7 @@ The statement "GPUs are less efficient than ASICs" could spark intense debate wi

Typically, GPUs are perceived as less efficient than ASICs because the latter are custom-built for specific tasks and thus can operate more efficiently by design. GPUs, with their general-purpose architecture, are inherently more versatile and programmable, catering to a broad spectrum of computational tasks beyond ML/AI.

However, modern GPUs have evolved to include specialized hardware support for essential AI operations, such as generalized matrix multiplication (GEMM) and other matrix operations, which are critical for running ML models effectively. These enhancements have significantly improved the efficiency of GPUs for AI tasks, to the point where they can rival the performance of ASICs for certain applications.
However, modern GPUs have evolved to include specialized hardware support for essential AI operations, such as generalized matrix multiplication (GEMM) and other matrix operations, as well as native support for quantization and pruning, which are critical for running ML models effectively. These enhancements have significantly improved the efficiency of GPUs for AI tasks, to the point where they can rival the performance of ASICs for certain applications.
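To make the quantization support concrete, below is a minimal C sketch of the symmetric INT8 quantize, integer-multiply, dequantize arithmetic that GEMM units with native INT8 support execute in hardware. Helper names and matrix values are illustrative, not any vendor's API.

```c
#include <stdint.h>
#include <stdio.h>
#include <math.h>

/* Symmetric quantization: map a float tensor onto int8 using one scale. */
static float quantize(const float *x, int8_t *q, int n) {
    float max_abs = 0.0f;
    for (int i = 0; i < n; i++)
        if (fabsf(x[i]) > max_abs) max_abs = fabsf(x[i]);
    float scale = max_abs / 127.0f;            /* int8 range is [-127, 127] */
    for (int i = 0; i < n; i++)
        q[i] = (int8_t)lrintf(x[i] / scale);
    return scale;
}

int main(void) {
    /* Toy 2x3 activation matrix and 3x2 weight matrix (hypothetical values). */
    float a[6] = {0.9f, -0.4f, 0.2f, 0.1f, 0.7f, -0.8f};
    float w[6] = {0.5f, -0.6f, 0.3f, 0.9f, -0.2f, 0.4f};
    int8_t qa[6], qw[6];
    float sa = quantize(a, qa, 6);
    float sw = quantize(w, qw, 6);

    /* INT8 GEMM: multiply in int8, accumulate in int32, dequantize once at
       the end. This integer inner loop is what dedicated matrix units run. */
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 2; j++) {
            int32_t acc = 0;
            for (int k = 0; k < 3; k++)
                acc += (int32_t)qa[i * 3 + k] * (int32_t)qw[k * 2 + j];
            printf("c[%d][%d] = %f\n", i, j, acc * sa * sw);
        }
    }
    return 0;
}
```

The two scales fold the quantization error back into floating point once per output element, which is why integer GEMM units can do the bulk of the work at much lower cost per operation.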

Consequently, some might argue that contemporary GPUs represent a convergence of sorts, incorporating specialized, ASIC-like capabilities within a flexible, general-purpose processing framework. This adaptability has blurred the lines between the two types of hardware, with GPUs offering a strong balance of specialization and programmability that is well-suited to the dynamic needs of ML/AI research and development.

@@ -433,7 +433,7 @@ CPUs lack the specialized architectures for massively parallel processing that G

##### Not Optimized for Data Parallelism

The architectures of CPUs are not specifically optimized for data parallel workloads inherent to AI [@Sze2017-ak]. They allocate substantial silicon area to instruction decoding, speculative execution, caching, and flow control that provide little benefit for the array operations used in neural networks ([AI Inference Acceleration on CPUs](https://www.intel.com/content/www/us/en/developer/articles/technical/ai-inference-acceleration-on-intel-cpus.html#gs.0w9qn2)).
The architectures of CPUs are not specifically optimized for data parallel workloads inherent to AI [@Sze2017-ak]. They allocate substantial silicon area to instruction decoding, speculative execution, caching, and flow control that provide little benefit for the array operations used in neural networks ([AI Inference Acceleration on CPUs](https://www.intel.com/content/www/us/en/developer/articles/technical/ai-inference-acceleration-on-intel-cpus.html#gs.0w9qn2)). However, modern CPUs are equipped with vector instruction sets like [AVX-512](https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/what-is-intel-avx-512.html) specifically to accelerate key operations such as matrix multiplication.
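As a rough sketch of how such vector instructions help, the C fragment below contrasts a scalar dot product with an AVX-512 version that performs 16 fused multiply-adds per instruction. It assumes an AVX-512F-capable CPU and a compiler flag such as `-mavx512f`; the intrinsics are the standard `immintrin.h` ones.

```c
#include <immintrin.h>
#include <stdio.h>

#define N 1024  /* assume N is a multiple of 16 for simplicity */

/* Scalar baseline: one multiply-add per loop iteration. */
float dot_scalar(const float *a, const float *b) {
    float sum = 0.0f;
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];
    return sum;
}

/* AVX-512 version: 16 multiply-adds per instruction in 512-bit registers. */
float dot_avx512(const float *a, const float *b) {
    __m512 acc = _mm512_setzero_ps();
    for (int i = 0; i < N; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        acc = _mm512_fmadd_ps(va, vb, acc);  /* acc += va * vb, elementwise */
    }
    return _mm512_reduce_add_ps(acc);        /* horizontal sum of 16 lanes */
}

int main(void) {
    static float a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 0.5f; }
    printf("scalar: %f  avx512: %f\n", dot_scalar(a, b), dot_avx512(a, b));
    return 0;
}
```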

GPU streaming multiprocessors, for example, devote most transistors to floating point units instead of complex branch prediction logic. This specialization allows much higher utilization for ML math.

@@ -506,8 +506,9 @@ The key goal is tailoring the hardware capabilities to match the algorithms and

The software stack can be optimized to better leverage the underlying hardware capabilities:

* **Model Parallelism:** Parallelize matrix computations like convolution or attention layers to maximize throughput on vector engines.
* **Parallelism:** Parallelize matrix computations like convolution or attention layers to maximize throughput on vector engines.
* **Memory Optimization:** Tune data layouts to improve cache locality based on hardware profiling. This maximizes reuse and minimizes expensive DRAM access.
* **Compression:** Leverage sparsity in the models to reduce storage space and save computation through zero-skipping operations (see the sketch after this list).
* **Custom Operations:** Incorporate specialized ops like low-precision INT4 or bfloat16 into models to capitalize on dedicated hardware support.
* **Dataflow Mapping:** Explicitly map model stages to computational units to optimize data movement on hardware.
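To illustrate the compression point above, here is a minimal C sketch of zero-skipping using a compressed sparse row (CSR) layout; the matrix values are hypothetical. Both storage and multiply-add work scale with the nonzero count rather than the full matrix size.

```c
#include <stdio.h>

/* CSR storage for a 3x4 matrix with only 4 nonzeros (hypothetical values):
 *   [ 5 0 0 1 ]
 *   [ 0 0 0 0 ]
 *   [ 0 2 0 3 ]
 * Zeros are never stored and never multiplied ("zero-skipping").
 */
static const float vals[]    = {5.0f, 1.0f, 2.0f, 3.0f};  /* nonzero values */
static const int   cols[]    = {0, 3, 1, 3};              /* their column indices */
static const int   row_ptr[] = {0, 2, 2, 4};              /* nonzero range per row */

void spmv_csr(int rows, const float *x, float *y) {
    for (int i = 0; i < rows; i++) {
        float acc = 0.0f;
        /* Only iterate over stored nonzeros; zero entries cost nothing. */
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            acc += vals[k] * x[cols[k]];
        y[i] = acc;
    }
}

int main(void) {
    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float y[3];
    spmv_csr(3, x, y);
    printf("y = [%f, %f, %f]\n", y[0], y[1], y[2]);  /* expect [9, 0, 16] */
    return 0;
}
```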

@@ -525,11 +525,11 @@ guide software optimizations, while algorithmic advances inform hardware
specialization. This mutual enhancement provides multiplicative
efficiency gains compared to isolated efforts.

#### Algorithm-Hardare Co-exploration
#### Algorithm-Hardware Co-exploration

Jointly exploring innovations in neural network architectures along with custom hardware design is a powerful co-design technique. This allows finding ideal pairings tailored to each other's strengths [@sze2017efficient].

For instance, the shift to mobile architectures like MobileNets [@howard_mobilenets_2017] was guided by edge device constraints like model size and latency. The quantization [@jacob2018quantization] and pruning techniques [@gale2019state] that unlocked these efficient models became possible thanks to hardware accelerators with native low-precision integer support.
For instance, the shift to mobile architectures like MobileNets [@howard_mobilenets_2017] was guided by edge device constraints like model size and latency. The quantization [@jacob2018quantization] and pruning techniques [@gale2019state] that unlocked these efficient models became possible thanks to hardware accelerators with native support for low-precision integer arithmetic and for structured pruning [@mishrapruning].
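As a concrete illustration of such pruning support, the 2:4 structured-sparsity pattern discussed in [@mishrapruning] keeps at most two nonzero weights in every group of four, a regularity that sparse matrix units can exploit. The C sketch below (illustrative code, not any vendor's API) enforces that pattern by magnitude.

```c
#include <stdio.h>
#include <math.h>

/* Enforce 2:4 structured sparsity: in each group of 4 consecutive weights,
 * zero out the two with the smallest magnitude, keeping the two largest.
 * n is assumed to be a multiple of 4. */
void prune_2_of_4(float *w, int n) {
    for (int g = 0; g < n; g += 4) {
        /* Track indices of the two largest-magnitude weights in the group. */
        int keep0 = g, keep1 = g + 1;
        if (fabsf(w[keep1]) > fabsf(w[keep0])) { int t = keep0; keep0 = keep1; keep1 = t; }
        for (int i = g + 2; i < g + 4; i++) {
            if (fabsf(w[i]) > fabsf(w[keep0]))      { keep1 = keep0; keep0 = i; }
            else if (fabsf(w[i]) > fabsf(w[keep1])) { keep1 = i; }
        }
        for (int i = g; i < g + 4; i++)
            if (i != keep0 && i != keep1) w[i] = 0.0f;
    }
}

int main(void) {
    float w[8] = {0.9f, -0.1f, 0.05f, -0.7f, 0.2f, 0.3f, -0.25f, 0.01f};
    prune_2_of_4(w, 8);
    for (int i = 0; i < 8; i++)
        printf("%.2f ", w[i]);  /* two nonzeros survive in each group of 4 */
    printf("\n");
    return 0;
}
```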

Attention-based models have thrived on massively parallel GPUs and ASICs where their computation maps well spatially, as opposed to RNN architectures reliant on sequential processing. Co-evolution of algorithms and hardware unlocked new capabilities.

21 changes: 21 additions & 0 deletions references.bib
@@ -6818,3 +6818,24 @@ @article{zhuang2020comprehensive
volume = {109},
year = {2021}
}

@article{mishrapruning,
author = {Asit K. Mishra and
Jorge Albericio Latorre and
Jeff Pool and
Darko Stosic and
Dusan Stosic and
Ganesh Venkatesh and
Chong Yu and
Paulius Micikevicius},
title = {Accelerating Sparse Deep Neural Networks},
journal = {CoRR},
volume = {abs/2104.08378},
year = {2021},
url = {https://arxiv.org/abs/2104.08378},
eprinttype = {arXiv},
eprint = {2104.08378},
timestamp = {Mon, 26 Apr 2021 17:25:10 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2104-08378.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
