Merge branch 'pruning-support-gpus' into main
profvjreddi committed Dec 9, 2023
2 parents b5459ec + 7b03f43 commit 0a5dc41
Showing 2 changed files with 27 additions and 5 deletions.
11 changes: 6 additions & 5 deletions hw_acceleration.qmd
@@ -359,7 +359,7 @@ The statement "GPUs are less efficient than ASICs" could spark intense debate wi

Typically, GPUs are perceived as less efficient than ASICs because the latter are custom-built for specific tasks and thus can operate more efficiently by design. GPUs, with their general-purpose architecture, are inherently more versatile and programmable, catering to a broad spectrum of computational tasks beyond ML/AI.

However, modern GPUs have evolved to include specialized hardware support for essential AI operations, such as generalized matrix multiplication (GEMM) and other matrix operations, which are critical for running ML models effectively. These enhancements have significantly improved the efficiency of GPUs for AI tasks, to the point where they can rival the performance of ASICs for certain applications.
However, modern GPUs have evolved to include specialized hardware support for essential AI operations, such as generalized matrix multiplication (GEMM) and other matrix operations, as well as native support for quantization and pruning, which are critical for running ML models effectively. These enhancements have significantly improved the efficiency of GPUs for AI tasks, to the point where they can rival the performance of ASICs for certain applications.
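To make the quantization support concrete, below is a minimal C sketch of the symmetric INT8 quantize, integer-multiply, dequantize arithmetic that GEMM units with native INT8 support execute in hardware. Helper names and matrix values are illustrative, not any vendor's API.

```c
#include <stdint.h>
#include <stdio.h>
#include <math.h>

/* Symmetric quantization: map a float tensor onto int8 using one scale. */
static float quantize(const float *x, int8_t *q, int n) {
    float max_abs = 0.0f;
    for (int i = 0; i < n; i++)
        if (fabsf(x[i]) > max_abs) max_abs = fabsf(x[i]);
    float scale = max_abs / 127.0f;            /* int8 range is [-127, 127] */
    for (int i = 0; i < n; i++)
        q[i] = (int8_t)lrintf(x[i] / scale);
    return scale;
}

int main(void) {
    /* Toy 2x3 activation matrix and 3x2 weight matrix (hypothetical values). */
    float a[6] = {0.9f, -0.4f, 0.2f, 0.1f, 0.7f, -0.8f};
    float w[6] = {0.5f, -0.6f, 0.3f, 0.9f, -0.2f, 0.4f};
    int8_t qa[6], qw[6];
    float sa = quantize(a, qa, 6);
    float sw = quantize(w, qw, 6);

    /* INT8 GEMM: multiply in int8, accumulate in int32, dequantize once at
       the end. This integer inner loop is what dedicated matrix units run. */
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 2; j++) {
            int32_t acc = 0;
            for (int k = 0; k < 3; k++)
                acc += (int32_t)qa[i * 3 + k] * (int32_t)qw[k * 2 + j];
            printf("c[%d][%d] = %f\n", i, j, acc * sa * sw);
        }
    }
    return 0;
}
```

The two scales fold the quantization error back into floating point once per output element, which is why integer GEMM units can do the bulk of the work at much lower cost per operation.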

Consequently, some might argue that contemporary GPUs represent a convergence of sorts, incorporating specialized, ASIC-like capabilities within a flexible, general-purpose processing framework. This adaptability has blurred the lines between the two types of hardware, with GPUs offering a strong balance of specialization and programmability that is well-suited to the dynamic needs of ML/AI research and development.

@@ -433,7 +433,7 @@ CPUs lack the specialized architectures for massively parallel processing that G

##### Not Optimized for Data Parallelism

The architectures of CPUs are not specifically optimized for data parallel workloads inherent to AI [@Sze2017-ak]. They allocate substantial silicon area to instruction decoding, speculative execution, caching, and flow control that provide little benefit for the array operations used in neural networks ([AI Inference Acceleration on CPUs](https://www.intel.com/content/www/us/en/developer/articles/technical/ai-inference-acceleration-on-intel-cpus.html#gs.0w9qn2)).
The architectures of CPUs are not specifically optimized for data parallel workloads inherent to AI [@Sze2017-ak]. They allocate substantial silicon area to instruction decoding, speculative execution, caching, and flow control that provide little benefit for the array operations used in neural networks ([AI Inference Acceleration on CPUs](https://www.intel.com/content/www/us/en/developer/articles/technical/ai-inference-acceleration-on-intel-cpus.html#gs.0w9qn2)). However, modern CPUs are equipped with vector instruction sets like [AVX-512](https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/what-is-intel-avx-512.html) specifically to accelerate key operations such as matrix multiplication.
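As a rough sketch of how such vector instructions help, the C fragment below contrasts a scalar dot product with an AVX-512 version that performs 16 fused multiply-adds per instruction. It assumes an AVX-512F-capable CPU and a compiler flag such as `-mavx512f`; the intrinsics are the standard `immintrin.h` ones.

```c
#include <immintrin.h>
#include <stdio.h>

#define N 1024  /* assume N is a multiple of 16 for simplicity */

/* Scalar baseline: one multiply-add per loop iteration. */
float dot_scalar(const float *a, const float *b) {
    float sum = 0.0f;
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];
    return sum;
}

/* AVX-512 version: 16 multiply-adds per instruction in 512-bit registers. */
float dot_avx512(const float *a, const float *b) {
    __m512 acc = _mm512_setzero_ps();
    for (int i = 0; i < N; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        acc = _mm512_fmadd_ps(va, vb, acc);  /* acc += va * vb, elementwise */
    }
    return _mm512_reduce_add_ps(acc);        /* horizontal sum of 16 lanes */
}

int main(void) {
    static float a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 0.5f; }
    printf("scalar: %f  avx512: %f\n", dot_scalar(a, b), dot_avx512(a, b));
    return 0;
}
```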

GPU streaming multiprocessors, for example, devote most transistors to floating point units instead of complex branch prediction logic. This specialization allows much higher utilization for ML math.

@@ -506,8 +506,9 @@ The key goal is tailoring the hardware capabilities to match the algorithms and

The software stack can be optimized to better leverage the underlying hardware capabilities:

* **Model Parallelism:** Parallelize matrix computations like convolution or attention layers to maximize throughput on vector engines.
* **Parallelism:** Parallelize matrix computations like convolution or attention layers to maximize throughput on vector engines.
* **Memory Optimization:** Tune data layouts to improve cache locality based on hardware profiling. This maximizes reuse and minimizes expensive DRAM access.
* **Compression:** Leverage sparsity in the models to reduce storage space and save computation through zero-skipping operations (see the sketch after this list).
* **Custom Operations:** Incorporate specialized ops like low-precision INT4 or bfloat16 into models to capitalize on dedicated hardware support.
* **Dataflow Mapping:** Explicitly map model stages to computational units to optimize data movement on hardware.
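To illustrate the compression point above, here is a minimal C sketch of zero-skipping using a compressed sparse row (CSR) layout; the matrix values are hypothetical. Both storage and multiply-add work scale with the nonzero count rather than the full matrix size.

```c
#include <stdio.h>

/* CSR storage for a 3x4 matrix with only 4 nonzeros (hypothetical values):
 *   [ 5 0 0 1 ]
 *   [ 0 0 0 0 ]
 *   [ 0 2 0 3 ]
 * Zeros are never stored and never multiplied ("zero-skipping").
 */
static const float vals[]    = {5.0f, 1.0f, 2.0f, 3.0f};  /* nonzero values */
static const int   cols[]    = {0, 3, 1, 3};              /* their column indices */
static const int   row_ptr[] = {0, 2, 2, 4};              /* nonzero range per row */

void spmv_csr(int rows, const float *x, float *y) {
    for (int i = 0; i < rows; i++) {
        float acc = 0.0f;
        /* Only iterate over stored nonzeros; zero entries cost nothing. */
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            acc += vals[k] * x[cols[k]];
        y[i] = acc;
    }
}

int main(void) {
    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float y[3];
    spmv_csr(3, x, y);
    printf("y = [%f, %f, %f]\n", y[0], y[1], y[2]);  /* expect [9, 0, 16] */
    return 0;
}
```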

@@ -525,11 +525,11 @@ guide software optimizations, while algorithmic advances inform hardware
specialization. This mutual enhancement provides multiplicative
efficiency gains compared to isolated efforts.

#### Algorithm-Hardare Co-exploration
#### Algorithm-Hardware Co-exploration

Jointly exploring innovations in neural network architectures along with custom hardware design is a powerful co-design technique. This allows finding ideal pairings tailored to each other's strengths [@sze2017efficient].

For instance, the shift to mobile architectures like MobileNets [@howard_mobilenets_2017] was guided by edge device constraints like model size and latency. The quantization [@jacob2018quantization] and pruning techniques [@gale2019state] that unlocked these efficient models became possible thanks to hardware accelerators with native low-precision integer support.
For instance, the shift to mobile architectures like MobileNets [@howard_mobilenets_2017] was guided by edge device constraints like model size and latency. The quantization [@jacob2018quantization] and pruning techniques [@gale2019state] that unlocked these efficient models became possible thanks to hardware accelerators with native support for low-precision integer arithmetic and for structured pruning [@mishrapruning].
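As a concrete illustration of such pruning support, the 2:4 structured-sparsity pattern discussed in [@mishrapruning] keeps at most two nonzero weights in every group of four, a regularity that sparse matrix units can exploit. The C sketch below (illustrative code, not any vendor's API) enforces that pattern by magnitude.

```c
#include <stdio.h>
#include <math.h>

/* Enforce 2:4 structured sparsity: in each group of 4 consecutive weights,
 * zero out the two with the smallest magnitude, keeping the two largest.
 * n is assumed to be a multiple of 4. */
void prune_2_of_4(float *w, int n) {
    for (int g = 0; g < n; g += 4) {
        /* Track indices of the two largest-magnitude weights in the group. */
        int keep0 = g, keep1 = g + 1;
        if (fabsf(w[keep1]) > fabsf(w[keep0])) { int t = keep0; keep0 = keep1; keep1 = t; }
        for (int i = g + 2; i < g + 4; i++) {
            if (fabsf(w[i]) > fabsf(w[keep0]))      { keep1 = keep0; keep0 = i; }
            else if (fabsf(w[i]) > fabsf(w[keep1])) { keep1 = i; }
        }
        for (int i = g; i < g + 4; i++)
            if (i != keep0 && i != keep1) w[i] = 0.0f;
    }
}

int main(void) {
    float w[8] = {0.9f, -0.1f, 0.05f, -0.7f, 0.2f, 0.3f, -0.25f, 0.01f};
    prune_2_of_4(w, 8);
    for (int i = 0; i < 8; i++)
        printf("%.2f ", w[i]);  /* two nonzeros survive in each group of 4 */
    printf("\n");
    return 0;
}
```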

Attention-based models have thrived on massively parallel GPUs and ASICs where their computation maps well spatially, as opposed to RNN architectures reliant on sequential processing. Co-evolution of algorithms and hardware unlocked new capabilities.

21 changes: 21 additions & 0 deletions references.bib
@@ -6818,3 +6818,24 @@ @article{zhuang2020comprehensive
volume = {109},
year = {2021}
}

@article{mishrapruning,
author = {Asit K. Mishra and
Jorge Albericio Latorre and
Jeff Pool and
Darko Stosic and
Dusan Stosic and
Ganesh Venkatesh and
Chong Yu and
Paulius Micikevicius},
title = {Accelerating Sparse Deep Neural Networks},
journal = {CoRR},
volume = {abs/2104.08378},
year = {2021},
url = {https://arxiv.org/abs/2104.08378},
eprinttype = {arXiv},
eprint = {2104.08378},
timestamp = {Mon, 26 Apr 2021 17:25:10 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2104-08378.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
