Pruning support gpus #103

Merged
11 changes: 6 additions & 5 deletions hw_acceleration.qmd
@@ -359,7 +359,7 @@ The statement "GPUs are less efficient than ASICs" could spark intense debate wi

Typically, GPUs are perceived as less efficient than ASICs because the latter are custom-built for specific tasks and thus can operate more efficiently by design. GPUs, with their general-purpose architecture, are inherently more versatile and programmable, catering to a broad spectrum of computational tasks beyond ML/AI.

Modern GPUs, however, have evolved to include specialized hardware support for essential AI operations, such as generalized matrix multiplication (GEMM) and other matrix operations, which are critical for running ML models effectively. These enhancements have significantly improved the efficiency of GPUs for AI tasks, to the point where they can rival the performance of ASICs for certain applications.
Modern GPUs, however, have evolved to include specialized hardware support for essential AI operations, such as generalized matrix multiplication (GEMM) and other matrix operations, as well as native support for quantization and pruning, which are critical for running ML models effectively. These enhancements have significantly improved the efficiency of GPUs for AI tasks, to the point where they can rival the performance of ASICs for certain applications.
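
As a rough illustration of this hardware support (a sketch assuming PyTorch and a CUDA GPU with Tensor Cores; the exact dispatch is device- and library-dependent), the same GEMM can be routed to the specialized matrix units simply by casting to a lower-precision dtype:

```python
import torch

# Minimal sketch: the same GEMM in FP32 and FP16. On GPUs with Tensor
# Cores, the half-precision matmul is eligible for the specialized
# matrix-multiply units; actual speedups depend on the device and shapes.
device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)

c_fp32 = a @ b  # general-purpose FP32 GEMM

if device == "cuda":
    c_fp16 = (a.half() @ b.half()).float()  # low-precision, Tensor-Core-eligible GEMM
    # The two results agree up to reduced-precision rounding error.
    print((c_fp32 - c_fp16).abs().max())
```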

Consequently, some might argue that contemporary GPUs represent a convergence of sorts, incorporating specialized, ASIC-like capabilities within a flexible, general-purpose processing framework. This adaptability has blurred the lines between the two types of hardware, with GPUs offering a strong balance of specialization and programmability that is well-suited to the dynamic needs of ML/AI research and development.

@@ -433,7 +433,7 @@ CPUs lack the specialized architectures for massively parallel processing that G

##### Not Optimized for Data Parallelism

The architectures of CPUs are not specifically optimized for data parallel workloads inherent to AI [@Sze2017-ak]. They allocate substantial silicon area to instruction decoding, speculative execution, caching, and flow control that provide little benefit for the array operations used in neural networks ([AI Inference Acceleration on CPUs](https://www.intel.com/content/www/us/en/developer/articles/technical/ai-inference-acceleration-on-intel-cpus.html#gs.0w9qn2)).
The architectures of CPUs are not specifically optimized for data parallel workloads inherent to AI [@Sze2017-ak]. They allocate substantial silicon area to instruction decoding, speculative execution, caching, and flow control that provide little benefit for the array operations used in neural networks ([AI Inference Acceleration on CPUs](https://www.intel.com/content/www/us/en/developer/articles/technical/ai-inference-acceleration-on-intel-cpus.html#gs.0w9qn2)). However, modern CPUs are equipped with vector instructions like [AVX-512](https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/what-is-intel-avx-512.html) specifically designed to accelerate key operations such as matrix multiplication.
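
For instance, here is a hedged sketch (assuming NumPy linked against an AVX-aware BLAS such as OpenBLAS or MKL) of how CPU matrix multiplication already benefits from these vector instructions without writing any intrinsics by hand:

```python
import platform
import numpy as np

# NumPy's matmul dispatches to the BLAS it was built against (e.g.,
# OpenBLAS or MKL), which uses vector extensions such as AVX2/AVX-512
# when the CPU advertises them.
a = np.random.rand(2048, 2048).astype(np.float32)
b = np.random.rand(2048, 2048).astype(np.float32)
c = a @ b  # vectorized GEMM on the CPU

# On Linux, the CPU's advertised vector extensions can be checked directly.
if platform.system() == "Linux":
    flags = open("/proc/cpuinfo").read()
    print("avx512f reported:", "avx512f" in flags)
```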

GPU streaming multiprocessors, for example, devote most transistors to floating point units instead of complex branch prediction logic. This specialization allows much higher utilization for ML math.

@@ -506,8 +506,9 @@ The key goal is tailoring the hardware capabilities to match the algorithms and

The software stack can be optimized to better leverage the underlying hardware capabilities:

* **Model Parallelism:** Parallelize matrix computations like convolution or attention layers to maximize throughput on vector engines.
* **Parallelism:** Parallelize matrix computations like convolution or attention layers to maximize throughput on vector engines.
* **Memory Optimization:** Tune data layouts to improve cache locality based on hardware profiling. This maximizes reuse and minimizes expensive DRAM access.
* **Compression:** Leverage sparsity in the models to reduce storage space and save computation by skipping zero-valued operations (see the sketch after this list).
* **Custom Operations:** Incorporate specialized ops like low-precision INT4 or bfloat16 into models to capitalize on dedicated hardware support.
* **Dataflow Mapping:** Explicitly map model stages to computational units to optimize data movement on hardware.
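
As a rough sketch of the compression point above (assuming SciPy is available; the magnitude-pruning threshold here is only illustrative), storing a pruned weight matrix in a sparse format both shrinks storage and skips the zero entries during the multiply:

```python
import numpy as np
from scipy import sparse

# Prune a dense weight matrix by zeroing small-magnitude entries, then
# store it in CSR format so the zeros cost no memory and no multiplies.
rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float32)
w[np.abs(w) < 1.5] = 0.0                 # magnitude pruning (~87% zeros)

w_sparse = sparse.csr_matrix(w)          # compressed sparse row storage
x = rng.standard_normal((1024, 64)).astype(np.float32)

y_dense = w @ x                          # dense multiply touches every zero
y_sparse = w_sparse @ x                  # sparse multiply skips stored zeros

print("density:", w_sparse.nnz / w.size)
print("max abs difference:", np.abs(y_dense - y_sparse).max())
```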

@@ -525,11 +526,11 @@ guide software optimizations, while algorithmic advances inform hardware
specialization. This mutual enhancement provides multiplicative
efficiency gains compared to isolated efforts.

#### Algorithm-Hardare Co-exploration
#### Algorithm-Hardware Co-exploration

Jointly exploring innovations in neural network architectures along with custom hardware design is a powerful co-design technique. This allows finding ideal pairings tailored to each other's strengths [@sze2017efficient].

For instance, the shift to mobile architectures like MobileNets [@howard_mobilenets_2017] was guided by edge device constraints like model size and latency. The quantization [@jacob2018quantization] and pruning techniques [@gale2019state] that unlocked these efficient models became possible thanks to hardware accelerators with native low-precision integer support.
For instance, the shift to mobile architectures like MobileNets [@howard_mobilenets_2017] was guided by edge device constraints like model size and latency. The quantization [@jacob2018quantization] and pruning techniques [@gale2019state] that unlocked these efficient models became possible thanks to hardware accelerators with native support for low-precision integer arithmetic and for pruning [@mishrapruning].
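
As a rough illustration of the kind of structured pruning such hardware targets (a NumPy sketch of the 2:4 pattern, i.e. keeping the two largest-magnitude weights in every group of four; this only reproduces the sparsity pattern, not the accelerator's execution):

```python
import numpy as np

def prune_2_to_4(weights: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude values in every contiguous group of 4."""
    w = weights.reshape(-1, 4)                       # groups of four weights
    drop = np.argsort(np.abs(w), axis=1)[:, :2]      # two smallest per group
    pruned = w.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)
w_pruned = prune_2_to_4(w)
print("fraction of zeros:", float(np.mean(w_pruned == 0.0)))  # 0.5 by construction
```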

Attention-based models have thrived on massively parallel GPUs and ASICs where their computation maps well spatially, as opposed to RNN architectures reliant on sequential processing. Co-evolution of algorithms and hardware unlocked new capabilities.

21 changes: 21 additions & 0 deletions references.bib
@@ -4669,3 +4669,24 @@ @article{zhuang2020comprehensive
number = 1,
pages = {43--76}
}

@article{mishrapruning,
author = {Asit K. Mishra and
Jorge Albericio Latorre and
Jeff Pool and
Darko Stosic and
Dusan Stosic and
Ganesh Venkatesh and
Chong Yu and
Paulius Micikevicius},
title = {Accelerating Sparse Deep Neural Networks},
journal = {CoRR},
volume = {abs/2104.08378},
year = {2021},
url = {https://arxiv.org/abs/2104.08378},
eprinttype = {arXiv},
eprint = {2104.08378},
timestamp = {Mon, 26 Apr 2021 17:25:10 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2104-08378.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}