diff --git a/references.html b/references.html
index 0584fd80..6f149ddc 100644
--- a/references.html
+++ b/references.html
@@ -1086,6 +1086,13 @@ References
“Self-Supervised Learning in Medicine and Healthcare.”
Nature Biomedical Engineering 6 (12): 1346–52. https://doi.org/10.1038/s41551-022-00914-1.
+
+Krishnan, Srivatsan, Amir Yazdanbakhsh, Shvetank Prakash, Jason Jabbour,
+Ikechukwu Uchendu, Susobhan Ghosh, Behzad Boroujerdian, et al. 2023.
+“ArchGym: An Open-Source Gymnasium for Machine Learning Assisted
+Architecture Design.” In Proceedings of the 50th Annual
+International Symposium on Computer Architecture, 1–16.
+
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012.
“Imagenet Classification with Deep Convolutional Neural
diff --git a/search.json b/search.json
index e2d1cae2..6eb87ee0 100644
--- a/search.json
+++ b/search.json
@@ -25,7 +25,7 @@
"href": "contributors.html",
"title": "Contributors",
"section": "",
- "text": "We extend our sincere thanks to the diverse group of individuals who have generously contributed their expertise, insights, and time to enhance both the content and codebase of this project. Below you will find a list of all contributors. If you would like to contribute to this project, please see our GitHub page.\n\n\n\n\n\n\n\n\n\n\nJessica Quaye\n\n\nMarcelo Rovai\n\n\nhappyappledog\n\n\nJared Ni\n\n\nishapira\n\n\n\n\nShvetank Prakash\n\n\nIkechukwu Uchendu\n\n\nHenry Bae\n\n\nPong Trairatvorakul\n\n\naptl26\n\n\n\n\nnaeemkh\n\n\nalxrod\n\n\nColby Banbury\n\n\nJayson Lin\n\n\nJennifer Zhou\n\n\n\n\nsjohri20\n\n\nEric D\n\n\nVijay Janapa Reddi\n\n\nEmil Njor\n\n\nMark Mazumder\n\n\n\n\noishib\n\n\nJeffrey Ma\n\n\nAditiR_42\n\n\nMichael Schnebly\n\n\nMatthew Stewart\n\n\n\n\narnaumarin\n\n\nDivya\n\n\nMarco Zennaro\n\n\nsophiacho1"
+ "text": "We extend our sincere thanks to the diverse group of individuals who have generously contributed their expertise, insights, and time to enhance both the content and codebase of this project. Below you will find a list of all contributors. If you would like to contribute to this project, please see our GitHub page.\n\n\n\n\n\n\n\n\n\n\nJennifer Zhou\n\n\nHenry Bae\n\n\nColby Banbury\n\n\nsjohri20\n\n\nJessica Quaye\n\n\n\n\nMarcelo Rovai\n\n\nsophiacho1\n\n\narnaumarin\n\n\nAditiR_42\n\n\nMark Mazumder\n\n\n\n\nMatthew Stewart\n\n\nnaeemkh\n\n\nIkechukwu Uchendu\n\n\nMichael Schnebly\n\n\nalxrod\n\n\n\n\nJeffrey Ma\n\n\nPong Trairatvorakul\n\n\nJared Ni\n\n\nhappyappledog\n\n\noishib\n\n\n\n\nEric D\n\n\nJayson Lin\n\n\naptl26\n\n\nishapira\n\n\nShvetank Prakash\n\n\n\n\nDivya\n\n\nVijay Janapa Reddi\n\n\nMarco Zennaro\n\n\nEmil Njor"
},
{
"objectID": "copyright.html",
@@ -627,7 +627,7 @@
"href": "hw_acceleration.html#sec-aihw",
"title": "11 AI Acceleration",
"section": "11.3 Accelerator Types",
- "text": "11.3 Accelerator Types\nHardware accelerators can take on many forms. They can exist as a widget (like the Neural Engine in the Apple M1 chip) or as entire chips specially designed to perform certain tasks very well. In this section, we will examine processors for machine learning workloads along the spectrum from highly specialized ASICs to more general-purpose CPUs. We first focus on custom hardware purpose-built for AI to understand the most extreme optimizations possible when design constraints are removed. This establishes a ceiling for performance and efficiency.\nWe then progressively consider more programmable and adaptable architectures with discussions of GPUs and FPGAs. These make tradeoffs in customization to maintain flexibility. Finally, we cover general-purpose CPUs which sacrifice optimizations for a particular workload in exchange for versatile programmability across applications.\nBy structuring the analysis along this spectrum, we aim to illustrate the fundamental tradeoffs in accelerator design between utilization, efficiency, programmability, and flexibility. The optimal balance point depends on the constraints and requirements of the target application. This spectrum perspective provides a framework for reasoning about hardware choices for machine learning and the capabilities required at each level of specialization.\nThe progression begins with the most specialized option, ASICs purpose-built for AI, to ground our understanding in the maximum possible optimizations before expanding to more generalizable architectures. This structured approach aims to elucidate the accelerator design space.\n This graph illustrates the complex interplay between flexibility, performance, functional diversity, and area for various types of hardware processors. As the area of a processor increases, so does its potential for functional diversity and flexibility. However, a design could dedicate that additional area to target application specific tasks. Performance will always depend on how and how effectively that area is utilized. Optimal design always requires balancing these factors according to a hierarchy of application requirements.\n\n11.3.1 Application-Specific Integrated Circuits (ASICs)\nAn Application-Specific Integrated Circuit (ASIC) is a type of integrated circuit (IC) that is custom-designed for a specific application or workload, rather than for general-purpose use. Unlike CPUs and GPUs, ASICs do not support multiple applications or workloads. Rather, they are optimized to perform a single task extremely efficiently. Apple’s M1/2/3, AMD’s Neoverse, Intel’s i5/7/9, Google’s TPUs, and NVIDIA’s GPUs are all examples of ASICs.\nASICs achieve this efficiency by tailoring every aspect of the chip design - the underlying logic gates, electronic components, architecture, memory, I/O, and manufacturing process - specifically for the target application. This level of customization allows removing any unnecessary logic or functionality required for general computation. The result is an IC that maximizes performance and power efficiency on the desired workload. 
The efficiency gains from application-specific hardware are so substantial that these software-centric firms are dedicating enormous engineering resources to designing customized ASICs.\nThe rise of more complex machine learning algorithms has made the performance advantages enabled by tailored hardware acceleration a key competitive differentiator, even for companies traditionally concentrated on software engineering. ASICs have become a high-priority investment for major cloud providers aiming to offer faster AI computation.\n\nAdvantages\nASICs provide significant benefits over general purpose processors like CPUs and GPUs due to their customized nature. The key advantages include the following.\n\nMaximized Performance and Efficiency\nThe most fundamental advantage of ASICs is the ability to maximize performance and power efficiency by customizing the hardware architecture specifically for the target application. Every transistor and design aspect is optimized for the desired workload - no unnecessary logic or overhead is needed to support generic computation.\nFor example, Google’s Tensor Processing Units (TPUs) contain architectures tailored exactly for the matrix multiplication operations used in neural networks. To design the TPU ASICs, Google’s engineering teams need to clearly define the chip specifications, write the architecture description using Hardware Description Languages like Verilog, synthesize the design to map it to hardware components, and carefully place-and-route transistors and wires based on the fabrication process design rules. This complex design process, known as very-large-scale integration (VLSI), allows them to build an IC optimized just for machine learning workloads.\nAs a result, TPU ASICs achieve over an order of magnitude higher efficiency in operations per watt than general purpose GPUs on ML workloads by maximizing performance and minimizing power consumption through a full-stack custom hardware design.\n\n\nSpecialized On-Chip Memory\nASICs incorporate on-chip SRAM and caches specifically optimized to feed data to the computational units. For example, Apple’s M1 system-on-a-chip contains special low-latency SRAM to accelerate the performance of its Neural Engine machine learning hardware. Large local memory with high bandwidth enables keeping data as close as possible to the processing elements. This provides tremendous speed advantages compared to off-chip DRAM access, which is up to 100x slower.\nData locality and optimizing memory hierarchy is crucial for both high throughput and low power.Below is a table “Numbers Everyone Should Know” from Jeff Dean.\n\n\n\nOperation\nLatency\nNotes\n\n\n\n\nL1 cache reference\n0.5 ns\n\n\n\nBranch mispredict\n5 ns\n\n\n\nL2 cache reference\n7 ns\n\n\n\nMutex lock/unlock\n25 ns\n\n\n\nMain memory reference\n100 ns\n\n\n\nCompress 1K bytes with Zippy\n3,000 ns\n3 μs\n\n\nSend 1 KB bytes over 1 Gbps network\n10,000 ns\n10 μs\n\n\nRead 4 KB randomly from SSD\n150,000 ns\n150 μs\n\n\nRead 1 MB sequentially from memory\n250,000 ns\n250 μs\n\n\nRound trip within same datacenter\n500,000 ns\n0.5 ms\n\n\nRead 1 MB sequentially from SSD\n1,000,000 ns\n1 ms\n\n\nDisk seek\n10,000,000 ns\n10 ms\n\n\nRead 1 MB sequentially from disk\n20,000,000 ns\n20 ms\n\n\nSend packet CA->Netherlands->CA\n150,000,000 ns\n150 ms\n\n\n\n\n\nCustom Datatypes and Operations\nUnlike general purpose processors, ASICs can be designed to natively support custom datatypes like INT4 or bfloat16 that are widely used in ML models. 
For instance, Nvidia’s Ampere GPU architecture has dedicated bfloat16 Tensor Cores to accelerate AI workloads. Low precision datatypes enable higher arithmetic density and performance. ASICs can also directly incorporate non-standard operations common in ML algorithms as primitive operations - for example, natively supporting activation functions like ReLU makes execution more efficient. We encourage you to refer to the Efficient Numeric Representations chapter for additional details.\n\n\nHigh Parallelism\nASIC architectures can leverage much higher parallelism tuned for the target workload versus general purpose CPUs or GPUs. More computational units tailored for the application means more operations execute simultaneously. Highly parallel ASICs achieve tremendous throughput for data parallel workloads like neural network inference.\n\n\nAdvanced Process Nodes\nCutting edge manufacturing processes allow packing more transistors into smaller die areas, increasing density. ASICs designed specifically for high volume applications can better amortize the costs of bleeding edge process nodes.\n\n\n\nDisadvatages\n\nLong Design Timelines\nThe engineering process of designing and validating an ASIC can take 2-3 years. Synthesizing the architecture using hardware description languages, taping out the chip layout, and fabricating the silicon on advanced process nodes involves long development cycles. For example, to tape out a 7nm chip, teams need to carefully define specifications, write the architecture in HDL, synthesize the logic gates, place components, route all interconnections, and finalize the layout to send for fabrication. This very large scale integration (VLSI) flow means ASIC design and manufacturing can traditionally take 2-5 years.\nThere are a few key reasons why the long design timelines of ASICs, often 2-3 years, can be challenging for machine learning workloads:\n\nML algorithms evolve rapidly: New model architectures, training techniques, and network optimizations are constantly emerging. For example, Transformers became hugely popular in NLP in just the last few years. By the time an ASIC finishes tapeout, the optimal architecture for a workload may have changed.\nDatasets grow quickly: ASICs designed for certain model sizes or datatypes can become undersized relative to demand. For instance, natural language models are scaling exponentially with more data and parameters. A chip designed for BERT might not accommodate GPT-3.\nML applications change frequently: The industry focus shifts between computer vision, speech, NLP, recommender systems etc. An ASIC optimized for image classification may have less relevance in a few years.\nFaster design cycles with GPUs/FPGAs: Programmable accelerators like GPUs can adapt much quicker by upgrading software libraries and frameworks. New algorithms can be deployed without hardware changes.\nTime-to-market needs: Getting a competitive edge in ML requires rapidly experimenting with new ideas and deploying them. Waiting several years for an ASIC is not aligned with fast iteration.\n\nThe pace of innovation in ML is not well matched to the multi-year timescale for ASIC development. Significant engineering efforts are required to extend ASIC lifespan through modular architectures, process scaling, model compression, and other techniques. 
But the rapid evolution of ML makes fixed function hardware challenging.\n\n\nHigh Non-Recurring Engineering Costs\nThe fixed costs of taking an ASIC from design to high volume manufacturing can be very capital intensive, often tens of millions of dollars. Photomask fabrication for taping out chips in advanced process nodes, packaging, and one-time engineering efforts are expensive. For instance, a 7nm chip tapeout alone could cost tens of millions of dollars. The high non-recurring engineering (NRE) investment narrows ASIC viability to high-volume production use cases where the upfront cost can be amortized.\n Table from Enabling Cheaper Design\n\n\nComplex Integration and Programming\nASICs require extensive software integration work including drivers, compilers, OS support, and debugging tools. They also need expertise in electrical and thermal packaging. Additionally, programming ASIC architectures efficiently can involve challenges like workload partitioning and scheduling across many parallel units. The customized nature necessitates significant integration efforts to turn raw hardware into fully operational accelerators.\nWhile ASICs provide massive efficiency gains on target applications by tailoring every aspect of the hardware design to one specific task, their fixed nature results in tradeoffs in flexibility and development costs compared to programmable accelerators, which must be weighed based on the application.\n\n\n\n\n11.3.2 Field-Programmable Gate Arrays (FPGAs)\nFPGAs are programmable integrated circuits that can be reconfigured for different applications. Their customizable nature provides advantages for accelerating AI algorithms compared to fixed ASICs or inflexible GPUs. While Google, Meta, and NVIDIA which are looking at putting ASICs in data centers, Microsoft deployed FPGAs in their data centers (Putnam et al. 2014) in 2011 to efficiently serve diverse data center workloads.\n\nAdvantages\nFPGAs provide several benefits over GPUs and ASICs for accelerating machine learning workloads.\n\nFlexibility Through Reconfigurable Fabric\nThe key advantage of FPGAs is the ability to reconfigure the underlying fabric to implement custom architectures optimized for different models, unlike fixed-function ASICs. For example, quant trading firms use FPGAs to accelerate their algorithms because they change frequently, and the low NRE cost of FPGAs is more viable than taping out new ASICs.\n Comparison of FPGAs on the market (Gwennap, n.d.)\nGwennap, Linley. n.d. “Certus-NX Innovates General-Purpose FPGAs.”\n\nFPGAs are composed of basic building blocks - configurable logic blocks, RAM blocks, and interconnects. Vendors provide a base amount of these resources, and engineers program the chips by compiling HDL code into bitstreams that rearrange the fabric into different configurations. This makes FPGAs adaptable as algorithms evolve.\nWhile FPGAs may not achieve the utmost performance and efficiency of workload-specific ASICs, their programmability provides more flexibility as algorithms change. This adaptability makes FPGAs a compelling choice for accelerating evolving machine learning applications. For machine learning workloads, Microsoft has deployed FPGAs in its Azure data centers to serve diverse applications, instead of using ASICs. 
The programmability enables optimization across changing ML models.\n\n\nCustomized Parallelism and Pipelining\nFPGA architectures can leverage spatial parallelism and pipelining by tailoring the hardware design to mirror the parallelism in ML models. For example, Intel’s HARPv2 FPGA platform splits the layers of an MNIST convolutional network across separate processing elements to maximize throughput. Unique parallel patterns like tree ensemble evaluations are also possible on FPGAs. Deep pipelines with optimized buffering and dataflow can be customized to each model’s structure and datatypes. This level of tailored parallelism and pipelining is not feasible on GPUs.\n\n\nLow Latency On-Chip Memory\nLarge amounts of high bandwidth on-chip memory enables localized storage for weights and activations. For instance, Xilinx Versal FPGAs contain 32MB of low latency RAM blocks along with dual-channel DDR4 interfaces for external memory. Bringing memory physically closer to the compute units reduces access latency. This provides significant speed advantages over GPUs that must traverse PCIe or other system buses to reach off-chip GDDR6 memory.\n\n\nNative Support for Low Precision\nA key advantage of FPGAs is the ability to natively implement any bit width for arithmetic units, such as INT4 or bfloat16 used in quantized ML models. For example, Intel’s Stratix 10 NX FPGAs have dedicated INT8 cores that can achieve up to 143 INT8 TOPS at ~1 TOPS/W Intel® Stratix® 10 NX FPGA. Lower bit widths increase arithmetic density and performance. FPGAs can even support mixed precision or dynamic precision tuning at runtime.\n\n\n\nDisadvatages\n\nLower Peak Throughput than ASICs\nFPGAs cannot match the raw throughput numbers of ASICs customized for a specific model and precision. The overheads of the reconfigurable fabric compared to fixed function hardware result in lower peak performance. For example, the TPU v5e pods allow up to 256 chips to be connected with more than 100 petaOps of INT8 performance while FPGAs can offer up to 143 INT8 TOPS or 286 INT4 TOPS Intel® Stratix® 10 NX FPGA.\nThis is because FPGAs are composed of basic building blocks - configurable logic blocks, RAM blocks, and interconnects. Vendors provide a set amount of these resources. To program FPGAs, engineers write HDL code and compile into bitstreams that rearrange the fabric, which has inherent overheads versus an ASIC purpose-built for one computation.\n\n\nProgramming Complexity\nTo optimize FPGA performance, engineers must program the architectures in low-level hardware description languages like Verilog or VHDL. This requires hardware design expertise and longer development cycles versus higher level software frameworks like TensorFlow. Maximizing utilization can be challenging despite advances in high-level synthesis from C/C++.\n\n\nReconfiguration Overheads\nTo change FPGA configurations requires reloading a new bitstream, which has considerable latency and storage size costs. For example, partial reconfiguration on Xilinx FPGAs can take 100s of milliseconds. This makes dynamically swapping architectures in real-time infeasible. The bitstream storage also consumes on-chip memory.\n\n\nDiminishing Gains on Advanced Nodes\nWhile smaller process nodes benefit ASICs greatly, they provide less advantages for FPGAs. At 7nm and below, effects like process variation, thermal constraints, and aging disproportionately impact FPGA performance. 
The overheads of configurable fabric also diminish gains vs fixed function ASICs.\n\n\nCase Study\nFPGAs have found widespread application in various fields, including medical imaging, robotics, and finance, where they excel in handling computationally intensive machine learning tasks. In the context of medical imaging, an illustrative example is the application of FPGAs for brain tumor segmentation, a traditionally time-consuming and error-prone process. For instance, Xiong et al. developed a quantized segmentation accelerator, which they retrained using the BraTS19 and BraTS20 datasets. Their work yielded remarkable results, achieving over 5x and 44x performance improvements, as well as 11x and 82x energy efficiency gains compared to GPU and CPU implementations, respectively (Xiong et al. 2021).\n\nXiong, Siyu, Guoqing Wu, Xitian Fan, Xuan Feng, Zhongcheng Huang, Wei Cao, Xuegong Zhou, et al. 2021. “MRI-Based Brain Tumor Segmentation Using FPGA-Accelerated Neural Network.” BMC Bioinformatics 22 (1): 421. https://doi.org/10.1186/s12859-021-04347-6.\n\n\n\n\n11.3.3 Digital Signal Processors (DSPs)\nThe first digital signal processor core was built in 1948 by Texas Instruments (“The Evolution of Audio DSPs”). Traditionally, DSPs would have logic to allow them to directly access digital/audio data in memory, perform an arithmetic operation (multiply-add-accumulate–MAC–was one of the most common operations) and then write the result back to memory. The DSP would also include specialized analog components to retrieve said digital/audio data.\nOnce we entered the smartphone era, DSPs started encompassing more sophisticated tasks. They required Bluetooth, Wi-Fi, and cellular connectivity. Media also became much more complex. Today, it’s not common to have entire chips dedicated to just DSP, but a System on Chip would include DSPs in addition to general-purpose CPUs. For example, Qualcomm’s Hexagon Digital Signal Processor claims to be a “world-class processor with both CPU and DSP functionality to support deeply embedded processing needs of the mobile platform for both multimedia and modem functions.” Google Tensors, the chip in the Google Pixel phones, also includes both CPUs and specialized DSP engines.\n\nAdvatages\nDSPs architecturally provide advantages in vector math throughput, low latency memory access, power efficiency, and support for diverse datatypes - making them well-suited for embedded ML acceleration.\n\nOptimized Architecture for Vector Math\nDSPs contain specialized data paths, register files, and instructions optimized specifically for vector math operations commonly used in machine learning models. This includes dot product engines, MAC units, and SIMD capabilities tailored for vector/matrix calculations. For example, the CEVA-XM6 DSP (“Ceva SensPro Fuses AI and Vector DSP”) has 512-bit vector units to accelerate convolutions. This efficiency on vector math workloads is far beyond general CPUs.\n\n\nLow Latency On-Chip Memory\nDSPs integrate large amounts of fast on-chip SRAM memory to hold data locally for processing. Bringing memory physically closer to the computation units reduces access latency. For example, Analog’s SHARC+ DSP contains 10MB of on-chip SRAM. This high-bandwidth local memory provides speed advantages for real-time applications.\n\n\nPower Efficiency\nDSPs are engineered to provide high performance per watt on digital signal workloads. 
Efficient data paths, parallelism, and memory architectures enable trillions of math operations per second within tight mobile power budgets. For example, Qualcomm’s Hexagon DSP can deliver 4 trillion operations per second (TOPS) while consuming minimal watts.\n\n\nSupport for Integer and Floating Point Math\nUnlike GPUs which excel at single or half precision, DSPs can natively support both 8/16-bit integer and 32-bit floating point datatypes used across ML models. Some DSPs even support dot product acceleration at INT8 precision for quantized neural networks.\n\n\n\nDisadvatages\nDSPs make architectural tradeoffs that limit peak throughput, precision, and model capacity compared to other AI accelerators. But their advantages in power efficiency and integer math make them a strong edge compute option. So while DSPs provide some benefits over CPUs, they also come with limitations for machine learning workloads:\n\nLower Peak Throughput than ASICs/GPUs\nDSPs cannot match the raw computational throughput of GPUs or customized ASICs designed specifically for machine learning. For example, Qualcomm’s Cloud AI 100 ASIC delivers 480 TOPS on INT8, while their Hexagon DSP provides 10 TOPS. DSPs lack the massive parallelism of GPU SM units.\n\n\nSlower Double Precision Performance\nMost DSPs are not optimized for higher precision floating point needed in some ML models. Their dot product engines focus on INT8/16 and FP32 which provides better power efficiency. But 64-bit floating point throughput is much lower. This can limit usage in models requiring high precision.\n\n\nConstrained Model Capacity\nThe limited on-chip memory of DSPs constrains the model sizes that can be run. Large deep learning models with hundreds of megabytes of parameters would exceed on-chip SRAM capacity. DSPs are best suited for small to mid-sized models targeted for edge devices.\n\n\nProgramming Complexity\nEfficiently programming DSP architectures requires expertise in parallel programming and optimizing data access patterns. Their specialized microarchitectures have more learning curve than high-level software frameworks. This makes development more complex.\n\n\n\n\n11.3.4 Graphics Processing Units (GPUs)\nThe term graphics processing unit existed since at least the 1980s. There had always been a demand for graphics hardware in both video game consoles (high demand, needed to be relatively lower cost) and scientific simulations (lower demand, but needed higher resolution, could be at a high price point).\nThe term was popularized, however, in 1999 when NVIDIA launched the GeForce 256 mainly targeting the PC games market sector (Lindholm et al. 2008). As PC games became more sophisticated, NVIDIA GPUs became more programmable over time as well. Soon, users realized they could take advantage of this programmability and run a variety of non-graphics related workloads on GPUs and benefit from the underlying architecture. And so, starting in the late 2000s, GPUs became general-purpose graphics processing units or GP-GPUs.\n\nLindholm, Erik, John Nickolls, Stuart Oberman, and John Montrym. 2008. “NVIDIA Tesla: A Unified Graphics and Computing Architecture.” IEEE Micro 28 (2): 39–55. https://doi.org/10.1109/MM.2008.31.\nIntel Arc Graphics and AMD Radeon RX have also developed their GPUs over time.\n\nAdvatages\n\nHigh Computational Throughput\nThe key advantage of GPUs is their ability to perform massively parallel floating point calculations optimized for computer graphics and linear algebra (Raina, Madhavan, and Ng 2009). 
Modern GPUs like Nvidia’s A100 offer up to 19.5 teraflops of FP32 performance with 6912 CUDA cores and 40GB of graphics memory that is tightly coupled with 1.6TB/s of graphics memory bandwidth.\n\nRaina, Rajat, Anand Madhavan, and Andrew Y. Ng. 2009. “Large-Scale Deep Unsupervised Learning Using Graphics Processors.” In Proceedings of the 26th Annual International Conference on Machine Learning, 873–80. Montreal Quebec Canada: ACM. https://doi.org/10.1145/1553374.1553486.\nThis raw throughput stems from the highly parallel streaming multiprocessor (SM) architecture tailored for data-parallel workloads (Zhihao Jia, Zaharia, and Aiken 2019). Each SM contains hundreds of scalar cores optimized for float32/64 math. With thousands of SMs on chip, GPUs are purpose-built for matrix multiplication and vector operations used throughout neural networks.\nFor example, Nvidia’s latest H100 GPU provides 4000 TFLOPs of FP8, 2000 TFLOPs of FP16, 1000 TFLOPs of TF32, 67 TFLOPs of FP32 and 34 TFLOPs of FP64 compute performance, which can dramatically accelerate large batch training on models like BERT, GPT-3, and other transformer architectures. The scalable parallelism of GPUs is key to speeding up computationally intensive deep learning.\n\n\nMature Software Ecosystem\nNvidia provides extensive runtime libraries like cuDNN and cuBLAS that are highly optimized for deep learning primitives. Frameworks like TensorFlow and PyTorch integrate with these libraries to enable GPU acceleration with no direct programming. CUDA provides lower-level control for custom computations.\nThis ecosystem enables quickly leveraging GPUs via high-level Python without GPU programming expertise. Known workflows and abstractions provide a convenient on-ramp for scaling up deep learning experiments. The software maturity supplements the throughput advantages.\n\n\nBroad Availability\nThe economies of scale of graphics processing make GPUs broadly accessible in data centers, cloud platforms like AWS and GCP, and desktop workstations. Their availability in research environments has provided a convenient platform for ML experimentation and innovation. For example, nearly every state-of-the-art deep learning result has involved GPU acceleration because of this ubiquity. The broad access supplements the software maturity to make GPUs the standard ML accelerator.\n\n\nProgrammable Architecture\nWhile not as flexible as FPGAs, GPUs do provide programmability via CUDA and shader languages to customize computations. Developers can optimize data access patterns, create new ops, and tune precisions for evolving models and algorithms.\n\n\n\nDisadvantages\nWhile GPUs have become the standard accelerator for deep learning, their architecture also comes with some key downsides.\n\nLess Efficient than Custom ASICs\nThe statement “GPUs are less efficient than ASICs” could spark intense debate within the ML/AI field and cause this book to explode 🤯.\nTypically, GPUs are perceived as less efficient than ASICs because the latter are custom-built for specific tasks and thus can operate more efficiently by design. GPUs, with their general-purpose architecture, are inherently more versatile and programmable, catering to a broad spectrum of computational tasks beyond ML/AI.\nHowever, modern GPUs have evolved to include specialized hardware support for essential AI operations, such as generalized matrix multiplication (GEMM) and other matrix operations, which are critical for running ML models effectively. 
These enhancements have significantly improved the efficiency of GPUs for AI tasks, to the point where they can rival the performance of ASICs for certain applications.\nConsequently, some might argue that contemporary GPUs represent a convergence of sorts, incorporating specialized, ASIC-like capabilities within a flexible, general-purpose processing framework. This adaptability has blurred the lines between the two types of hardware, with GPUs offering a strong balance of specialization and programmability that is well-suited to the dynamic needs of ML/AI research and development.\n\n\nHigh Memory Bandwidth Needs\nThe massively parallel architecture requires tremendous memory bandwidth to supply thousands of cores as shown in Figure 1. For example, the Nvidia A100 GPU requires 1.6TB/sec to fully saturate its compute. GPUs rely on wide 384-bit memory buses to high bandwidth GDDR6 RAM, but even the fastest GDDR6 tops out around 1 TB/sec. This dependence on external DRAM incurs latency and power overheads.\n\n\nProgramming Complexity\nWhile tools like CUDA help, optimally mapping and partitioning ML workloads across the massively parallel GPU architecture remains challenging. Achieving both high utilization and memory locality requires low-level tuning (Zhe Jia et al. 2018). Abstractions like TensorFlow can leave performance on the table.\n\nJia, Zhe, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. 2018. “Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking.” arXiv. http://arxiv.org/abs/1804.06826.\n\n\nLimited On-Chip Memory\nGPUs have relatively small on-chip memory caches compared to the large working set requirements of ML models during training. They are reliant on high bandwidth access to external DRAM, which ASICs minimize with large on-chip SRAM.\n\n\nFixed Architecture\nUnlike FPGAs, the fundamental GPU architecture cannot be altered post-manufacture. This constraint limits adapting to novel ML workloads or layers. The CPU-GPU boundary also creates data movement overheads.\n\n\n\nCase Study\nA prominent example is the groundbreaking research conducted by OpenAI (Brown et al. 2020) with their GPT-3 model. GPT-3, a language model consisting of 175 billion parameters, demonstrated unprecedented language understanding and generation capabilities. Its training, which would have taken months on conventional CPUs, was accomplished in a matter of days using powerful GPUs, thus pushing the boundaries of natural language processing (NLP) capabilities.\n\n\n\n11.3.5 Central Processing Units (CPUs)\nThe term CPU has a long history that dates back to 1955 (Weik 1955) while the first microprocessor CPU–the Intel 4004–was invented in 1971 (Who Invented the Microprocessor?). Compilers compile high-level programming languages like Python, Java, or C to assembly instructions (x86, ARM, RISC-V, etc.) for CPUs to process. The set of instructions a CPU understands is called the “instruction set” and must be agreed upon by both the hardware and software running atop it (See section 5 for a more in-depth description of instruction set architectures–ISAs).\n\nWeik, Martin H. 1955. A Survey of Domestic Electronic Digital Computing Systems. Ballistic Research Laboratories.\nAn overview of significant developments in CPUs:\n\nSingle-core Era (1950s-2000): This era is known for seeing aggressive microarchitectural improvements. 
Techniques like speculative execution (executing an instruction before the previous one was done), out-of-order execution (re-ordering instructions to be more effective), and wider issue widths (executing multiple instructions at once) were implemented to increase instruction throughput. The term “System on Chip” also originated in this era as different analog components (components designed with transistors) and digital components (components designed with hardware description languages that are mapped to transistors) were put on the same platform to achieve some task.\nMulti-core Era (2000s): Driven by the slowing of Moore’s Law, this era is marked by scaling the number of cores within a CPU. Now tasks can be split across many different cores each with its own datapath and control unit. Many of the issues arising in this era pertained to how to share certain resources, which resources to share, and how to maintain coherency and consistency across all the cores.\nSea of accelerators (2010s): Again, driven by the slowing of Moore’s Law, this era is marked by offloading more complicated tasks to accelerators (widgets) attached to the main datapath in CPUs. It’s common to see accelerators dedicated to various AI workloads, as well as image/digital processing, and cryptography. In these designs, CPUs are often described more as arbiters, deciding which tasks should be processed rather than doing the processing themselves. Any task could still be run on the CPU rather than the accelerators, but the CPU would generally be slower. However, the cost of designing and especially programming the accelerator became a non-trivial hurdle that led to a spike of interest in domain-specific languages (DSLs).\nPresence in data centers: Although we often hear that GPUs dominate the data center market, CPUs are still well suited for tasks that don’t inherently possess a large amount of parallelism. CPUs often handle serial and small tasks and coordinate the data center as a whole.\nOn the edge: Given the tighter resource constraints on the edge, edge CPUs often only implement a subset of the techniques developed in the single-core era because these optimizations tend to be heavy on power and area consumption. Edge CPUs still maintain a relatively simple datapath with limited memory capacities.\n\nTraditionally, CPUs have been synonymous with general-purpose computing–a term whose meaning has also shifted as the “average” workload a consumer runs has changed over time. For example, floating point components were once considered reserved for “scientific computing” so they were usually implemented as a co-processor (a modular component that worked in tandem with the datapath) and seldom deployed to average consumers. Compare this attitude to today, where FPUs are built into every datapath.\n\nAdvantages\nWhile limited in raw throughput, general-purpose CPUs do provide some practical benefits for AI acceleration.\n\nGeneral Programmability\nCPUs support diverse workloads beyond ML, providing flexible general-purpose programmability. This versatility comes from their standardized instruction sets and mature compiler ecosystems that allow running any application from databases and web servers to analytics pipelines (Hennessy and Patterson 2019).\n\nHennessy, John L, and David A Patterson. 2019. “A New Golden Age for Computer Architecture.” Commun. ACM 62 (2): 48–60.\nThis avoids the need for dedicated ML accelerators and enables leveraging existing CPU-based infrastructure for basic ML deployment. 
For example, x86 servers from vendors like Intel and AMD can run common ML frameworks using Python and TensorFlow packages alongside other enterprise workloads.\n\n\nMature Software Ecosystem\nFor decades, highly optimized math libraries like BLAS, LAPACK, and FFTW have leveraged vectorized instructions and multithreading on CPUs (Dongarra 2009). Major ML frameworks like PyTorch, TensorFlow, and SciKit-Learn are designed to integrate seamlessly with these CPU math kernels.\n\nDongarra, Jack J. 2009. “The Evolution of High Performance Computing on System z.” IBM Journal of Research and Development 53: 3–4.\nHardware vendors like Intel and AMD also provide low-level libraries to fully optimize performance for deep learning primitives (AI Inference Acceleration on CPUs). This robust, mature software ecosystem allows quickly deploying ML on existing CPU infrastructure.\n\n\nWide Availability\nThe economies of scale of CPU manufacturing, driven by demand across many markets like PCs, servers, and mobile, make them ubiquitously available. Intel CPUs, for example, have powered most servers for decades (Ranganathan 2011). This wide availability in data centers reduces hardware costs for basic ML deployment.\n\nRanganathan, Parthasarathy. 2011. “From Microprocessors to Nanostores: Rethinking Data-Centric Systems.” Computer (Long Beach Calif.) 44 (1): 39–48.\nEven small embedded devices typically integrate some CPU, enabling edge inference. The ubiquity reduces the need to purchase specialized ML accelerators in many situations.\n\n\nLow Power for Inference\nOptimizations like vector extensions in ARM Neon and Intel AVX provide power efficient integer and floating point throughput optimized for “bursty” workloads like inference (Ignatov et al. 2018a). While slower than GPUs, CPU inference can be deployed in power-constrained environments. For example, ARM’s Cortex-M CPUs now deliver over 1 TOPS of INT8 performance under 1W, enabling keyword spotting and vision applications on edge devices (ARM).\n\nIgnatov, Andrey, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, and Luc Van Gool. 2018a. “AI Benchmark: Running Deep Neural Networks on Android Smartphones.”\n\n\n\nDisadvantages\nWhile providing some advantages, general-purpose CPUs also come with limitations for AI workloads.\n\nLower Throughput than Accelerators\nCPUs lack the specialized architectures for massively parallel processing that GPUs and other accelerators provide. Their general-purpose design results in lower computational throughput for the highly parallelizable math operations common in ML models (Norman P. Jouppi et al. 2017b).\n\nJouppi, Norman P, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, et al. 2017b. “In-Datacenter Performance Analysis of a Tensor Processing Unit.” In Proceedings of the 44th Annual International Symposium on Computer Architecture, 1–12.\n\n\nNot Optimized for Data Parallelism\nThe architectures of CPUs are not specifically optimized for data parallel workloads inherent to AI (Sze et al. 2017a). They allocate substantial silicon area to instruction decoding, speculative execution, caching, and flow control that provide little benefit for the array operations used in neural networks (AI Inference Acceleration on CPUs).\n\nSze, Vivienne, Yu-Hsin Chen, Tien-Ju Yang, and Joel Emer. 2017a. “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” March. 
https://arxiv.org/abs/1703.09039.\nGPU streaming multiprocessors, for example, devote most transistors to floating point units instead of complex branch prediction logic. This specialization allows much higher utilization for ML math.\n\n\nHigher Memory Latency\nCPUs suffer from higher latency accessing main memory relative to GPUs and other accelerators (DDR). Techniques like tiling and caching can help, but the physical separation from off-chip RAM bottlenecks data-intensive ML workloads. This emphasizes the need for specialized memory architectures in ML hardware.\n\n\nPower Inefficiency Under Heavy Workloads\nWhile suitable for intermittent inference, sustaining near-peak throughput for training results in inefficient power consumption on CPUs, especially mobile CPUs (Ignatov et al. 2018b). Accelerators explicitly optimize the dataflow, memory, and computation for sustained ML workloads. For training large models, CPUs are energy-inefficient.\n\n\n\n11.3.6 Comparison\n\n\n\n\n\n\n\n\n\nAccelerator\nDescription\nKey Advantages\nKey Disadvantages\n\n\n\n\nASICs\nCustom ICs designed for a target workload like AI inference\n- Maximizes perf/watt - Optimized for tensor ops - Low latency on-chip memory\n- Fixed architecture lacks flexibility - High NRE cost - Long design cycles\n\n\nFPGAs\nReconfigurable fabric with programmable logic and routing\n- Flexible architecture - Low latency memory access\n- Lower perf/watt than ASICs - Complex programming\n\n\nGPUs\nOriginally for graphics, now used for neural network acceleration\n- High throughput - Parallel scalability - Software ecosystem with CUDA\n- Not as power efficient as ASICs - Require high memory bandwidth\n\n\nCPUs\nGeneral purpose processors\n- Programmability - Ubiquitous availability\n- Lower performance for AI workloads\n\n\n\nIn general, CPUs provide a readily available baseline, GPUs deliver broadly accessible acceleration, FPGAs offer programmability, and ASICs maximize efficiency for fixed functions. The optimal choice depends on the scale, cost, flexibility, and other requirements of the target application.\nAlthough first developed for data center deployment, Google has also put considerable effort into developing Edge TPUs. These Edge TPUs retain the systolic array inspiration but are tailored to the limited resources available at the edge."
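The “Mature Software Ecosystem” point above — that frameworks such as PyTorch reach optimized vendor libraries like cuDNN and cuBLAS through a device abstraction, so the same high-level code uses a GPU when one is present and falls back to the general-purpose CPU baseline otherwise — can be illustrated with a minimal sketch. This assumes PyTorch is installed; the layer sizes and batch shape are arbitrary placeholders, not values from the text.

```python
# Minimal sketch: the framework's device abstraction hides the accelerator.
import torch
import torch.nn as nn

# Fall back to the CPU baseline when no CUDA-capable GPU is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A small feed-forward network; on a GPU, the matrix multiplications are
# dispatched to vendor-optimized kernels without any device-specific code
# from the user.
model = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
).to(device)

batch = torch.randn(32, 256, device=device)  # synthetic input batch
with torch.no_grad():
    logits = model(batch)  # runs on whichever device was selected
print(logits.shape, "computed on", device)
```

The same few lines run unchanged on a CPU-only server or a GPU workstation; only the throughput changes, which is the practical upshot of the comparison table above.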
+ "text": "11.3 Accelerator Types\nHardware accelerators can take on many forms. They can exist as a widget (like the Neural Engine in the Apple M1 chip) or as entire chips specially designed to perform certain tasks very well. In this section, we will examine processors for machine learning workloads along the spectrum from highly specialized ASICs to more general-purpose CPUs. We first focus on custom hardware purpose-built for AI to understand the most extreme optimizations possible when design constraints are removed. This establishes a ceiling for performance and efficiency.\nWe then progressively consider more programmable and adaptable architectures with discussions of GPUs and FPGAs. These make tradeoffs in customization to maintain flexibility. Finally, we cover general-purpose CPUs which sacrifice optimizations for a particular workload in exchange for versatile programmability across applications.\nBy structuring the analysis along this spectrum, we aim to illustrate the fundamental tradeoffs in accelerator design between utilization, efficiency, programmability, and flexibility. The optimal balance point depends on the constraints and requirements of the target application. This spectrum perspective provides a framework for reasoning about hardware choices for machine learning and the capabilities required at each level of specialization.\nThe progression begins with the most specialized option, ASICs purpose-built for AI, to ground our understanding in the maximum possible optimizations before expanding to more generalizable architectures. This structured approach aims to elucidate the accelerator design space.\n This graph illustrates the complex interplay between flexibility, performance, functional diversity, and area for various types of hardware processors. As the area of a processor increases, so does its potential for functional diversity and flexibility. However, a design could dedicate that additional area to target application specific tasks. Performance will always depend on how and how effectively that area is utilized. Optimal design always requires balancing these factors according to a hierarchy of application requirements.\n\n11.3.1 Application-Specific Integrated Circuits (ASICs)\nAn Application-Specific Integrated Circuit (ASIC) is a type of integrated circuit (IC) that is custom-designed for a specific application or workload, rather than for general-purpose use. Unlike CPUs and GPUs, ASICs do not support multiple applications or workloads. Rather, they are optimized to perform a single task extremely efficiently. Apple’s M1/2/3, AMD’s Neoverse, Intel’s i5/7/9, Google’s TPUs, and NVIDIA’s GPUs are all examples of ASICs.\nASICs achieve this efficiency by tailoring every aspect of the chip design - the underlying logic gates, electronic components, architecture, memory, I/O, and manufacturing process - specifically for the target application. This level of customization allows removing any unnecessary logic or functionality required for general computation. The result is an IC that maximizes performance and power efficiency on the desired workload. 
The efficiency gains from application-specific hardware are so substantial that these software-centric firms are dedicating enormous engineering resources to designing customized ASICs.\nThe rise of more complex machine learning algorithms has made the performance advantages enabled by tailored hardware acceleration a key competitive differentiator, even for companies traditionally concentrated on software engineering. ASICs have become a high-priority investment for major cloud providers aiming to offer faster AI computation.\n\nAdvantages\nASICs provide significant benefits over general purpose processors like CPUs and GPUs due to their customized nature. The key advantages include the following.\n\nMaximized Performance and Efficiency\nThe most fundamental advantage of ASICs is the ability to maximize performance and power efficiency by customizing the hardware architecture specifically for the target application. Every transistor and design aspect is optimized for the desired workload - no unnecessary logic or overhead is needed to support generic computation.\nFor example, Google’s Tensor Processing Units (TPUs) contain architectures tailored exactly for the matrix multiplication operations used in neural networks. To design the TPU ASICs, Google’s engineering teams need to clearly define the chip specifications, write the architecture description using Hardware Description Languages like Verilog, synthesize the design to map it to hardware components, and carefully place-and-route transistors and wires based on the fabrication process design rules. This complex design process, known as very-large-scale integration (VLSI), allows them to build an IC optimized just for machine learning workloads.\nAs a result, TPU ASICs achieve over an order of magnitude higher efficiency in operations per watt than general purpose GPUs on ML workloads by maximizing performance and minimizing power consumption through a full-stack custom hardware design.\n\n\nSpecialized On-Chip Memory\nASICs incorporate on-chip SRAM and caches specifically optimized to feed data to the computational units. For example, Apple’s M1 system-on-a-chip contains special low-latency SRAM to accelerate the performance of its Neural Engine machine learning hardware. Large local memory with high bandwidth enables keeping data as close as possible to the processing elements. This provides tremendous speed advantages compared to off-chip DRAM access, which is up to 100x slower.\nData locality and optimizing memory hierarchy is crucial for both high throughput and low power.Below is a table “Numbers Everyone Should Know” from Jeff Dean.\n\n\n\nOperation\nLatency\nNotes\n\n\n\n\nL1 cache reference\n0.5 ns\n\n\n\nBranch mispredict\n5 ns\n\n\n\nL2 cache reference\n7 ns\n\n\n\nMutex lock/unlock\n25 ns\n\n\n\nMain memory reference\n100 ns\n\n\n\nCompress 1K bytes with Zippy\n3,000 ns\n3 μs\n\n\nSend 1 KB bytes over 1 Gbps network\n10,000 ns\n10 μs\n\n\nRead 4 KB randomly from SSD\n150,000 ns\n150 μs\n\n\nRead 1 MB sequentially from memory\n250,000 ns\n250 μs\n\n\nRound trip within same datacenter\n500,000 ns\n0.5 ms\n\n\nRead 1 MB sequentially from SSD\n1,000,000 ns\n1 ms\n\n\nDisk seek\n10,000,000 ns\n10 ms\n\n\nRead 1 MB sequentially from disk\n20,000,000 ns\n20 ms\n\n\nSend packet CA->Netherlands->CA\n150,000,000 ns\n150 ms\n\n\n\n\n\nCustom Datatypes and Operations\nUnlike general purpose processors, ASICs can be designed to natively support custom datatypes like INT4 or bfloat16 that are widely used in ML models. 
For instance, Nvidia’s Ampere GPU architecture has dedicated bfloat16 Tensor Cores to accelerate AI workloads. Low precision datatypes enable higher arithmetic density and performance. ASICs can also directly incorporate non-standard operations common in ML algorithms as primitive operations - for example, natively supporting activation functions like ReLU makes execution more efficient. We encourage you to refer to the Efficient Numeric Representations chapter for additional details.\n\n\nHigh Parallelism\nASIC architectures can leverage much higher parallelism tuned for the target workload versus general purpose CPUs or GPUs. More computational units tailored for the application means more operations execute simultaneously. Highly parallel ASICs achieve tremendous throughput for data parallel workloads like neural network inference.\n\n\nAdvanced Process Nodes\nCutting edge manufacturing processes allow packing more transistors into smaller die areas, increasing density. ASICs designed specifically for high volume applications can better amortize the costs of bleeding edge process nodes.\n\n\n\nDisadvantages\n\nLong Design Timelines\nThe engineering process of designing and validating an ASIC can take 2-3 years. Synthesizing the architecture using hardware description languages, taping out the chip layout, and fabricating the silicon on advanced process nodes involves long development cycles. For example, to tape out a 7nm chip, teams need to carefully define specifications, write the architecture in HDL, synthesize the logic gates, place components, route all interconnections, and finalize the layout to send for fabrication. This very large scale integration (VLSI) flow means ASIC design and manufacturing can traditionally take 2-5 years.\nThere are a few key reasons why the long design timelines of ASICs, often 2-3 years, can be challenging for machine learning workloads:\n\nML algorithms evolve rapidly: New model architectures, training techniques, and network optimizations are constantly emerging. For example, Transformers became hugely popular in NLP in just the last few years. By the time an ASIC finishes tapeout, the optimal architecture for a workload may have changed.\nDatasets grow quickly: ASICs designed for certain model sizes or datatypes can become undersized relative to demand. For instance, natural language models are scaling exponentially with more data and parameters. A chip designed for BERT might not accommodate GPT-3.\nML applications change frequently: The industry focus shifts between computer vision, speech, NLP, recommender systems etc. An ASIC optimized for image classification may have less relevance in a few years.\nFaster design cycles with GPUs/FPGAs: Programmable accelerators like GPUs can adapt much quicker by upgrading software libraries and frameworks. New algorithms can be deployed without hardware changes.\nTime-to-market needs: Getting a competitive edge in ML requires rapidly experimenting with new ideas and deploying them. Waiting several years for an ASIC is not aligned with fast iteration.\n\nThe pace of innovation in ML is not well matched to the multi-year timescale for ASIC development. Significant engineering efforts are required to extend ASIC lifespan through modular architectures, process scaling, model compression, and other techniques. 
But the rapid evolution of ML makes fixed function hardware challenging.\n\n\nHigh Non-Recurring Engineering Costs\nThe fixed costs of taking an ASIC from design to high volume manufacturing can be very capital intensive, often tens of millions of dollars. Photomask fabrication for taping out chips in advanced process nodes, packaging, and one-time engineering efforts are expensive. For instance, a 7nm chip tapeout alone could cost tens of millions of dollars. The high non-recurring engineering (NRE) investment narrows ASIC viability to high-volume production use cases where the upfront cost can be amortized.\n Table from Enabling Cheaper Design\n\n\nComplex Integration and Programming\nASICs require extensive software integration work including drivers, compilers, OS support, and debugging tools. They also need expertise in electrical and thermal packaging. Additionally, programming ASIC architectures efficiently can involve challenges like workload partitioning and scheduling across many parallel units. The customized nature necessitates significant integration efforts to turn raw hardware into fully operational accelerators.\nWhile ASICs provide massive efficiency gains on target applications by tailoring every aspect of the hardware design to one specific task, their fixed nature results in tradeoffs in flexibility and development costs compared to programmable accelerators, which must be weighed based on the application.\n\n\n\n\n11.3.2 Field-Programmable Gate Arrays (FPGAs)\nFPGAs are programmable integrated circuits that can be reconfigured for different applications. Their customizable nature provides advantages for accelerating AI algorithms compared to fixed ASICs or inflexible GPUs. While Google, Meta, and NVIDIA which are looking at putting ASICs in data centers, Microsoft deployed FPGAs in their data centers (Putnam et al. 2014) in 2011 to efficiently serve diverse data center workloads.\n\nAdvantages\nFPGAs provide several benefits over GPUs and ASICs for accelerating machine learning workloads.\n\nFlexibility Through Reconfigurable Fabric\nThe key advantage of FPGAs is the ability to reconfigure the underlying fabric to implement custom architectures optimized for different models, unlike fixed-function ASICs. For example, quant trading firms use FPGAs to accelerate their algorithms because they change frequently, and the low NRE cost of FPGAs is more viable than taping out new ASICs.\n Comparison of FPGAs on the market (Gwennap, n.d.)\nGwennap, Linley. n.d. “Certus-NX Innovates General-Purpose FPGAs.”\n\nFPGAs are composed of basic building blocks - configurable logic blocks, RAM blocks, and interconnects. Vendors provide a base amount of these resources, and engineers program the chips by compiling HDL code into bitstreams that rearrange the fabric into different configurations. This makes FPGAs adaptable as algorithms evolve.\nWhile FPGAs may not achieve the utmost performance and efficiency of workload-specific ASICs, their programmability provides more flexibility as algorithms change. This adaptability makes FPGAs a compelling choice for accelerating evolving machine learning applications. For machine learning workloads, Microsoft has deployed FPGAs in its Azure data centers to serve diverse applications, instead of using ASICs. 
The programmability enables optimization across changing ML models.\n\n\nCustomized Parallelism and Pipelining\nFPGA architectures can leverage spatial parallelism and pipelining by tailoring the hardware design to mirror the parallelism in ML models. For example, Intel’s HARPv2 FPGA platform splits the layers of an MNIST convolutional network across separate processing elements to maximize throughput. Unique parallel patterns like tree ensemble evaluations are also possible on FPGAs. Deep pipelines with optimized buffering and dataflow can be customized to each model’s structure and datatypes. This level of tailored parallelism and pipelining is not feasible on GPUs.\n\n\nLow Latency On-Chip Memory\nLarge amounts of high bandwidth on-chip memory enables localized storage for weights and activations. For instance, Xilinx Versal FPGAs contain 32MB of low latency RAM blocks along with dual-channel DDR4 interfaces for external memory. Bringing memory physically closer to the compute units reduces access latency. This provides significant speed advantages over GPUs that must traverse PCIe or other system buses to reach off-chip GDDR6 memory.\n\n\nNative Support for Low Precision\nA key advantage of FPGAs is the ability to natively implement any bit width for arithmetic units, such as INT4 or bfloat16 used in quantized ML models. For example, Intel’s Stratix 10 NX FPGAs have dedicated INT8 cores that can achieve up to 143 INT8 TOPS at ~1 TOPS/W Intel® Stratix® 10 NX FPGA. Lower bit widths increase arithmetic density and performance. FPGAs can even support mixed precision or dynamic precision tuning at runtime.\n\n\n\nDisadvatages\n\nLower Peak Throughput than ASICs\nFPGAs cannot match the raw throughput numbers of ASICs customized for a specific model and precision. The overheads of the reconfigurable fabric compared to fixed function hardware result in lower peak performance. For example, the TPU v5e pods allow up to 256 chips to be connected with more than 100 petaOps of INT8 performance while FPGAs can offer up to 143 INT8 TOPS or 286 INT4 TOPS Intel® Stratix® 10 NX FPGA.\nThis is because FPGAs are composed of basic building blocks - configurable logic blocks, RAM blocks, and interconnects. Vendors provide a set amount of these resources. To program FPGAs, engineers write HDL code and compile into bitstreams that rearrange the fabric, which has inherent overheads versus an ASIC purpose-built for one computation.\n\n\nProgramming Complexity\nTo optimize FPGA performance, engineers must program the architectures in low-level hardware description languages like Verilog or VHDL. This requires hardware design expertise and longer development cycles versus higher level software frameworks like TensorFlow. Maximizing utilization can be challenging despite advances in high-level synthesis from C/C++.\n\n\nReconfiguration Overheads\nTo change FPGA configurations requires reloading a new bitstream, which has considerable latency and storage size costs. For example, partial reconfiguration on Xilinx FPGAs can take 100s of milliseconds. This makes dynamically swapping architectures in real-time infeasible. The bitstream storage also consumes on-chip memory.\n\n\nDiminishing Gains on Advanced Nodes\nWhile smaller process nodes benefit ASICs greatly, they provide less advantages for FPGAs. At 7nm and below, effects like process variation, thermal constraints, and aging disproportionately impact FPGA performance. 
The overheads of configurable fabric also diminish gains versus fixed function ASICs.\n\n\nCase Study\nFPGAs have found widespread application in various fields, including medical imaging, robotics, and finance, where they excel in handling computationally intensive machine learning tasks. In the context of medical imaging, an illustrative example is the application of FPGAs for brain tumor segmentation, a traditionally time-consuming and error-prone process. For instance, Xiong et al. developed a quantized segmentation accelerator, which they retrained using the BraTS19 and BraTS20 datasets. Their work yielded remarkable results, achieving over 5x and 44x performance improvements, as well as 11x and 82x energy efficiency gains compared to GPU and CPU implementations, respectively (Xiong et al. 2021).\n\nXiong, Siyu, Guoqing Wu, Xitian Fan, Xuan Feng, Zhongcheng Huang, Wei Cao, Xuegong Zhou, et al. 2021. “MRI-Based Brain Tumor Segmentation Using FPGA-Accelerated Neural Network.” BMC Bioinformatics 22 (1): 421. https://doi.org/10.1186/s12859-021-04347-6.\n\n\n\n\n11.3.3 Digital Signal Processors (DSPs)\nTexas Instruments shipped some of the first commercial digital signal processor cores in the late 1970s and early 1980s (“The Evolution of Audio DSPs”). Traditionally, DSPs would have logic to allow them to directly access digital/audio data in memory, perform an arithmetic operation (multiply-accumulate, or MAC, was one of the most common operations) and then write the result back to memory. The DSP would also include specialized analog components to retrieve said digital/audio data.\nOnce we entered the smartphone era, DSPs started encompassing more sophisticated tasks. They required Bluetooth, Wi-Fi, and cellular connectivity. Media also became much more complex. Today, it’s not common to have entire chips dedicated to just DSP, but a System on Chip would include DSPs in addition to general-purpose CPUs. For example, Qualcomm’s Hexagon Digital Signal Processor claims to be a “world-class processor with both CPU and DSP functionality to support deeply embedded processing needs of the mobile platform for both multimedia and modem functions.” Google Tensors, the chip in the Google Pixel phones, also includes both CPUs and specialized DSP engines.\n\nAdvantages\nDSPs architecturally provide advantages in vector math throughput, low latency memory access, power efficiency, and support for diverse datatypes - making them well-suited for embedded ML acceleration.\n\nOptimized Architecture for Vector Math\nDSPs contain specialized data paths, register files, and instructions optimized specifically for vector math operations commonly used in machine learning models. This includes dot product engines, MAC units, and SIMD capabilities tailored for vector/matrix calculations. For example, the CEVA-XM6 DSP (“Ceva SensPro Fuses AI and Vector DSP”) has 512-bit vector units to accelerate convolutions. This efficiency on vector math workloads is far beyond general CPUs.\n\n\nLow Latency On-Chip Memory\nDSPs integrate large amounts of fast on-chip SRAM memory to hold data locally for processing. Bringing memory physically closer to the computation units reduces access latency. For example, Analog Devices’ SHARC+ DSP contains 10MB of on-chip SRAM. This high-bandwidth local memory provides speed advantages for real-time applications.\n\n\nPower Efficiency\nDSPs are engineered to provide high performance per watt on digital signal workloads. Efficient data paths, parallelism, and memory architectures enable trillions of math operations per second within tight mobile power budgets. For example, Qualcomm’s Hexagon DSP can deliver 4 trillion operations per second (TOPS) while consuming minimal watts.
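\nTo make the MAC-centric datapath concrete, the following minimal NumPy sketch mimics the pattern a DSP implements in hardware: multiply narrow integer samples against filter coefficients and accumulate into a wider register so the running sum cannot overflow (a toy illustration, not a performance model; the sample values are made up):\nimport numpy as np\n\n# 16-bit samples and FIR filter taps, as a DSP would hold them in local SRAM.\nsamples = np.array([1200, -3400, 560, 7800], dtype=np.int16)\ncoeffs = np.array([310, -120, 45, 990], dtype=np.int16)\n\nacc = 0\nfor s, c in zip(samples.astype(np.int32), coeffs.astype(np.int32)):\n    acc += int(s) * int(c)   # widen before multiplying so products cannot overflow\n\n# A DSP issues these MACs across parallel SIMD lanes; vectorized, it is a single dot product.\nassert acc == int(np.dot(samples.astype(np.int32), coeffs.astype(np.int32)))\nprint(acc)   # accumulated filter output for this frame\nDedicated MAC units and SIMD lanes let a DSP retire many of these operations per cycle at very low energy per operation, which is where the performance-per-watt advantage comes from.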
\n\n\nSupport for Integer and Floating Point Math\nUnlike GPUs which excel at single or half precision, DSPs can natively support both 8/16-bit integer and 32-bit floating point datatypes used across ML models. Some DSPs even support dot product acceleration at INT8 precision for quantized neural networks.\n\n\n\nDisadvantages\nDSPs make architectural tradeoffs that limit peak throughput, precision, and model capacity compared to other AI accelerators. But their advantages in power efficiency and integer math make them a strong edge compute option. So while DSPs provide some benefits over CPUs, they also come with limitations for machine learning workloads:\n\nLower Peak Throughput than ASICs/GPUs\nDSPs cannot match the raw computational throughput of GPUs or customized ASICs designed specifically for machine learning. For example, Qualcomm’s Cloud AI 100 ASIC delivers 480 TOPS on INT8, while its Hexagon DSP provides 10 TOPS. DSPs lack the massive parallelism of GPU SM units.\n\n\nSlower Double Precision Performance\nMost DSPs are not optimized for higher precision floating point needed in some ML models. Their dot product engines focus on INT8/16 and FP32 which provides better power efficiency. But 64-bit floating point throughput is much lower. This can limit usage in models requiring high precision.\n\n\nConstrained Model Capacity\nThe limited on-chip memory of DSPs constrains the model sizes that can be run. Large deep learning models with hundreds of megabytes of parameters would exceed on-chip SRAM capacity. DSPs are best suited for small to mid-sized models targeted for edge devices.\n\n\nProgramming Complexity\nEfficiently programming DSP architectures requires expertise in parallel programming and optimizing data access patterns. Their specialized microarchitectures have a steeper learning curve than high-level software frameworks. This makes development more complex.\n\n\n\n\n11.3.4 Graphics Processing Units (GPUs)\nThe term graphics processing unit has existed since at least the 1980s. There had always been a demand for graphics hardware in both video game consoles (high demand, needed to be relatively lower cost) and scientific simulations (lower demand, but needed higher resolution, could be at a high price point).\nThe term was popularized, however, in 1999 when NVIDIA launched the GeForce 256 mainly targeting the PC games market sector (Lindholm et al. 2008). As PC games became more sophisticated, NVIDIA GPUs became more programmable over time as well. Soon, users realized they could take advantage of this programmability and run a variety of non-graphics related workloads on GPUs and benefit from the underlying architecture. And so, starting in the late 2000s, GPUs became general-purpose graphics processing units or GP-GPUs.\n\nLindholm, Erik, John Nickolls, Stuart Oberman, and John Montrym. 2008. “NVIDIA Tesla: A Unified Graphics and Computing Architecture.” IEEE Micro 28 (2): 39–55. https://doi.org/10.1109/MM.2008.31.\nIntel (with its Arc Graphics line) and AMD (with Radeon RX) have also developed their GPUs over time.\n\nAdvantages\n\nHigh Computational Throughput\nThe key advantage of GPUs is their ability to perform massively parallel floating point calculations optimized for computer graphics and linear algebra (Raina, Madhavan, and Ng 2009). 
Modern GPUs like Nvidia’s A100 offer up to 19.5 teraflops of FP32 performance with 6912 CUDA cores and 40GB of graphics memory that is tightly coupled with 1.6TB/s of graphics memory bandwidth.\n\nRaina, Rajat, Anand Madhavan, and Andrew Y. Ng. 2009. “Large-Scale Deep Unsupervised Learning Using Graphics Processors.” In Proceedings of the 26th Annual International Conference on Machine Learning, 873–80. Montreal Quebec Canada: ACM. https://doi.org/10.1145/1553374.1553486.\nThis raw throughput stems from the highly parallel streaming multiprocessor (SM) architecture tailored for data-parallel workloads (Zhihao Jia, Zaharia, and Aiken 2019). Each SM contains hundreds of scalar cores optimized for float32/64 math. With thousands of these cores spread across the SMs on a chip, GPUs are purpose-built for matrix multiplication and vector operations used throughout neural networks.\nFor example, Nvidia’s latest H100 GPU provides 4000 TFLOPs of FP8, 2000 TFLOPs of FP16, 1000 TFLOPs of TF32, 67 TFLOPs of FP32 and 34 TFLOPs of FP64 compute performance, which can dramatically accelerate large batch training on models like BERT, GPT-3, and other transformer architectures. The scalable parallelism of GPUs is key to speeding up computationally intensive deep learning.\n\n\nMature Software Ecosystem\nNvidia provides extensive runtime libraries like cuDNN and cuBLAS that are highly optimized for deep learning primitives. Frameworks like TensorFlow and PyTorch integrate with these libraries to enable GPU acceleration with no direct GPU programming. CUDA provides lower-level control for custom computations.\nThis ecosystem enables quickly leveraging GPUs via high-level Python without GPU programming expertise. Known workflows and abstractions provide a convenient on-ramp for scaling up deep learning experiments. The software maturity supplements the throughput advantages.\n\n\nBroad Availability\nThe economies of scale of graphics processing make GPUs broadly accessible in data centers, cloud platforms like AWS and GCP, and desktop workstations. Their availability in research environments has provided a convenient platform for ML experimentation and innovation. For example, nearly every state-of-the-art deep learning result has involved GPU acceleration because of this ubiquity. The broad access supplements the software maturity to make GPUs the standard ML accelerator.\n\n\nProgrammable Architecture\nWhile not as flexible as FPGAs, GPUs do provide programmability via CUDA and shader languages to customize computations. Developers can optimize data access patterns, create new ops, and tune precisions for evolving models and algorithms.\n\n\n\nDisadvantages\nWhile GPUs have become the standard accelerator for deep learning, their architecture also comes with some key downsides.\n\nLess Efficient than Custom ASICs\nThe statement “GPUs are less efficient than ASICs” could spark intense debate within the ML/AI field and cause this book to explode 🤯.\nTypically, GPUs are perceived as less efficient than ASICs because the latter are custom-built for specific tasks and thus can operate more efficiently by design. GPUs, with their general-purpose architecture, are inherently more versatile and programmable, catering to a broad spectrum of computational tasks beyond ML/AI.\nModern GPUs, however, have evolved to include specialized hardware support for essential AI operations, such as generalized matrix multiplication (GEMM) and other matrix operations, which are critical for running ML models effectively. These enhancements have significantly improved the efficiency of GPUs for AI tasks, to the point where they can rival the performance of ASICs for certain applications.
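\nA minimal sketch of how these matrix units are reached from high-level code follows (assuming PyTorch; on a recent NVIDIA GPU the half-precision matrix multiply below is dispatched by cuBLAS onto the GPU’s dedicated matrix hardware, and the sketch falls back to the CPU if no GPU is present):\nimport torch\n\n# Pick the GPU if one is available; otherwise fall back to CPU so the sketch still runs.\ndevice = 'cuda' if torch.cuda.is_available() else 'cpu'\ndtype = torch.float16 if device == 'cuda' else torch.float32\n\n# GEMM is the core primitive behind dense, convolution, and attention layers.\na = torch.randn(4096, 4096, dtype=dtype, device=device)\nb = torch.randn(4096, 4096, dtype=dtype, device=device)\n\nc = a @ b   # dispatched to a cuBLAS GEMM kernel on the GPU\nif device == 'cuda':\n    torch.cuda.synchronize()   # GPU kernels launch asynchronously; wait before inspecting results\nprint(c.shape, c.dtype)\nThe same line of Python runs unchanged on very different silicon, which is part of the convergence of programmability and specialization discussed here.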
\nConsequently, some might argue that contemporary GPUs represent a convergence of sorts, incorporating specialized, ASIC-like capabilities within a flexible, general-purpose processing framework. This adaptability has blurred the lines between the two types of hardware, with GPUs offering a strong balance of specialization and programmability that is well-suited to the dynamic needs of ML/AI research and development.\n\n\nHigh Memory Bandwidth Needs\nThe massively parallel architecture requires tremendous memory bandwidth to supply thousands of cores as shown in Figure 1. For example, the Nvidia A100 GPU requires roughly 1.6TB/sec to fully saturate its compute. Consumer GPUs rely on wide 384-bit memory buses to high bandwidth GDDR6 RAM, but even the fastest GDDR6 tops out around 1 TB/sec, which is why data center GPUs like the A100 turn to stacked HBM instead. Either way, this dependence on external DRAM incurs latency and power overheads.\n\n\nProgramming Complexity\nWhile tools like CUDA help, optimally mapping and partitioning ML workloads across the massively parallel GPU architecture remains challenging. Achieving both high utilization and memory locality requires low-level tuning (Zhe Jia et al. 2018). Abstractions like TensorFlow can leave performance on the table.\n\nJia, Zhe, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. 2018. “Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking.” arXiv. http://arxiv.org/abs/1804.06826.\n\n\nLimited On-Chip Memory\nGPUs have relatively small on-chip memory caches compared to the large working set requirements of ML models during training. They are reliant on high bandwidth access to external DRAM, which ASICs minimize with large on-chip SRAM.\n\n\nFixed Architecture\nUnlike FPGAs, the fundamental GPU architecture cannot be altered post-manufacture. This constraint limits adapting to novel ML workloads or layers. The CPU-GPU boundary also creates data movement overheads.\n\n\nCase Study\nConsider the recent groundbreaking research conducted by OpenAI (Brown et al. 2020) with their GPT-3 model. GPT-3, a language model consisting of 175 billion parameters, demonstrated unprecedented language understanding and generation capabilities. Its training, which would have taken months on conventional CPUs, was accomplished in a matter of days using powerful GPUs, thus pushing the boundaries of natural language processing (NLP) capabilities.\n\n\n\n11.3.5 Central Processing Units (CPUs)\nThe term CPU has a long history that dates back to 1955 (Weik 1955), while the first microprocessor CPU–the Intel 4004–was invented in 1971 (Who Invented the Microprocessor?). Compilers compile high-level programming languages like Python, Java, or C to assembly instructions (x86, ARM, RISC-V, etc.) for CPUs to process. The set of instructions a CPU understands is called the “instruction set” and must be agreed upon by both the hardware and software running atop it (See section 5 for a more in-depth description of instruction set architectures–ISAs).\n\nWeik, Martin H. 1955. A Survey of Domestic Electronic Digital Computing Systems. Ballistic Research Laboratories.\nAn overview of significant developments in CPUs:\n\nSingle-core Era (1950s-2000): This era is known for seeing aggressive microarchitectural improvements. 
Techniques like speculative execution (executing instructions before it is known whether they are needed, for example past a predicted branch), out-of-order execution (re-ordering instructions to keep execution units busy), and wider issue widths (executing multiple instructions at once) were implemented to increase instruction throughput. The term “System on Chip” also originated in this era as different analog components (components designed with transistors) and digital components (components designed with hardware description languages that are mapped to transistors) were put on the same platform to achieve some task.\nMulti-core Era (2000s): Driven by the slowing of Moore’s Law, this era is marked by scaling the number of cores within a CPU. Now tasks can be split across many different cores each with its own datapath and control unit. Many of the issues arising in this era pertained to how to share certain resources, which resources to share, and how to maintain coherency and consistency across all the cores.\nSea of accelerators (2010s): Again, driven by the slowing of Moore’s Law, this era is marked by offloading more complicated tasks to accelerators (widgets) attached to the main datapath in CPUs. It’s common to see accelerators dedicated to various AI workloads, as well as image/digital processing, and cryptography. In these designs, CPUs are often described more as arbiters, deciding which tasks should be processed rather than doing the processing themselves. Any task could still be run on the CPU rather than the accelerators, but the CPU would generally be slower. However, the cost of designing and especially programming the accelerator became a non-trivial hurdle that led to a spike of interest in domain-specific languages (DSLs).\nPresence in data centers: Although we often hear that GPUs dominate the data center market, CPUs are still well suited for tasks that don’t inherently possess a large amount of parallelism. CPUs often handle serial and small tasks and coordinate the data center as a whole.\nOn the edge: Given the tighter resource constraints on the edge, edge CPUs often only implement a subset of the techniques developed in the single-core era because these optimizations tend to be heavy on power and area consumption. Edge CPUs still maintain a relatively simple datapath with limited memory capacities.\n\nTraditionally, CPUs have been synonymous with general-purpose computing–a term that has also changed as the “average” workload a consumer would run changes over time. For example, floating point components were once considered reserved for “scientific computing”, so they were usually implemented as co-processors (modular components that worked in tandem with the datapath) and seldom deployed to average consumers. Compare this attitude to today, where FPUs are built into every datapath.\n\nAdvantages\nWhile limited in raw throughput, general-purpose CPUs do provide some practical benefits for AI acceleration.\n\nGeneral Programmability\nCPUs support diverse workloads beyond ML, providing flexible general-purpose programmability. This versatility comes from their standardized instruction sets and mature compiler ecosystems that allow running any application from databases and web servers to analytics pipelines (Hennessy and Patterson 2019).\n\nHennessy, John L, and David A Patterson. 2019. “A New Golden Age for Computer Architecture.” Commun. ACM 62 (2): 48–60.\nThis avoids the need for dedicated ML accelerators and enables leveraging existing CPU-based infrastructure for basic ML deployment. For example, x86 servers from vendors like Intel and AMD can run common ML frameworks using Python and TensorFlow packages alongside other enterprise workloads.
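\nA minimal sketch of that reuse (assuming a standard NumPy build, which links against an optimized BLAS such as OpenBLAS or Intel MKL on most distributions, so a single call already exercises the CPU’s vector units and all of its cores):\nimport numpy as np\nfrom time import perf_counter\n\n# One matrix multiply; NumPy forwards it to the multithreaded, vectorized BLAS SGEMM routine.\nn = 2048\na = np.random.rand(n, n).astype(np.float32)\nb = np.random.rand(n, n).astype(np.float32)\n\nstart = perf_counter()\nc = a @ b\nelapsed = perf_counter() - start\n\n# Rough effective throughput: an n x n matmul performs about 2 * n**3 floating point operations.\nprint(f'{2 * n**3 / elapsed / 1e9:.1f} GFLOP/s on the CPU')\nNo accelerator, driver, or special toolchain is involved, which is exactly the deployment convenience being described here.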
\n\n\nMature Software Ecosystem\nFor decades, highly optimized math libraries like BLAS, LAPACK, and FFTW have leveraged vectorized instructions and multithreading on CPUs (Dongarra 2009). Major ML frameworks like PyTorch, TensorFlow, and SciKit-Learn are designed to integrate seamlessly with these CPU math kernels.\n\nDongarra, Jack J. 2009. “The Evolution of High Performance Computing on System z.” IBM Journal of Research and Development 53: 3–4.\nHardware vendors like Intel and AMD also provide low-level libraries to fully optimize performance for deep learning primitives (AI Inference Acceleration on CPUs). This robust, mature software ecosystem allows quickly deploying ML on existing CPU infrastructure.\n\n\nWide Availability\nThe economies of scale of CPU manufacturing, driven by demand across many markets like PCs, servers, and mobile, make them ubiquitously available. Intel CPUs, for example, have powered most servers for decades (Ranganathan 2011). This wide availability in data centers reduces hardware costs for basic ML deployment.\n\nRanganathan, Parthasarathy. 2011. “From Microprocessors to Nanostores: Rethinking Data-Centric Systems.” Computer (Long Beach Calif.) 44 (1): 39–48.\nEven small embedded devices typically integrate some CPU, enabling edge inference. This ubiquity reduces the need to purchase specialized ML accelerators in many situations.\n\n\nLow Power for Inference\nOptimizations like vector extensions in ARM Neon and Intel AVX provide power efficient integer and floating point throughput optimized for “bursty” workloads like inference (Ignatov et al. 2018a). While slower than GPUs, CPU inference can be deployed in power-constrained environments. For example, ARM’s Cortex-M CPUs now deliver over 1 TOPS of INT8 performance under 1W, enabling keyword spotting and vision applications on edge devices (ARM).\n\nIgnatov, Andrey, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, and Luc Van Gool. 2018a. “AI Benchmark: Running Deep Neural Networks on Android Smartphones.”\n\n\n\nDisadvantages\nWhile providing some advantages, general-purpose CPUs also come with limitations for AI workloads.\n\nLower Throughput than Accelerators\nCPUs lack the specialized architectures for massively parallel processing that GPUs and other accelerators provide. Their general-purpose design results in lower computational throughput for the highly parallelizable math operations common in ML models (Norman P. Jouppi et al. 2017b).\n\nJouppi, Norman P, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, et al. 2017b. “In-Datacenter Performance Analysis of a Tensor Processing Unit.” In Proceedings of the 44th Annual International Symposium on Computer Architecture, 1–12.\n\n\nNot Optimized for Data Parallelism\nThe architectures of CPUs are not specifically optimized for data parallel workloads inherent to AI (Sze et al. 2017a). They allocate substantial silicon area to instruction decoding, speculative execution, caching, and flow control that provide little benefit for the array operations used in neural networks (AI Inference Acceleration on CPUs).\n\nSze, Vivienne, Yu-Hsin Chen, Tien-Ju Yang, and Joel Emer. 2017a. “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” March. 
https://arxiv.org/abs/1703.09039.\nGPU streaming multiprocessors, for example, devote most transistors to floating point units instead of complex branch prediction logic. This specialization allows much higher utilization for ML math.\n\n\nHigher Memory Latency\nCPUs suffer from higher latency accessing main memory relative to GPUs and other accelerators (DDR). Techniques like tiling and caching can help, but the physical separation from off-chip RAM bottlenecks data-intensive ML workloads. This emphasizes the need for specialized memory architectures in ML hardware.\n\n\nPower Inefficiency Under Heavy Workloads\nWhile suitable for intermittent inference, sustaining near-peak throughput for training results in inefficient power consumption on CPUs, especially mobile CPUs (Ignatov et al. 2018b). Accelerators explicitly optimize the dataflow, memory, and computation for sustained ML workloads. For training large models, CPUs are energy-inefficient.\n\n\n\n11.3.6 Comparison\n\n\n\n\n\n\n\n\n\nAccelerator\nDescription\nKey Advantages\nKey Disadvantages\n\n\n\n\nASICs\nCustom ICs designed for target workload like AI inference\n- Maximizes perf/watt - Optimized for tensor ops - Low latency on-chip memory\n- Fixed architecture lacks flexibility - High NRE cost - Long design cycles\n\n\nFPGAs\nReconfigurable fabric with programmable logic and routing\n- Flexible architecture - Low latency memory access\n- Lower perf/watt than ASICs - Complex programming\n\n\nGPUs\nOriginally for graphics, now used for neural network acceleration\n- High throughput - Parallel scalability - Software ecosystem with CUDA\n- Not as power efficient as ASICs - Require high memory bandwidth\n\n\nCPUs\nGeneral purpose processors\n- Programmability - Ubiquitous availability\n- Lower performance for AI workloads\n\n\n\nIn general, CPUs provide a readily available baseline, GPUs deliver broadly accessible acceleration, FPGAs offer programmability, and ASICs maximize efficiency for fixed functions. The optimal choice depends on the scale, cost, flexibility, and other requirements of the target application.\nAlthough TPUs were first developed for data center deployment, Google has also put considerable effort into developing Edge TPUs. These Edge TPUs maintain the inspiration from systolic arrays but are tailored to the limited resources accessible at the edge."
},
{
"objectID": "hw_acceleration.html#hardware-software-co-design",
@@ -669,7 +669,7 @@
"href": "hw_acceleration.html#future-trends",
"title": "11 AI Acceleration",
"section": "11.9 Future Trends",
- "text": "11.9 Future Trends\nThus far in this chapter, we have primarily explored how to design specialized hardware that is optimized for machine learning workloads and algorithms. For example, we discussed how GPUs and TPUs have architectures tailored for neural network training and inference. However, we have not yet discussed an emerging and exciting area - using machine learning to aid in the hardware design process itself.\nThe hardware design process involves many complex stages, including specification, high-level modeling, simulation, synthesis, verification, prototyping, and fabrication. Traditionally, much of this process requires extensive human expertise, effort, and time. However, recent advances in machine learning are enabling parts of the hardware design workflow to be automated and enhanced using ML techniques.\nSome examples of how ML is transforming hardware design include:\n\nAutomated circuit synthesis using reinforcement learning: Rather than hand-crafting transistor-level designs, ML agents can learn to connect logic gates and generate circuit layouts automatically. This can accelerate the time-consuming syntheses process.\nML-based hardware simulation and emulation: Deep neural network models can be trained to predict how a hardware design will perform under different conditions. This allows fast and accurate simulation compared to traditional RTL simulations.\nAutomated chip floorplanning using ML algorithms: Chip floorplanning, which involves optimally placing different components on a die, can leverage genetic algorithms and ML to explore floorplan options. This can lead to performance improvements.\nML-driven architecture optimization: Novel hardware architectures, like those for efficient ML accelerators, can be automatically generated and optimized using neural architecture search techniques. This expands the architectural design space.\n\nApplying ML to hardware design automation holds enormous promise to make the process faster, cheaper, and more efficient. It opens up design possibilities that would be extremely difficult through manual design. The use of ML in hardware design is an area of active research and early deployment, and we will study the techniques involved and their transformative potential.\n\n11.9.1 ML for Hardware Design Automation\nA major opportunity for machine learning in hardware design is automating parts of the complex and tedious design workflow. Hardware design automation (HDA) broadly refers to using ML techniques like reinforcement learning, genetic algorithms, and neural networks to automate tasks like synthesis, verification, floorplanning, and more. A few examples of where ML for HDA shows real promise:\n\nAutomated circuit synthesis: Circuit synthesis involves converting a high-level description of desired logic into an optimized gate-level netlist implementation. This complex process has many design considerations and tradeoffs. ML agents can be trained through reinforcement learning to explore the design space and output optimized syntheses automatically. Startups like Symbiotic EDA are bringing this technology to market.\nAutomated chip floorplanning: Floorplanning refers to strategically placing different components on a chip die area. ML techniques like genetic algorithms can be used to automate floorplan optimization to minimize wire length, power consumption, and other objectives. 
This is extremely valuable as chip complexity increases.\nML hardware simulators: Training deep neural network models to predict how hardware designs will perform as simulators can accelerate the simulation process by over 100x compared to traditional RTL simulations.\nAutomated code translation: Converting hardware description languages like Verilog to optimized RTL implementations is critical but time-consuming. ML models can be trained to act as translator agents and automate parts of this process.\n\nThe benefits of HDA using ML are reduced design time, superior optimizations, and exploration of design spaces too complex for manual approaches. This can accelerate hardware development and lead to better designs.\nChallenges include limits of ML generalization, the black-box nature of some techniques, and accuracy tradeoffs. But research is rapidly advancing to address these issues and make HDA ML solutions robust and reliable for production use. HDA provides a major avenue for ML to transform hardware design.\n\n\n11.9.2 ML-Based Hardware Simulation and Verification\nSimulating and verifying hardware designs is critical before manufacturing to ensure the design behaves as intended. Traditional approaches like register-transfer level (RTL) simulation are complex and time-consuming. ML introduces new opportunities to enhance hardware simulation and verification. Some examples include:\n\nSurrogate modeling for simulation: Highly accurate surrogate models of a design can be built using neural networks. These models predict outputs from inputs much faster than RTL simulation, enabling fast design space exploration. Companies like Ansys use this technique.\nML simulators: Large neural network models can be trained on RTL simulations to learn to mimic the functionality of a hardware design. Once trained, the NN model can act as a highly efficient simulator to use for regression testing and other tasks. Graphcore has demonstrated over 100x speedup with this approach.\nFormal verification using ML: Formal verification mathematically proves properties about a design. ML techniques can help generate verification properties and can learn to solve the complex formal proofs needed. This automates parts of this challenging process. Startups like Cortical.io are bringing ML formal verification solutions to market.\nBug detection: ML models can be trained to process hardware designs and identify potential issues. This assists human designers in inspecting complex designs and finding bugs. Facebook has shown bug detection models for their server hardware.\n\nThe key benefits of applying ML to simulation and verification are faster design validation turnaround times, more rigorous testing, and reduced human effort. Challenges include verifying ML model correctness and handling corner cases. ML promises to significantly accelerate testing workflows.\n\n\n11.9.3 ML for Efficient Hardware Architectures\nDesigning hardware architectures optimized for performance, power, and efficiency is a key goal. ML introduces new techniques to automate and enhance architecture design space exploration for both general-purpose and specialized hardware like ML accelerators. Some promising examples include:\n\nNeural architecture search for hardware: Search techniques like evolutionary algorithms can automatically generate novel hardware architectures by mutating and mixing design attributes like cache size, number of parallel units, memory bandwidth, and so on. 
This expands the design space beyond human limitations.\nML-based architecture optimizers: ML agents can be trained with reinforcement learning to tweak architectures to optimize for desired objectives like throughput or power. The agent explores the space of possible configurations to find high-performing, efficient designs.\nPredictive modeling for optimization: - ML models can be trained to predict hardware performance, power, and efficiency metrics for a given architecture. These become “surrogate models” for fast optimization and space exploration by substituting lengthy simulations.\nSpecialized accelerator optimization: - For specialized chips like tensor processing units for AI, automated architecture search techniques based on ML/evolutionary algorithms show promise for finding fast, efficient designs.\n\nThe benefits of using ML include superior design space exploration, automated optimization, and reduced manual effort. Challenges include long training times for some techniques and local optima limitations. But ML for hardware architecture holds great potential for unlocking performance and efficiency gains.\n\n\n11.9.4 ML to Optimize Manufacturing and Reduce Defects\nOnce a hardware design is complete, it moves to manufacturing. But variability and defects during manufacturing can impact yields and quality. ML techniques are now being applied to improve fabrication processes and reduce defects. Some examples include:\n\nPredictive maintenance: ML models can analyze equipment sensor data over time and identify signals that predict maintenance needs before failure. This enables proactive upkeep that can come in very handy in the costly fabrication process.\nProcess optimization: Supervised learning models can be trained on process data to identify factors that lead to low yields. The models can then optimize parameters to improve yields, throughput, or consistency.\nYield prediction: By analyzing test data from fabricated designs using techniques like regression trees, ML models can predict yields early in production. This allows process adjustments.\nDefect detection: Computer vision ML techniques can be applied to images of designs to identify defects invisible to the human eye. This enables precision quality control and root cause analysis.\nProactive failure analysis: - By analyzing structured and unstructured process data, ML models can help predict, diagnose, and prevent issues that lead to downstream defects and failures.\n\nApplying ML to manufacturing enables process optimization, real-time quality control, predictive maintenance, and ultimately higher yields. Challenges include managing complex manufacturing data and variations. But ML is poised to transform semiconductor manufacturing.\n\n\n11.9.5 Toward Foundation Models for Hardware Design\nAs we have seen, machine learning is opening up new possibilities across the hardware design workflow, from specification to manufacturing. However, current ML techniques are still narrow in scope and require extensive domain-specific engineering. The long-term vision is the development of general artificial intelligence systems that can be applied with versatility across hardware design tasks.\nTo fully realize this vision, investment and research are needed to develop foundation models for hardware design. 
These are unified, general-purpose ML models and architectures that can learn complex hardware design skills with the right training data and objectives.\nRealizing foundation models for end-to-end hardware design will require:\n\nAccumulation of large, high-quality, labeled datasets across hardware design stages to train foundation models.\nAdvances in multi-modal, multi-task ML techniques to handle the diversity of hardware design data and tasks.\nInterfaces and abstraction layers to connect foundation models to existing design flows and tools.\nDevelopment of simulation environments and benchmarks to train and test foundation models on hardware design capabilities.\nMethods to explain and interpret the design decisions and optimizations made by ML models for trust and verification.\nCompilation techniques to optimize foundation models for efficient deployment across hardware platforms.\n\nWhile significant research remains, foundation models represent the most transformative long-term goal for imbuing AI into the hardware design process. Democratizing hardware design via versatile, automated ML systems promises to unlock a new era of optimized, efficient, and innovative chip design. The journey ahead is filled with open challenges and opportunities.\nWe encourage you to read Architecture 2.0 if ML-aided computer architecture design interests you. Alternatively, you can watch the below video."
+ "text": "11.9 Future Trends\nThus far in this chapter, we have primarily explored how to design specialized hardware that is optimized for machine learning workloads and algorithms. For example, we discussed how GPUs and TPUs have architectures tailored for neural network training and inference. However, we have not yet discussed an emerging and exciting area - using machine learning to aid in the hardware design process itself.\nThe hardware design process involves many complex stages, including specification, high-level modeling, simulation, synthesis, verification, prototyping, and fabrication. Traditionally, much of this process requires extensive human expertise, effort, and time. However, recent advances in machine learning are enabling parts of the hardware design workflow to be automated and enhanced using ML techniques.\nSome examples of how ML is transforming hardware design include:\n\nAutomated circuit synthesis using reinforcement learning: Rather than hand-crafting transistor-level designs, ML agents can learn to connect logic gates and generate circuit layouts automatically. This can accelerate the time-consuming syntheses process.\nML-based hardware simulation and emulation: Deep neural network models can be trained to predict how a hardware design will perform under different conditions. This allows fast and accurate simulation compared to traditional RTL simulations.\nAutomated chip floorplanning using ML algorithms: Chip floorplanning, which involves optimally placing different components on a die, can leverage genetic algorithms and ML to explore floorplan options. This can lead to performance improvements.\nML-driven architecture optimization: Novel hardware architectures, like those for efficient ML accelerators, can be automatically generated and optimized using neural architecture search techniques. This expands the architectural design space.\n\nApplying ML to hardware design automation holds enormous promise to make the process faster, cheaper, and more efficient. It opens up design possibilities that would be extremely difficult through manual design. The use of ML in hardware design is an area of active research and early deployment, and we will study the techniques involved and their transformative potential.\n\n11.9.1 ML for Hardware Design Automation\nA major opportunity for machine learning in hardware design is automating parts of the complex and tedious design workflow. Hardware design automation (HDA) broadly refers to using ML techniques like reinforcement learning, genetic algorithms, and neural networks to automate tasks like synthesis, verification, floorplanning, and more. A few examples of where ML for HDA shows real promise:\n\nAutomated circuit synthesis: Circuit synthesis involves converting a high-level description of desired logic into an optimized gate-level netlist implementation. This complex process has many design considerations and tradeoffs. ML agents can be trained through reinforcement learning to explore the design space and output optimized syntheses automatically. Startups like Symbiotic EDA are bringing this technology to market.\nAutomated chip floorplanning: Floorplanning refers to strategically placing different components on a chip die area. ML techniques like genetic algorithms can be used to automate floorplan optimization to minimize wire length, power consumption, and other objectives. 
\n\nThe benefits of HDA using ML are reduced design time, superior optimizations, and exploration of design spaces too complex for manual approaches. This can accelerate hardware development and lead to better designs.\nChallenges include limits of ML generalization, the black-box nature of some techniques, and accuracy tradeoffs. But research is rapidly advancing to address these issues and make HDA ML solutions robust and reliable for production use. HDA provides a major avenue for ML to transform hardware design.\n\n\n11.9.2 ML-Based Hardware Simulation and Verification\nSimulating and verifying hardware designs is critical before manufacturing to ensure the design behaves as intended. Traditional approaches like register-transfer level (RTL) simulation are complex and time-consuming. ML introduces new opportunities to enhance hardware simulation and verification. Some examples include:\n\nSurrogate modeling for simulation: Highly accurate surrogate models of a design can be built using neural networks. These models predict outputs from inputs much faster than RTL simulation, enabling fast design space exploration. Companies like Ansys use this technique.\nML simulators: Large neural network models can be trained on RTL simulations to learn to mimic the functionality of a hardware design. Once trained, the NN model can act as a highly efficient simulator to use for regression testing and other tasks. Graphcore has demonstrated over 100x speedup with this approach.\nFormal verification using ML: Formal verification mathematically proves properties about a design. ML techniques can help generate verification properties and can learn to solve the complex formal proofs needed. This automates parts of this challenging process. Startups like Cortical.io are bringing ML formal verification solutions to market.\nBug detection: ML models can be trained to process hardware designs and identify potential issues. This assists human designers in inspecting complex designs and finding bugs. Facebook has shown bug detection models for their server hardware.\n\nThe key benefits of applying ML to simulation and verification are faster design validation turnaround times, more rigorous testing, and reduced human effort. Challenges include verifying ML model correctness and handling corner cases. ML promises to significantly accelerate testing workflows.\n\n\n11.9.3 ML for Efficient Hardware Architectures\nDesigning hardware architectures optimized for performance, power, and efficiency is a key goal. ML introduces new techniques to automate and enhance architecture design space exploration for both general-purpose and specialized hardware like ML accelerators. Some promising examples include:\n\nNeural architecture search for hardware: Search techniques like evolutionary algorithms can automatically generate novel hardware architectures by mutating and mixing design attributes like cache size, number of parallel units, memory bandwidth, and so on. 
This expands the design space beyond human limitations.\nML-based architecture optimizers: ML agents can be trained with reinforcement learning to tweak architectures to optimize for desired objectives like throughput or power. The agent explores the space of possible configurations to find high-performing, efficient designs.\nPredictive modeling for optimization: ML models can be trained to predict hardware performance, power, and efficiency metrics for a given architecture. These become “surrogate models” for fast optimization and space exploration by substituting lengthy simulations (a toy sketch follows at the end of this subsection).\nSpecialized accelerator optimization: For specialized chips like tensor processing units for AI, automated architecture search techniques based on ML/evolutionary algorithms show promise for finding fast, efficient designs.\n\nThe benefits of using ML include superior design space exploration, automated optimization, and reduced manual effort. Challenges include long training times for some techniques and local optima limitations. But ML for hardware architecture holds great potential for unlocking performance and efficiency gains.
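\nAs a toy sketch of the surrogate-modeling idea (assuming scikit-learn and NumPy; the simulated latencies below are invented stand-ins for slow cycle-accurate simulation results):\nimport numpy as np\nfrom sklearn.ensemble import RandomForestRegressor\n\nrng = np.random.default_rng(0)\n\n# Candidate accelerator configurations: (cache KB, parallel MAC units, memory bandwidth GB/s).\nlow, high = [64, 8, 25], [4096, 512, 400]\nconfigs = rng.uniform(low, high, size=(200, 3))\n\n# Stand-in for expensive simulation: latency improves with resources, with diminishing returns plus noise.\nlatency = 1e4 / (np.log2(configs[:, 0]) * np.sqrt(configs[:, 1]) * configs[:, 2] ** 0.3)\nlatency += rng.normal(0, 0.5, size=latency.shape)\n\n# Fit the surrogate on a small number of simulated points, then rank a huge sweep cheaply.\nsurrogate = RandomForestRegressor(n_estimators=100, random_state=0).fit(configs[:50], latency[:50])\nsweep = rng.uniform(low, high, size=(10000, 3))\nbest = sweep[np.argmin(surrogate.predict(sweep))]\nprint('promising config (cache KB, MACs, GB/s):', best.round(1))\nThe surrogate replaces most of the slow simulations during exploration; only the shortlisted candidates need to be simulated for real, which is where the time savings come from.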
\n\n\n11.9.4 ML to Optimize Manufacturing and Reduce Defects\nOnce a hardware design is complete, it moves to manufacturing. But variability and defects during manufacturing can impact yields and quality. ML techniques are now being applied to improve fabrication processes and reduce defects. Some examples include:\n\nPredictive maintenance: ML models can analyze equipment sensor data over time and identify signals that predict maintenance needs before failure. This enables proactive upkeep that can come in very handy in the costly fabrication process.\nProcess optimization: Supervised learning models can be trained on process data to identify factors that lead to low yields. The models can then optimize parameters to improve yields, throughput, or consistency.\nYield prediction: By analyzing test data from fabricated designs using techniques like regression trees, ML models can predict yields early in production. This allows process adjustments.\nDefect detection: Computer vision ML techniques can be applied to images of designs to identify defects invisible to the human eye. This enables precision quality control and root cause analysis.\nProactive failure analysis: By analyzing structured and unstructured process data, ML models can help predict, diagnose, and prevent issues that lead to downstream defects and failures.\n\nApplying ML to manufacturing enables process optimization, real-time quality control, predictive maintenance, and ultimately higher yields. Challenges include managing complex manufacturing data and variations. But ML is poised to transform semiconductor manufacturing.\n\n\n11.9.5 Toward Foundation Models for Hardware Design\nAs we have seen, machine learning is opening up new possibilities across the hardware design workflow, from specification to manufacturing. However, current ML techniques are still narrow in scope and require extensive domain-specific engineering. The long-term vision is the development of general artificial intelligence systems that can be applied with versatility across hardware design tasks.\nTo fully realize this vision, investment and research are needed to develop foundation models for hardware design. These are unified, general-purpose ML models and architectures that can learn complex hardware design skills with the right training data and objectives.\nRealizing foundation models for end-to-end hardware design will require:\n\nAccumulation of large, high-quality, labeled datasets across hardware design stages to train foundation models.\nAdvances in multi-modal, multi-task ML techniques to handle the diversity of hardware design data and tasks.\nInterfaces and abstraction layers to connect foundation models to existing design flows and tools.\nDevelopment of simulation environments and benchmarks to train and test foundation models on hardware design capabilities.\nMethods to explain and interpret the design decisions and optimizations made by ML models for trust and verification.\nCompilation techniques to optimize foundation models for efficient deployment across hardware platforms.\n\nWhile significant research remains, foundation models represent the most transformative long-term goal for imbuing AI into the hardware design process. Democratizing hardware design via versatile, automated ML systems promises to unlock a new era of optimized, efficient, and innovative chip design. The journey ahead is filled with open challenges and opportunities.\nWe encourage you to read Architecture 2.0 if ML-aided computer architecture design (Krishnan et al. 2023) interests you. Alternatively, you can watch the video below.\n\nKrishnan, Srivatsan, Amir Yazdanbakhsh, Shvetank Prakash, Jason Jabbour, Ikechukwu Uchendu, Susobhan Ghosh, Behzad Boroujerdian, et al. 2023. “ArchGym: An Open-Source Gymnasium for Machine Learning Assisted Architecture Design.” In Proceedings of the 50th Annual International Symposium on Computer Architecture, 1–16."
},
{
"objectID": "hw_acceleration.html#conclusion",
@@ -1607,7 +1607,7 @@
"href": "references.html",
"title": "References",
"section": "",
- "text": "Abadi, Martin, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya\nMironov, Kunal Talwar, and Li Zhang. 2016. “Deep Learning with\nDifferential Privacy.” In Proceedings of the 2016 ACM SIGSAC\nConference on Computer and Communications Security, 308–18.\n\n\nAbadi, Martı́n, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis,\nJeffrey Dean, Matthieu Devin, et al. 2016. “{TensorFlow}: A System for {Large-Scale} Machine Learning.” In 12th\nUSENIX Symposium on Operating Systems Design and Implementation (OSDI\n16), 265–83.\n\n\nAdolf, Robert, Saketh Rama, Brandon Reagen, Gu-Yeon Wei, and David\nBrooks. 2016. “Fathom: Reference Workloads for Modern Deep\nLearning Methods.” In 2016 IEEE International Symposium on\nWorkload Characterization (IISWC), 1–10. IEEE.\n\n\nAledhari, Mohammed, Rehma Razzak, Reza M. Parizi, and Fahad Saeed. 2020.\n“Federated Learning: A Survey on Enabling Technologies, Protocols,\nand Applications.” IEEE Access 8: 140699–725. https://doi.org/10.1109/access.2020.3013541.\n\n\nAltayeb, Moez, Marco Zennaro, and Marcelo Rovai. 2022.\n“Classifying Mosquito Wingbeat Sound Using TinyML.” In\nProceedings of the 2022 ACM Conference on Information Technology for\nSocial Good, 132–37.\n\n\nAntol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv\nBatra, C Lawrence Zitnick, and Devi Parikh. 2015. “Vqa: Visual\nQuestion Answering.” In Proceedings of the IEEE International\nConference on Computer Vision, 2425–33.\n\n\nArdila, Rosana, Megan Branson, Kelly Davis, Michael Henretty, Michael\nKohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers,\nand Gregor Weber. 2020. “Common Voice: A Massively-Multilingual\nSpeech Corpus.” Proceedings of the 12th Conference on\nLanguage Resources and Evaluation, May, 4218–22.\n\n\nARM.com. n.d. “The Future Is Being Built on Arm: Market\nDiversification Continues to Drive Strong Royalty and Licensing Growth\nas Ecosystem Reaches Quarter of a Trillion Chips Milestone –\nArm®.” https://www.arm.com/company/news/2023/02/arm-announces-q3-fy22-results.\n\n\nBains, Sunny. 2020. “The Business of Building Brains.”\nNat. Electron 3 (7): 348–51.\n\n\nBamoumen, Hatim, Anas Temouden, Nabil Benamar, and Yousra Chtouki. 2022.\n“How TinyML Can Be Leveraged to Solve Environmental Problems: A\nSurvey.” In 2022 International Conference on Innovation and\nIntelligence for Informatics, Computing, and Technologies (3ICT),\n338–43. IEEE.\n\n\nBanbury, Colby R, Vijay Janapa Reddi, Max Lam, William Fu, Amin Fazel,\nJeremy Holleman, Xinyuan Huang, et al. 2020. “Benchmarking Tinyml\nSystems: Challenges and Direction.” arXiv Preprint\narXiv:2003.04821.\n\n\nBank, Dor, Noam Koenigstein, and Raja Giryes. 2023.\n“Autoencoders.” Machine Learning for Data Science\nHandbook: Data Mining and Knowledge Discovery Handbook, 353–74.\n\n\nBarroso, Luiz André, Urs Hölzle, and Parthasarathy Ranganathan. 2019.\nThe Datacenter as a Computer: Designing Warehouse-Scale\nMachines. Springer Nature.\n\n\nBender, Emily M., and Batya Friedman. 2018. “Data Statements for\nNatural Language Processing: Toward Mitigating System Bias and Enabling\nBetter Science.” Transactions of the Association for\nComputational Linguistics 6: 587–604. https://doi.org/10.1162/tacl_a_00041.\n\n\nBenmeziane, Hadjer, Kaoutar El Maghraoui, Hamza Ouarnoughi, Smail Niar,\nMartin Wistuba, and Naigang Wang. 2021. 
“Hardware-Aware Neural\nArchitecture Search: Survey and Taxonomy.” In Proceedings of\nthe Thirtieth International Joint Conference on Artificial Intelligence,\nIJCAI-21, edited by Zhi-Hua Zhou, 4322–29.\nInternational Joint Conferences on Artificial Intelligence Organization.\nhttps://doi.org/10.24963/ijcai.2021/592.\n\n\nBeyer, Lucas, Olivier J Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and\nAäron van den Oord. 2020. “Are We Done with Imagenet?”\narXiv Preprint arXiv:2006.07159.\n\n\nBiggs, John, James Myers, Jedrzej Kufel, Emre Ozer, Simon Craske, Antony\nSou, Catherine Ramsdale, Ken Williamson, Richard Price, and Scott White.\n2021. “A Natively Flexible 32-Bit Arm Microprocessor.”\nNature 595 (7868): 532–36.\n\n\nBinkert, Nathan, Bradford Beckmann, Gabriel Black, Steven K Reinhardt,\nAli Saidi, Arkaprava Basu, Joel Hestness, et al. 2011. “The Gem5\nSimulator.” ACM SIGARCH Computer Architecture News 39\n(2): 1–7.\n\n\nBrown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan,\nPrafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language\nModels Are Few-Shot Learners.” Advances in Neural Information\nProcessing Systems 33: 1877–1901.\n\n\nBurr, Geoffrey W, Matthew J Brightsky, Abu Sebastian, Huai-Yu Cheng,\nJau-Yi Wu, Sangbum Kim, Norma E Sosa, et al. 2016. “Recent\nProgress in Phase-Change Memory Technology.” IEEE Journal on\nEmerging and Selected Topics in Circuits and Systems 6 (2): 146–62.\n\n\nCai, Han, Chuang Gan, Ligeng Zhu, and Song Han. 2020. “Tinytl:\nReduce Memory, Not Parameters for Efficient on-Device Learning.”\nAdvances in Neural Information Processing Systems 33: 11285–97.\n\n\nCai, Han, Ligeng Zhu, and Song Han. 2018. “Proxylessnas: Direct\nNeural Architecture Search on Target Task and Hardware.”\narXiv Preprint arXiv:1812.00332.\n\n\nChapelle, O., B. Scholkopf, and A. Zien Eds. 2009.\n“Semi-Supervised Learning (Chapelle, o. Et Al., Eds.; 2006) [Book\nReviews].” IEEE Transactions on Neural Networks 20 (3):\n542–42. https://doi.org/10.1109/tnn.2009.2015974.\n\n\nChen, Tianqi, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan,\nHaichen Shen, Meghan Cowan, et al. 2018. “{TVM}: An\nAutomated {End-to-End} Optimizing Compiler for Deep\nLearning.” In 13th USENIX Symposium on Operating Systems\nDesign and Implementation (OSDI 18), 578–94.\n\n\nChen, Tianqi, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016.\n“Training Deep Nets with Sublinear Memory Cost.” arXiv\nPreprint arXiv:1604.06174.\n\n\nChen, Zhiyong, and Shugong Xu. 2023. “Learning\nDomain-Heterogeneous Speaker Recognition Systems with Personalized\nContinual Federated Learning.” EURASIP Journal on Audio,\nSpeech, and Music Processing 2023 (1): 33.\n\n\nChen (陈新宇), Xinyu. 2022. “Inpainting Fluid\nDynamics with Tensor\nDecomposition (NumPy).”\nMedium. https://medium.com/@xinyu.chen/inpainting-fluid-dynamics-with-tensor-decomposition-numpy-d84065fead4d.\n\n\nCheng, Yu, Duo Wang, Pan Zhou, and Tao Zhang. 2017. “A Survey of\nModel Compression and Acceleration for Deep Neural Networks.”\narXiv Preprint arXiv:1710.09282.\n\n\nChi, Ping, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu,\nYu Wang, and Yuan Xie. 2016. “Prime: A Novel Processing-in-Memory\nArchitecture for Neural Network Computation in Reram-Based Main\nMemory.” ACM SIGARCH Computer Architecture News 44 (3):\n27–39.\n\n\nChollet, François. 2018. 
“Introduction to Keras.” March\n9th.\n\n\nChu, Grace, Okan Arikan, Gabriel Bender, Weijun Wang, Achille Brighton,\nPieter-Jan Kindermans, Hanxiao Liu, Berkin Akin, Suyog Gupta, and Andrew\nHoward. 2021. “Discovering Multi-Hardware Mobile Models via\nArchitecture Search.” In Proceedings of the IEEE/CVF\nConference on Computer Vision and Pattern Recognition, 3022–31. https://arxiv.org/abs/2008.08178.\n\n\nChua, Leon. 1971. “Memristor-the Missing Circuit Element.”\nIEEE Transactions on Circuit Theory 18 (5): 507–19.\n\n\nColeman, Cody, Edward Chou, Julian Katz-Samuels, Sean Culatana, Peter\nBailis, Alexander C Berg, Robert Nowak, Roshan Sumbaly, Matei Zaharia,\nand I Zeki Yalniz. 2022. “Similarity Search for Efficient Active\nLearning and Search of Rare Concepts.” In Proceedings of the\nAAAI Conference on Artificial Intelligence, 36:6402–10. 6.\n\n\nColeman, Cody, Deepak Narayanan, Daniel Kang, Tian Zhao, Jian Zhang,\nLuigi Nardi, Peter Bailis, Kunle Olukotun, Chris Ré, and Matei Zaharia.\n2017. “Dawnbench: An End-to-End Deep Learning Benchmark and\nCompetition.” Training 100 (101): 102.\n\n\nDavid, Robert, Jared Duke, Advait Jain, Vijay Janapa Reddi, Nat\nJeffries, Jian Li, Nick Kreeger, et al. 2021. “Tensorflow Lite\nMicro: Embedded Machine Learning for Tinyml Systems.”\nProceedings of Machine Learning and Systems 3: 800–811.\n\n\nDavies, Mike, Narayan Srinivasa, Tsung-Han Lin, Gautham Chinya,\nYongqiang Cao, Sri Harsha Choday, Georgios Dimou, et al. 2018.\n“Loihi: A Neuromorphic Manycore Processor with on-Chip\nLearning.” Ieee Micro 38 (1): 82–99.\n\n\nDavies, Mike, Andreas Wild, Garrick Orchard, Yulia Sandamirskaya,\nGabriel A Fonseca Guerra, Prasad Joshi, Philipp Plank, and Sumedh R\nRisbud. 2021. “Advancing Neuromorphic Computing with Loihi: A\nSurvey of Results and Outlook.” Proceedings of the IEEE\n109 (5): 911–34.\n\n\nDean, Jeffrey, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark\nMao, Marc’aurelio Ranzato, et al. 2012. “Large Scale Distributed\nDeep Networks.” Advances in Neural Information Processing\nSystems 25.\n\n\nDeng, Jia, R. Socher, Li Fei-Fei, Wei Dong, Kai Li, and Li-Jia Li. 2009.\n“ImageNet: A Large-Scale Hierarchical Image Database.” In\n2009 IEEE Conference on Computer Vision and Pattern\nRecognition(CVPR), 00:248–55. https://doi.org/10.1109/CVPR.2009.5206848.\n\n\nDesai, Tanvi, Felix Ritchie, Richard Welpton, et al. 2016. “Five\nSafes: Designing Data Access for Research.” Economics Working\nPaper Series 1601: 28.\n\n\nDevlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018.\n“Bert: Pre-Training of Deep Bidirectional Transformers for\nLanguage Understanding.” arXiv Preprint\narXiv:1810.04805.\n\n\nDhar, Sauptik, Junyao Guo, Jiayi Liu, Samarth Tripathi, Unmesh Kurup,\nand Mohak Shah. 2021. “A Survey of on-Device Machine Learning: An\nAlgorithms and Learning Theory Perspective.” ACM Transactions\non Internet of Things 2 (3): 1–49.\n\n\nDong, Xin, Barbara De Salvo, Meng Li, Chiao Liu, Zhongnan Qu, H. T.\nKung, and Ziyun Li. 2022. “SplitNets: Designing Neural\nArchitectures for Efficient Distributed Computing on Head-Mounted\nSystems.” https://arxiv.org/abs/2204.04705.\n\n\nDongarra, Jack J. 2009. “The Evolution of High Performance\nComputing on System z.” IBM Journal of Research and\nDevelopment 53: 3–4.\n\n\nDuarte, Javier, Nhan Tran, Ben Hawks, Christian Herwig, Jules Muhizi,\nShvetank Prakash, and Vijay Janapa Reddi. 2022. 
“FastML Science\nBenchmarks: Accelerating Real-Time Scientific Edge Machine\nLearning.” arXiv Preprint arXiv:2207.07958.\n\n\nDuisterhof, Bardienus P, Srivatsan Krishnan, Jonathan J Cruz, Colby R\nBanbury, William Fu, Aleksandra Faust, Guido CHE de Croon, and Vijay\nJanapa Reddi. 2019. “Learning to Seek: Autonomous Source Seeking\nwith Deep Reinforcement Learning Onboard a Nano Drone\nMicrocontroller.” arXiv Preprint arXiv:1909.11236.\n\n\nDuisterhof, Bardienus P, Shushuai Li, Javier Burgués, Vijay Janapa\nReddi, and Guido CHE de Croon. 2021. “Sniffy Bug: A Fully\nAutonomous Swarm of Gas-Seeking Nano Quadcopters in Cluttered\nEnvironments.” In 2021 IEEE/RSJ International Conference on\nIntelligent Robots and Systems (IROS), 9099–9106. IEEE.\n\n\nDwork, Cynthia, Aaron Roth, et al. 2014. “The Algorithmic\nFoundations of Differential Privacy.” Foundations and\nTrends in Theoretical Computer Science 9 (3–4):\n211–407.\n\n\nEshraghian, Jason K., Max Ward, Emre O. Neftci, Xinxin Wang, Gregor\nLenz, Girish Dwivedi, Mohammed Bennamoun, Doo Seok Jeong, and Wei D. Lu.\n2023. “Training Spiking Neural Networks Using Lessons from Deep\nLearning.” Proceedings of the IEEE 111 (9): 1016–54. https://doi.org/10.1109/JPROC.2023.3308088.\n\n\nEsteva, Andre, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M\nSwetter, Helen M Blau, and Sebastian Thrun. 2017.\n“Dermatologist-Level Classification of Skin Cancer with Deep\nNeural Networks.” Nature 542 (7639): 115–18.\n\n\nFahim, Farah, Benjamin Hawks, Christian Herwig, James Hirschauer, Sergo\nJindariani, Nhan Tran, Luca P. Carloni, et al. 2021. “Hls4ml: An\nOpen-Source Codesign Workflow to Empower Scientific Low-Power Machine\nLearning Devices.” https://arxiv.org/abs/2103.05579.\n\n\nFarah, Martha J. 2005. “Neuroethics: The Practical and the\nPhilosophical.” Trends in Cognitive Sciences 9 (1):\n34–40.\n\n\nFowers, Jeremy, Kalin Ovtcharov, Michael Papamichael, Todd Massengill,\nMing Liu, Daniel Lo, Shlomi Alkalay, et al. 2018. “A Configurable\nCloud-Scale DNN Processor for Real-Time AI.” In 2018 ACM/IEEE\n45th Annual International Symposium on Computer Architecture\n(ISCA), 1–14. IEEE.\n\n\nFrankle, Jonathan, and Michael Carbin. 2019. “The\nLottery Ticket Hypothesis:\nFinding Sparse, Trainable\nNeural Networks.” arXiv. https://doi.org/10.48550/arXiv.1803.03635.\n\n\nFurber, Steve. 2016. “Large-Scale Neuromorphic Computing\nSystems.” Journal of Neural Engineering 13 (5): 051001.\n\n\nGale, Trevor, Erich Elsen, and Sara Hooker. 2019. “The State of\nSparsity in Deep Neural Networks.” arXiv Preprint\narXiv:1902.09574.\n\n\nGannot, G., and M. Ligthart. 1994. “Verilog HDL Based FPGA\nDesign.” In International Verilog HDL Conference, 86–92.\nhttps://doi.org/10.1109/IVC.1994.323743.\n\n\nGates, Byron D. 2009. “Flexible Electronics.”\nScience 323 (5921): 1566–67.\n\n\nGaviria Rojas, William, Sudnya Diamos, Keertan Kini, David Kanter, Vijay\nJanapa Reddi, and Cody Coleman. 2022. “The Dollar Street Dataset:\nImages Representing the Geographic and Socioeconomic Diversity of the\nWorld.” Advances in Neural Information Processing\nSystems 35: 12979–90.\n\n\nGebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman\nVaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021.\n“Datasheets for Datasets.” Communications of the\nACM 64 (12): 86–92. https://doi.org/10.1145/3458723.\n\n\nGholami, Dong Kim, Mahoney Yao, and Keutzer. 2021. 
“A Survey of\nQuantization Methods for Efficient Neural Network Inference).” https://doi.org/10.48550/arXiv.2103.13630.\n\n\nGoodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David\nWarde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020.\n“Generative Adversarial Networks.” Communications of\nthe ACM 63 (11): 139–44.\n\n\nGoodyear, Victoria A. 2017. “Social Media, Apps and Wearable\nTechnologies: Navigating Ethical Dilemmas and Procedures.”\nQualitative Research in Sport, Exercise and Health 9 (3):\n285–302.\n\n\nGoogle. 2023. “Three Floating Point Formats.” https://storage.googleapis.com/gweb-cloudblog-publish/images/Three_floating-point_formats.max-624x261.png.\n\n\n———. n.d. “Information Quality & Content Moderation.”\nhttps://blog.google/documents/83/.\n\n\nGordon, Ariel, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang,\nand Edward Choi. 2018. “Morphnet: Fast & Simple\nResource-Constrained Structure Learning of Deep Networks.” In\nProceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, 1586–95.\n\n\nGruslys, Audrunas, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex\nGraves. 2016. “Memory-Efficient Backpropagation Through\nTime.” Advances in Neural Information Processing Systems\n29.\n\n\nGu, Ivy. 2023. “Deep Learning Model\nCompression (Ii) by Ivy\nGu Medium.” https://ivygdy.medium.com/deep-learning-model-compression-ii-546352ea9453.\n\n\nGwennap, Linley. n.d. “Certus-NX\nInnovates General-Purpose\nFPGAs.”\n\n\nHaensch, Wilfried, Tayfun Gokmen, and Ruchir Puri. 2018. “The Next\nGeneration of Deep Learning Hardware: Analog Computing.”\nProceedings of the IEEE 107 (1): 108–22.\n\n\nHan, Song, Huizi Mao, and William J. Dally. 2016. “Deep\nCompression: Compressing Deep Neural Networks with Pruning, Trained\nQuantization and Huffman Coding.” https://arxiv.org/abs/1510.00149.\n\n\nHan, Mao, and Dally. 2016. “Deep Compression: Compressing Deep\nNeural Networks with Pruning, Trained Quantization and Huffman\nCoding.” https://doi.org/10.48550/arXiv.1510.00149.\n\n\nHazan, Avi, and Elishai Ezra Tsur. 2021. “Neuromorphic Analog\nImplementation of Neural Engineering Framework-Inspired Spiking Neuron\nfor High-Dimensional Representation.” Frontiers in\nNeuroscience 15: 627221.\n\n\nHe, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016.\n“Deep Residual Learning for Image Recognition.” In\nProceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, 770–78.\n\n\nHegde, Sumant. 2023. “An Introduction to\nSeparable Convolutions -\nAnalytics Vidhya.” https://www.analyticsvidhya.com/blog/2021/11/an-introduction-to-separable-convolutions/.\n\n\nHendrycks, Dan, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn\nSong. 2021. “Natural Adversarial Examples.” In\nProceedings of the IEEE/CVF Conference on Computer Vision and\nPattern Recognition, 15262–71.\n\n\nHennessy, John L, and David A Patterson. 2019. “A New Golden Age\nfor Computer Architecture.” Commun. ACM 62 (2): 48–60.\n\n\nHinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. “Distilling\nthe Knowledge in a Neural Network.” https://arxiv.org/abs/1503.02531.\n\n\nHolland, Sarah, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia\nChmielinski. 2020. “The Dataset Nutrition Label.” Data\nProtection and Privacy. https://doi.org/10.5040/9781509932771.ch-001.\n\n\nHong, Sanghyun, Nicholas Carlini, and Alexey Kurakin. 
2023.\n“Publishing Efficient on-Device Models Increases Adversarial\nVulnerability.” In 2023 IEEE Conference on Secure and\nTrustworthy Machine Learning (SaTML), 271–90. IEEE.\n\n\nHoward, Andrew G., Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun\nWang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017a.\n“MobileNets: Efficient Convolutional Neural Networks for Mobile\nVision Applications.” arXiv Preprint arXiv:1704.04861.\nhttps://arxiv.org/abs/1704.04861.\n\n\n———. 2017b. “MobileNets: Efficient\nConvolutional Neural Networks for\nMobile Vision\nApplications.” arXiv. https://doi.org/10.48550/arXiv.1704.04861.\n\n\nHuang, Tsung-Ching, Kenjiro Fukuda, Chun-Ming Lo, Yung-Hui Yeh, Tsuyoshi\nSekitani, Takao Someya, and Kwang-Ting Cheng. 2010. “Pseudo-CMOS:\nA Design Style for Low-Cost and Robust Flexible Electronics.”\nIEEE Transactions on Electron Devices 58 (1): 141–50.\n\n\nIandola, Forrest N, Song Han, Matthew W Moskewicz, Khalid Ashraf,\nWilliam J Dally, and Kurt Keutzer. 2016. “SqueezeNet:\nAlexNet-Level Accuracy with 50x Fewer Parameters and< 0.5 MB Model\nSize.” arXiv Preprint arXiv:1602.07360.\n\n\nIgnatov, Andrey, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim\nHartley, and Luc Van Gool. 2018a. “AI Benchmark:\nRunning Deep Neural Networks on Android Smartphones.”\n\n\n———. 2018b. “Ai Benchmark: Running Deep Neural Networks on Android\nSmartphones.” In Proceedings of the European Conference on\nComputer Vision (ECCV) Workshops, 0–0.\n\n\nImani, Mohsen, Abbas Rahimi, and Tajana S Rosing. 2016. “Resistive\nConfigurable Associative Memory for Approximate Computing.” In\n2016 Design, Automation & Test in Europe Conference &\nExhibition (DATE), 1327–32. IEEE.\n\n\nIntelLabs. 2023. “Knowledge Distillation -\nNeural Network Distiller.”\nhttps://intellabs.github.io/distiller/knowledge_distillation.html.\n\n\nISSCC. 2014. “Computing’s Energy Problem (and What We Can Do about\nIt).” https://ieeexplore.ieee.org/document/6757323.\n\n\nJacob, Benoit, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang,\nAndrew Howard, Hartwig Adam, and Dmitry Kalenichenko. 2018.\n“Quantization and Training of Neural Networks for Efficient\nInteger-Arithmetic-Only Inference.” In Proceedings of the\nIEEE Conference on Computer Vision and Pattern Recognition,\n2704–13.\n\n\nJia, Yangqing, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan\nLong, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014.\n“Caffe: Convolutional Architecture for Fast Feature\nEmbedding.” In Proceedings of the 22nd ACM International\nConference on Multimedia, 675–78.\n\n\nJia, Zhe, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza.\n2018. “Dissecting the NVIDIA Volta\nGPU Architecture via\nMicrobenchmarking.” arXiv. http://arxiv.org/abs/1804.06826.\n\n\nJia, Zhenge, Dawei Li, Xiaowei Xu, Na Li, Feng Hong, Lichuan Ping, and\nYiyu Shi. 2023. “Life-Threatening Ventricular Arrhythmia Detection\nChallenge in Implantable Cardioverter–Defibrillators.” Nature\nMachine Intelligence 5 (5): 554–55.\n\n\nJia, Zhihao, Matei Zaharia, and Alex Aiken. 2019. “Beyond Data and\nModel Parallelism for Deep Neural Networks.” Proceedings of\nMachine Learning and Systems 1: 1–13.\n\n\nJiang, Weiwen, Xinyi Zhang, Edwin H. -M. Sha, Lei Yang, Qingfeng Zhuge,\nYiyu Shi, and Jingtong Hu. 2019. “Accuracy Vs. 
Efficiency:\nAchieving Both Through FPGA-Implementation Aware Neural Architecture\nSearch.” https://arxiv.org/abs/1901.11211.\n\n\nJohnson-Roberson, Matthew, Charles Barto, Rounak Mehta, Sharath Nittur\nSridhar, Karl Rosaen, and Ram Vasudevan. 2017. “Driving in the\nMatrix: Can Virtual Worlds Replace Human-Generated Annotations for Real\nWorld Tasks?” 2017 IEEE International Conference on Robotics\nand Automation (ICRA). https://doi.org/10.1109/icra.2017.7989092.\n\n\nJouppi, Norman P., Cliff Young, Nishant Patil, David Patterson, Gaurav\nAgrawal, Raminder Bajwa, Sarah Bates, et al. 2017a. “In-Datacenter\nPerformance Analysis of a Tensor Processing Unit.” In\nProceedings of the 44th Annual International Symposium on Computer\nArchitecture, 1–12. ISCA ’17. New York, NY, USA: Association for\nComputing Machinery. https://doi.org/10.1145/3079856.3080246.\n\n\nJouppi, Norman P, Cliff Young, Nishant Patil, David Patterson, Gaurav\nAgrawal, Raminder Bajwa, Sarah Bates, et al. 2017b. “In-Datacenter\nPerformance Analysis of a Tensor Processing Unit.” In\nProceedings of the 44th Annual International Symposium on Computer\nArchitecture, 1–12.\n\n\nJouppi, Norm, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng\nNai, Nishant Patil, et al. 2023. “TPU V4: An Optically\nReconfigurable Supercomputer for Machine Learning with Hardware Support\nfor Embeddings.” In Proceedings of the 50th Annual\nInternational Symposium on Computer Architecture. ISCA ’23. New\nYork, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3579371.3589350.\n\n\nKairouz, Peter, Sewoong Oh, and Pramod Viswanath. 2015. “Secure\nMulti-Party Differential Privacy.” Advances in Neural\nInformation Processing Systems 28.\n\n\nKarargyris, Alexandros, Renato Umeton, Micah J Sheller, Alejandro\nAristizabal, Johnu George, Anna Wuest, Sarthak Pati, et al. 2023.\n“Federated Benchmarking of Medical Artificial Intelligence with\nMedPerf.” Nature Machine Intelligence 5 (7): 799–810.\n\n\nKiela, Douwe, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger,\nZhengxuan Wu, Bertie Vidgen, et al. 2021. “Dynabench: Rethinking\nBenchmarking in NLP.” arXiv Preprint arXiv:2104.14337.\n\n\nKoh, Pang Wei, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin\nZhang, Akshay Balsubramani, Weihua Hu, et al. 2021. “Wilds: A\nBenchmark of in-the-Wild Distribution Shifts.” In\nInternational Conference on Machine Learning, 5637–64. PMLR.\n\n\nKrishna, Adithya, Srikanth Rohit Nudurupati, Chandana D G, Pritesh\nDwivedi, André van Schaik, Mahesh Mehendale, and Chetan Singh Thakur.\n2023. “RAMAN: A Re-Configurable and Sparse tinyML Accelerator for\nInference on Edge.” https://arxiv.org/abs/2306.06493.\n\n\nKrishnamoorthi. 2018. “Quantizing Deep Convolutional Networks for\nEfficient Inference: A Whitepaper.” arXiv. https://doi.org/10.48550/arXiv.1806.08342.\n\n\nKrishnan, Rayan, Pranav Rajpurkar, and Eric J. Topol. 2022.\n“Self-Supervised Learning in Medicine and Healthcare.”\nNature Biomedical Engineering 6 (12): 1346–52. https://doi.org/10.1038/s41551-022-00914-1.\n\n\nKrizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012.\n“Imagenet Classification with Deep Convolutional Neural\nNetworks.” Advances in Neural Information Processing\nSystems 25.\n\n\nKung, H. T., Bradley McDanel, and Sai Qian Zhang. 2018. “Packing\nSparse Convolutional Neural Networks for Efficient Systolic Array\nImplementations: Column Combining Under Joint Optimization.” https://arxiv.org/abs/1811.04770.\n\n\nKung, Hsiang Tsung, and Charles E Leiserson. 
1979. “Systolic\nArrays (for VLSI).” In Sparse Matrix Proceedings 1978,\n1:256–82. Society for industrial; applied mathematics Philadelphia, PA,\nUSA.\n\n\nKuzmin, Andrey, Mart Van Baalen, Yuwei Ren, Markus Nagel, Jorn Peters,\nand Tijmen Blankevoort. 2022. “FP8 Quantization: The Power of the\nExponent.” https://arxiv.org/abs/2208.09225.\n\n\nKwon, Jisu, and Daejin Park. 2021. “Hardware/Software Co-Design\nfor TinyML Voice-Recognition Application on Resource Frugal Edge\nDevices.” Applied Sciences 11 (22). https://doi.org/10.3390/app112211073.\n\n\nKwon, Sun Hwa, and Lin Dong. 2022. “Flexible Sensors and Machine\nLearning for Heart Monitoring.” Nano Energy, 107632.\n\n\nKwon, Young D, Rui Li, Stylianos I Venieris, Jagmohan Chauhan, Nicholas\nD Lane, and Cecilia Mascolo. 2023. “TinyTrain: Deep Neural Network\nTraining at the Extreme Edge.” arXiv Preprint\narXiv:2307.09988.\n\n\nLai, Liangzhen, Naveen Suda, and Vikas Chandra. 2018a. “Cmsis-Nn:\nEfficient Neural Network Kernels for Arm Cortex-m Cpus.”\narXiv Preprint arXiv:1801.06601.\n\n\n———. 2018b. “CMSIS-NN: Efficient Neural Network Kernels for Arm\nCortex-m CPUs.” https://arxiv.org/abs/1801.06601.\n\n\nLeCun, Yann, John Denker, and Sara Solla. 1989. “Optimal Brain\nDamage.” Advances in Neural Information Processing\nSystems 2.\n\n\nLi, En, Liekang Zeng, Zhi Zhou, and Xu Chen. 2019. “Edge AI:\nOn-Demand Accelerating Deep Neural Network Inference via Edge\nComputing.” IEEE Transactions on Wireless Communications\n19 (1): 447–57.\n\n\nLi, Mu, David G Andersen, Alexander J Smola, and Kai Yu. 2014.\n“Communication Efficient Distributed Machine Learning with the\nParameter Server.” Advances in Neural Information Processing\nSystems 27.\n\n\nLi, Xiang, Tao Qin, Jian Yang, and Tie-Yan Liu. 2016. “LightRNN:\nMemory and Computation-Efficient Recurrent Neural Networks.”\nAdvances in Neural Information Processing Systems 29.\n\n\nLi, Yuhang, Xin Dong, and Wei Wang. 2020. “Additive Powers-of-Two\nQuantization: An Efficient Non-Uniform Discretization for Neural\nNetworks.” In International Conference on Learning\nRepresentations. https://openreview.net/forum?id=BkgXT24tDS.\n\n\nLi, Zhizhong, and Derek Hoiem. 2017. “Learning Without\nForgetting.” IEEE Transactions on Pattern Analysis and\nMachine Intelligence 40 (12): 2935–47.\n\n\nLin, Ji, Wei-Ming Chen, Yujun Lin, Chuang Gan, Song Han, et al. 2020.\n“Mcunet: Tiny Deep Learning on Iot Devices.” Advances\nin Neural Information Processing Systems 33: 11711–22. https://arxiv.org/abs/2007.10319.\n\n\nLin, Ji, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song\nHan. 2023. “AWQ: Activation-Aware Weight Quantization for LLM\nCompression and Acceleration.” arXiv.\n\n\nLin, Ji, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, and Song\nHan. 2022a. “On-Device Training Under 256kb Memory.”\nAdvances in Neural Information Processing Systems 35: 22941–54.\n\n\n———. 2022b. “On-Device Training Under 256KB Memory.” In\nArXiv.\n\n\nLin, Tsung-Yi, Michael Maire, Serge Belongie, James Hays, Pietro Perona,\nDeva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014.\n“Microsoft Coco: Common Objects in Context.” In\nComputer Vision–ECCV 2014: 13th European Conference, Zurich,\nSwitzerland, September 6-12, 2014, Proceedings, Part v 13, 740–55.\nSpringer.\n\n\nLindholm, Erik, John Nickolls, Stuart Oberman, and John Montrym. 2008.\n“NVIDIA Tesla: A\nUnified Graphics and Computing\nArchitecture.” IEEE Micro 28 (2): 39–55. https://doi.org/10.1109/MM.2008.31.\n\n\nLin, Tang Tang, Dang Yang, and Han Gan. 2023. 
“AWQ:\nActivation-Aware Weight Quantization for LLM Compression and\nAcceleration.” https://doi.org/10.48550/arXiv.2306.00978.\n\n\nLoh, Gabriel H. 2008. “3D-Stacked Memory Architectures for\nMulti-Core Processors.” ACM SIGARCH Computer Architecture\nNews 36 (3): 453–64.\n\n\nLuebke, David. 2008. “CUDA: Scalable Parallel Programming for\nHigh-Performance Scientific Computing.” In 2008 5th IEEE\nInternational Symposium on Biomedical Imaging: From Nano to Macro,\n836–38. https://doi.org/10.1109/ISBI.2008.4541126.\n\n\nLundberg, Scott M, and Su-In Lee. 2017. “A Unified Approach to\nInterpreting Model Predictions.” Advances in Neural\nInformation Processing Systems 30.\n\n\nMaass, Wolfgang. 1997. “Networks of Spiking Neurons: The Third\nGeneration of Neural Network Models.” Neural Networks 10\n(9): 1659–71.\n\n\nMarković, Danijela, Alice Mizrahi, Damien Querlioz, and Julie Grollier.\n2020. “Physics for Neuromorphic Computing.” Nature\nReviews Physics 2 (9): 499–510.\n\n\nMattson, Peter, Christine Cheng, Gregory Diamos, Cody Coleman, Paulius\nMicikevicius, David Patterson, Hanlin Tang, et al. 2020. “Mlperf\nTraining Benchmark.” Proceedings of Machine Learning and\nSystems 2: 336–49.\n\n\nMcMahan, Brendan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise\nAguera y Arcas. 2017. “Communication-Efficient Learning of Deep\nNetworks from Decentralized Data.” In Artificial Intelligence\nand Statistics, 1273–82. PMLR.\n\n\nMiller, David AB. 2000. “Optical Interconnects to Silicon.”\nIEEE Journal of Selected Topics in Quantum Electronics 6 (6):\n1312–17.\n\n\nMittal, Sparsh, Gaurav Verma, Brajesh Kaushik, and Farooq A Khanday.\n2021. “A Survey of SRAM-Based in-Memory Computing Techniques and\nApplications.” Journal of Systems Architecture 119:\n102276.\n\n\nModha, Dharmendra S, Filipp Akopyan, Alexander Andreopoulos,\nRathinakumar Appuswamy, John V Arthur, Andrew S Cassidy, Pallab Datta,\net al. 2023. “Neural Inference at the Frontier of Energy, Space,\nand Time.” Science 382 (6668): 329–35.\n\n\nMoshawrab, Mohammad, Mehdi Adda, Abdenour Bouzouane, Hussein Ibrahim,\nand Ali Raad. 2023. “Reviewing Federated Learning Aggregation\nAlgorithms; Strategies, Contributions, Limitations and Future\nPerspectives.” Electronics 12 (10): 2287.\n\n\nMunshi, Aaftab. 2009. “The OpenCL Specification.” In\n2009 IEEE Hot Chips 21 Symposium (HCS), 1–314. https://doi.org/10.1109/HOTCHIPS.2009.7478342.\n\n\nMusk, Elon et al. 2019. “An Integrated Brain-Machine Interface\nPlatform with Thousands of Channels.” Journal of Medical\nInternet Research 21 (10): e16194.\n\n\nNguyen, Ngoc-Bao, Keshigeyan Chandrasegaran, Milad Abdollahzadeh, and\nNgai-Man Cheung. 2023. “Re-Thinking Model Inversion Attacks\nAgainst Deep Neural Networks.” In Proceedings of the IEEE/CVF\nConference on Computer Vision and Pattern Recognition, 16384–93.\n\n\nNorrie, Thomas, Nishant Patil, Doe Hyun Yoon, George Kurian, Sheng Li,\nJames Laudon, Cliff Young, Norman Jouppi, and David Patterson. 2021.\n“The Design Process for Google’s Training Chips: TPUv2 and\nTPUv3.” IEEE Micro 41 (2): 56–63. https://doi.org/10.1109/MM.2021.3058217.\n\n\nNorthcutt, Curtis G, Anish Athalye, and Jonas Mueller. 2021.\n“Pervasive Label Errors in Test Sets Destabilize Machine Learning\nBenchmarks.” arXiv, March. https://doi.org/ \nhttps://doi.org/10.48550/arXiv.2103.14749 arXiv-issued DOI via\nDataCite.\n\n\nOoko, Samson Otieno, Marvin Muyonga Ogore, Jimmy Nsenga, and Marco\nZennaro. 2021. 
“TinyML in Africa: Opportunities and\nChallenges.” In 2021 IEEE Globecom Workshops (GC\nWkshps), 1–6. IEEE.\n\n\nPan, Sinno Jialin, and Qiang Yang. 2009. “A Survey on Transfer\nLearning.” IEEE Transactions on Knowledge and Data\nEngineering 22 (10): 1345–59.\n\n\nPaszke, Adam, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury,\nGregory Chanan, Trevor Killeen, et al. 2019. “Pytorch: An\nImperative Style, High-Performance Deep Learning Library.”\nAdvances in Neural Information Processing Systems 32.\n\n\nPatterson, David A, and John L Hennessy. 2016. Computer Organization\nand Design ARM Edition: The Hardware Software Interface. Morgan\nkaufmann.\n\n\nPrakash, Shvetank, Tim Callahan, Joseph Bushagour, Colby Banbury, Alan\nV. Green, Pete Warden, Tim Ansell, and Vijay Janapa Reddi. 2023.\n“CFU Playground: Full-Stack Open-Source Framework for\nTiny Machine Learning (TinyML) Acceleration on\nFPGAs.” In 2023 IEEE International\nSymposium on Performance Analysis of Systems and Software\n(ISPASS). IEEE. https://doi.org/10.1109/ispass57527.2023.00024.\n\n\nPushkarna, Mahima, Andrew Zaldivar, and Oddur Kjartansson. 2022.\n“Data Cards: Purposeful and Transparent Dataset Documentation for\nResponsible Ai.” 2022 ACM Conference on Fairness,\nAccountability, and Transparency. https://doi.org/10.1145/3531146.3533231.\n\n\nPutnam, Andrew, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros\nConstantinides, John Demme, Hadi Esmaeilzadeh, et al. 2014. “A\nReconfigurable Fabric for Accelerating Large-Scale Datacenter\nServices.” ACM SIGARCH Computer Architecture News 42\n(3): 13–24. https://doi.org/10.1145/2678373.2665678.\n\n\nQi, Chen, Shibo Shen, Rongpeng Li, Zhao Zhifeng, Qing Liu, Jing Liang,\nand Honggang Zhang. 2021. “An Efficient Pruning Scheme of Deep\nNeural Networks for Internet of Things\nApplications.” EURASIP Journal on Advances in Signal\nProcessing 2021 (June). https://doi.org/10.1186/s13634-021-00744-4.\n\n\nRaina, Rajat, Anand Madhavan, and Andrew Y. Ng. 2009. “Large-Scale\nDeep Unsupervised Learning Using Graphics Processors.” In\nProceedings of the 26th Annual\nInternational Conference on\nMachine Learning, 873–80. Montreal Quebec\nCanada: ACM. https://doi.org/10.1145/1553374.1553486.\n\n\nRamcharan, Amanda, Kelsee Baranowski, Peter McCloskey, Babuali Ahmed,\nJames Legg, and David P Hughes. 2017. “Deep Learning for\nImage-Based Cassava Disease Detection.” Frontiers in Plant\nScience 8: 1852.\n\n\nRanganathan, Parthasarathy. 2011. “From Microprocessors to\nNanostores: Rethinking Data-Centric Systems.” Computer (Long\nBeach Calif.) 44 (1): 39–48.\n\n\nRao, Ravi. 2021. Www.wevolver.com. https://www.wevolver.com/article/tinyml-unlocks-new-possibilities-for-sustainable-development-technologies.\n\n\nRatner, Alex, Braden Hancock, Jared Dunnmon, Roger Goldman, and\nChristopher Ré. 2018. “Snorkel Metal: Weak Supervision for\nMulti-Task Learning.” Proceedings of the Second Workshop on\nData Management for End-To-End Machine Learning. https://doi.org/10.1145/3209889.3209898.\n\n\nReddi, Vijay Janapa, Christine Cheng, David Kanter, Peter Mattson,\nGuenther Schmuelling, Carole-Jean Wu, Brian Anderson, et al. 2020.\n“Mlperf Inference Benchmark.” In 2020 ACM/IEEE 47th\nAnnual International Symposium on Computer Architecture (ISCA),\n446–59. IEEE.\n\n\nRibeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. 
“\"\nWhy Should i Trust You?\" Explaining the Predictions of Any\nClassifier.” In Proceedings of the 22nd ACM SIGKDD\nInternational Conference on Knowledge Discovery and Data Mining,\n1135–44.\n\n\nRosenblatt, Frank. 1957. The Perceptron, a Perceiving and\nRecognizing Automaton Project Para. Cornell Aeronautical\nLaboratory.\n\n\nRoskies, Adina. 2002. “Neuroethics for the New Millenium.”\nNeuron 35 (1): 21–23.\n\n\nRouhani, Bita, Azalia Mirhoseini, and Farinaz Koushanfar. 2017.\n“TinyDL: Just-in-Time Deep Learning Solution for Constrained\nEmbedded Systems.” In, 1–4. https://doi.org/10.1109/ISCAS.2017.8050343.\n\n\nRumelhart, David E, Geoffrey E Hinton, and Ronald J Williams. 1986.\n“Learning Representations by Back-Propagating Errors.”\nNature 323 (6088): 533–36.\n\n\nSamajdar, Ananda, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar\nKrishna. 2018. “Scale-Sim: Systolic Cnn Accelerator\nSimulator.” arXiv Preprint arXiv:1811.02883.\n\n\nSchuman, Catherine D, Shruti R Kulkarni, Maryam Parsa, J Parker\nMitchell, Prasanna Date, and Bill Kay. 2022. “Opportunities for\nNeuromorphic Computing Algorithms and Applications.” Nature\nComputational Science 2 (1): 10–19.\n\n\nSegal, Mark, and Kurt Akeley. 1999. “The OpenGL Graphics System: A\nSpecification (Version 1.1).”\n\n\nSegura Anaya, LH, Abeer Alsadoon, Nectar Costadopoulos, and PWC Prasad.\n2018. “Ethical Implications of User Perceptions of Wearable\nDevices.” Science and Engineering Ethics 24: 1–28.\n\n\nSeide, Frank, and Amit Agarwal. 2016. “CNTK: Microsoft’s\nOpen-Source Deep-Learning Toolkit.” In Proceedings of the\n22nd ACM SIGKDD International Conference on Knowledge Discovery and Data\nMining, 2135–35.\n\n\nSeyedzadeh, Saleh, Farzad Pour Rahimian, Ivan Glesk, and Marc Roper.\n2018. “Machine Learning for Estimation of Building Energy\nConsumption and Performance: A Review.” Visualization in\nEngineering 6: 1–20.\n\n\nShastri, Bhavin J, Alexander N Tait, Thomas Ferreira de Lima, Wolfram HP\nPernice, Harish Bhaskaran, C David Wright, and Paul R Prucnal. 2021.\n“Photonics for Artificial Intelligence and Neuromorphic\nComputing.” Nature Photonics 15 (2): 102–14.\n\n\nSheng, Victor S., and Jing Zhang. 2019. “Machine Learning with\nCrowdsourcing: A Brief Summary of the Past Research and Future\nDirections.” Proceedings of the AAAI Conference on Artificial\nIntelligence 33 (01): 9837–43. https://doi.org/10.1609/aaai.v33i01.33019837.\n\n\nShi, Hongrui, and Valentin Radu. 2022. “Data Selection for\nEfficient Model Update in Federated Learning.” In Proceedings\nof the 2nd European Workshop on Machine Learning and Systems,\n72–78.\n\n\nSuda, Naveen, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma,\nSarma Vrudhula, Jae-sun Seo, and Yu Cao. 2016.\n“Throughput-Optimized OpenCL-Based FPGA Accelerator for\nLarge-Scale Convolutional Neural Networks.” In Proceedings of\nthe 2016 ACM/SIGDA International Symposium on Field-Programmable Gate\nArrays, 16–25.\n\n\nSze, Vivienne, Yu-Hsin Chen, Tien-Ju Yang, and Joel Emer. 2017a.\n“Efficient Processing of Deep Neural Networks: A Tutorial and\nSurvey,” March. https://arxiv.org/abs/1703.09039.\n\n\nSze, Vivienne, Yu-Hsin Chen, Tien-Ju Yang, and Joel S Emer. 2017b.\n“Efficient Processing of Deep Neural Networks: A Tutorial and\nSurvey.” Proceedings of the IEEE 105 (12): 2295–2329.\n\n\nTan, Mingxing, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler,\nAndrew Howard, and Quoc V Le. 2019. 
“Mnasnet: Platform-Aware\nNeural Architecture Search for Mobile.” In Proceedings of the\nIEEE/CVF Conference on Computer Vision and Pattern Recognition,\n2820–28.\n\n\nTan, Mingxing, and Quoc V. Le. 2020. “EfficientNet: Rethinking\nModel Scaling for Convolutional Neural Networks.” https://arxiv.org/abs/1905.11946.\n\n\nTang, Xin, Yichun He, and Jia Liu. 2022. “Soft Bioelectronics for\nCardiac Interfaces.” Biophysics Reviews 3 (1).\n\n\nTang, Xin, Hao Shen, Siyuan Zhao, Na Li, and Jia Liu. 2023.\n“Flexible Brain–Computer Interfaces.” Nature\nElectronics 6 (2): 109–18.\n\n\nTeam, The Theano Development, Rami Al-Rfou, Guillaume Alain, Amjad\nAlmahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, et\nal. 2016. “Theano: A Python Framework for Fast Computation of\nMathematical Expressions.” https://arxiv.org/abs/1605.02688.\n\n\n“The Ultimate Guide to Deep Learning Model Quantization and\nQuantization-Aware Training.” n.d. https://deci.ai/quantization-and-quantization-aware-training/.\n\n\nTirtalistyani, Rose, Murtiningrum Murtiningrum, and Rameshwar S Kanwar.\n2022. “Indonesia Rice Irrigation System: Time for\nInnovation.” Sustainability 14 (19): 12477.\n\n\nTokui, Seiya, Kenta Oono, Shohei Hido, and Justin Clayton. 2015.\n“Chainer: A Next-Generation Open Source Framework for Deep\nLearning.” In Proceedings of Workshop on Machine Learning\nSystems (LearningSys) in the Twenty-Ninth Annual Conference on Neural\nInformation Processing Systems (NIPS), 5:1–6.\n\n\nVaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion\nJones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017.\n“Attention Is All You Need.” Advances in Neural\nInformation Processing Systems 30.\n\n\n“Vector-Borne Diseases.” n.d. https://www.who.int/news-room/fact-sheets/detail/vector-borne-diseases.\n\n\nVerma, Naveen, Hongyang Jia, Hossein Valavi, Yinqi Tang, Murat Ozatay,\nLung-Yen Chen, Bonan Zhang, and Peter Deaville. 2019. “In-Memory\nComputing: Advances and Prospects.” IEEE Solid-State Circuits\nMagazine 11 (3): 43–55.\n\n\nVerma, Team Dual_Boot: Swapnil. 2022. “Elephant AI.”\nHackster.io. https://www.hackster.io/dual_boot/elephant-ai-ba71e9.\n\n\nVinuesa, Ricardo, Hossein Azizpour, Iolanda Leite, Madeline Balaam,\nVirginia Dignum, Sami Domisch, Anna Felländer, Simone Daniela Langhans,\nMax Tegmark, and Francesco Fuso Nerini. 2020. “The Role of\nArtificial Intelligence in Achieving the Sustainable Development\nGoals.” Nature Communications 11 (1): 1–10.\n\n\nVivet, Pascal, Eric Guthmuller, Yvain Thonnart, Gael Pillonnet, César\nFuguet, Ivan Miro-Panades, Guillaume Moritz, et al. 2021. “IntAct:\nA 96-Core Processor with Six Chiplets 3D-Stacked on an Active Interposer\nwith Distributed Interconnects and Integrated Power Management.”\nIEEE Journal of Solid-State Circuits 56 (1): 79–97. https://doi.org/10.1109/JSSC.2020.3036341.\n\n\nWang, Tianzhe, Kuan Wang, Han Cai, Ji Lin, Zhijian Liu, Hanrui Wang,\nYujun Lin, and Song Han. 2020. “APQ: Joint Search for Network\nArchitecture, Pruning and Quantization Policy.” In 2020\nIEEE/CVF Conference on Computer Vision and Pattern Recognition\n(CVPR), 2075–84. https://doi.org/10.1109/CVPR42600.2020.00215.\n\n\nWarden, Pete. 2018. “Speech Commands: A Dataset for\nLimited-Vocabulary Speech Recognition.” arXiv Preprint\narXiv:1804.03209.\n\n\nWarden, Pete, and Daniel Situnayake. 2019. Tinyml: Machine Learning\nwith Tensorflow Lite on Arduino and Ultra-Low-Power\nMicrocontrollers. O’Reilly Media.\n\n\nWeik, Martin H. 1955. 
A Survey of Domestic\nElectronic Digital Computing\nSystems. Ballistic Research Laboratories.\n\n\nWong, H-S Philip, Heng-Yuan Lee, Shimeng Yu, Yu-Sheng Chen, Yi Wu,\nPang-Shiu Chen, Byoungil Lee, Frederick T Chen, and Ming-Jinn Tsai.\n2012. “Metal–Oxide RRAM.” Proceedings of the IEEE\n100 (6): 1951–70.\n\n\nWu, Bichen, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming\nWu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. 2019.\n“Fbnet: Hardware-Aware Efficient Convnet Design via Differentiable\nNeural Architecture Search.” In Proceedings of the IEEE/CVF\nConference on Computer Vision and Pattern Recognition, 10734–42.\n\n\nWu, Carole-Jean, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha\nArdalani, Kiwan Maeng, Gloria Chang, et al. 2022. “Sustainable Ai:\nEnvironmental Implications, Challenges and Opportunities.”\nProceedings of Machine Learning and Systems 4: 795–813.\n\n\nWu, Zhang Judd, and Micikevicius Isaev. 2020. “Integer\nQuantization for Deep Learning Inference: Principles and Empirical\nEvaluation).” https://doi.org/10.48550/arXiv.2004.09602.\n\n\nXiao, Seznec Lin, Demouth Wu, and Han. 2023. “SmoothQuant:\nAccurate and Efficient Post-Training Quantization for Large Language\nModels.” https://doi.org/10.48550/arXiv.2211.10438.\n\n\nXie, Cihang, Mingxing Tan, Boqing Gong, Jiang Wang, Alan L Yuille, and\nQuoc V Le. 2020. “Adversarial Examples Improve Image\nRecognition.” In Proceedings of the IEEE/CVF Conference on\nComputer Vision and Pattern Recognition, 819–28.\n\n\nXiong, Siyu, Guoqing Wu, Xitian Fan, Xuan Feng, Zhongcheng Huang, Wei\nCao, Xuegong Zhou, et al. 2021. “MRI-Based Brain\nTumor Segmentation Using FPGA-Accelerated Neural\nNetwork.” BMC Bioinformatics 22 (1): 421. https://doi.org/10.1186/s12859-021-04347-6.\n\n\nXiu, Liming. 2019. “Time Moore: Exploiting Moore’s Law from the\nPerspective of Time.” IEEE Solid-State Circuits Magazine\n11 (1): 39–55.\n\n\nXu, Chen, Jianqiang Yao, Zhouchen Lin, Wenwu Ou, Yuanbin Cao, Zhirong\nWang, and Hongbin Zha. 2018. “Alternating Multi-Bit Quantization\nfor Recurrent Neural Networks.” arXiv Preprint\narXiv:1802.00150.\n\n\nXu, Hu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes,\nVasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph\nFeichtenhofer. 2023. “Demystifying CLIP Data.” arXiv\nPreprint arXiv:2309.16671.\n\n\nXu, Zheng, Yanxiang Zhang, Galen Andrew, Christopher A Choquette-Choo,\nPeter Kairouz, H Brendan McMahan, Jesse Rosenstock, and Yuanbo Zhang.\n2023. “Federated Learning of Gboard Language Models with\nDifferential Privacy.” arXiv Preprint arXiv:2305.18465.\n\n\nYang, Lei, Zheyu Yan, Meng Li, Hyoukjun Kwon, Liangzhen Lai, Tushar\nKrishna, Vikas Chandra, Weiwen Jiang, and Yiyu Shi. 2020.\n“Co-Exploration of Neural Architectures and Heterogeneous ASIC\nAccelerator Designs Targeting Multiple Tasks.” https://arxiv.org/abs/2002.04116.\n\n\nYang, Tien-Ju, Yonghui Xiao, Giovanni Motta, Françoise Beaufays, Rajiv\nMathews, and Mingqing Chen. 2023. “Online Model Compression for\nFederated Learning with Large Models.” In ICASSP 2023-2023\nIEEE International Conference on Acoustics, Speech and Signal Processing\n(ICASSP), 1–5. IEEE.\n\n\nYoung, Tom, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2018.\n“Recent Trends in Deep Learning Based Natural Language\nProcessing.” Ieee Computational intelligenCe Magazine 13\n(3): 55–75.\n\n\nZennaro, Marco, Brian Plancher, and V Janapa Reddi. 2022. 
“TinyML:\nApplied AI for Development.” In The UN 7th Multi-Stakeholder\nForum on Science, Technology and Innovation for the Sustainable\nDevelopment Goals, 2022–05.\n\n\nZhang, Chen, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason\nOptimizing Cong. 2015. “FPGA-Based Accelerator Design for Deep\nConvolutional Neural Networks Proceedings of the 2015 ACM.” In\nSIGDA International Symposium on Field-Programmable Gate\nArrays-FPGA, 15:161–70.\n\n\nZhang, Li Lyna, Yuqing Yang, Yuhang Jiang, Wenwu Zhu, and Yunxin Liu.\n2020. “Fast Hardware-Aware Neural Architecture Search.” In\nProceedings of the IEEE/CVF Conference on Computer Vision and\nPattern Recognition (CVPR) Workshops.\n\n\nZhang, Tunhou, Hsin-Pai Cheng, Zhenwen Li, Feng Yan, Chengyu Huang, Hai\nLi, and Yiran Chen. 2019. “AutoShrink: A Topology-Aware NAS for\nDiscovering Efficient Neural Architecture.” https://arxiv.org/abs/1911.09251.\n\n\nZhao, Yue, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas\nChandra. 2018. “Federated Learning with Non-Iid Data.”\narXiv Preprint arXiv:1806.00582.\n\n\nZhou, Chuteng, Fernando Garcia Redondo, Julian Büchel, Irem Boybat,\nXavier Timoneda Comas, S. R. Nandakumar, Shidhartha Das, Abu Sebastian,\nManuel Le Gallo, and Paul N. Whatmough. 2021. “AnalogNets: ML-HW\nCo-Design of Noise-Robust TinyML Models and Always-on Analog\nCompute-in-Memory Accelerator.” https://arxiv.org/abs/2111.06503.\n\n\nZhu, Hongyu, Mohamed Akrout, Bojian Zheng, Andrew Pelegris, Anand\nJayarajan, Amar Phanishayee, Bianca Schroeder, and Gennady Pekhimenko.\n2018. “Benchmarking and Analyzing Deep Neural Network\nTraining.” In 2018 IEEE International Symposium on Workload\nCharacterization (IISWC), 88–100. IEEE."
+ "text": "Abadi, Martin, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya\nMironov, Kunal Talwar, and Li Zhang. 2016. “Deep Learning with\nDifferential Privacy.” In Proceedings of the 2016 ACM SIGSAC\nConference on Computer and Communications Security, 308–18.\n\n\nAbadi, Martı́n, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis,\nJeffrey Dean, Matthieu Devin, et al. 2016. “{TensorFlow}: A System for {Large-Scale} Machine Learning.” In 12th\nUSENIX Symposium on Operating Systems Design and Implementation (OSDI\n16), 265–83.\n\n\nAdolf, Robert, Saketh Rama, Brandon Reagen, Gu-Yeon Wei, and David\nBrooks. 2016. “Fathom: Reference Workloads for Modern Deep\nLearning Methods.” In 2016 IEEE International Symposium on\nWorkload Characterization (IISWC), 1–10. IEEE.\n\n\nAledhari, Mohammed, Rehma Razzak, Reza M. Parizi, and Fahad Saeed. 2020.\n“Federated Learning: A Survey on Enabling Technologies, Protocols,\nand Applications.” IEEE Access 8: 140699–725. https://doi.org/10.1109/access.2020.3013541.\n\n\nAltayeb, Moez, Marco Zennaro, and Marcelo Rovai. 2022.\n“Classifying Mosquito Wingbeat Sound Using TinyML.” In\nProceedings of the 2022 ACM Conference on Information Technology for\nSocial Good, 132–37.\n\n\nAntol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv\nBatra, C Lawrence Zitnick, and Devi Parikh. 2015. “Vqa: Visual\nQuestion Answering.” In Proceedings of the IEEE International\nConference on Computer Vision, 2425–33.\n\n\nArdila, Rosana, Megan Branson, Kelly Davis, Michael Henretty, Michael\nKohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers,\nand Gregor Weber. 2020. “Common Voice: A Massively-Multilingual\nSpeech Corpus.” Proceedings of the 12th Conference on\nLanguage Resources and Evaluation, May, 4218–22.\n\n\nARM.com. n.d. “The Future Is Being Built on Arm: Market\nDiversification Continues to Drive Strong Royalty and Licensing Growth\nas Ecosystem Reaches Quarter of a Trillion Chips Milestone –\nArm®.” https://www.arm.com/company/news/2023/02/arm-announces-q3-fy22-results.\n\n\nBains, Sunny. 2020. “The Business of Building Brains.”\nNat. Electron 3 (7): 348–51.\n\n\nBamoumen, Hatim, Anas Temouden, Nabil Benamar, and Yousra Chtouki. 2022.\n“How TinyML Can Be Leveraged to Solve Environmental Problems: A\nSurvey.” In 2022 International Conference on Innovation and\nIntelligence for Informatics, Computing, and Technologies (3ICT),\n338–43. IEEE.\n\n\nBanbury, Colby R, Vijay Janapa Reddi, Max Lam, William Fu, Amin Fazel,\nJeremy Holleman, Xinyuan Huang, et al. 2020. “Benchmarking Tinyml\nSystems: Challenges and Direction.” arXiv Preprint\narXiv:2003.04821.\n\n\nBank, Dor, Noam Koenigstein, and Raja Giryes. 2023.\n“Autoencoders.” Machine Learning for Data Science\nHandbook: Data Mining and Knowledge Discovery Handbook, 353–74.\n\n\nBarroso, Luiz André, Urs Hölzle, and Parthasarathy Ranganathan. 2019.\nThe Datacenter as a Computer: Designing Warehouse-Scale\nMachines. Springer Nature.\n\n\nBender, Emily M., and Batya Friedman. 2018. “Data Statements for\nNatural Language Processing: Toward Mitigating System Bias and Enabling\nBetter Science.” Transactions of the Association for\nComputational Linguistics 6: 587–604. https://doi.org/10.1162/tacl_a_00041.\n\n\nBenmeziane, Hadjer, Kaoutar El Maghraoui, Hamza Ouarnoughi, Smail Niar,\nMartin Wistuba, and Naigang Wang. 2021. 
“Hardware-Aware Neural\nArchitecture Search: Survey and Taxonomy.” In Proceedings of\nthe Thirtieth International Joint Conference on Artificial Intelligence,\nIJCAI-21, edited by Zhi-Hua Zhou, 4322–29.\nInternational Joint Conferences on Artificial Intelligence Organization.\nhttps://doi.org/10.24963/ijcai.2021/592.\n\n\nBeyer, Lucas, Olivier J Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and\nAäron van den Oord. 2020. “Are We Done with Imagenet?”\narXiv Preprint arXiv:2006.07159.\n\n\nBiggs, John, James Myers, Jedrzej Kufel, Emre Ozer, Simon Craske, Antony\nSou, Catherine Ramsdale, Ken Williamson, Richard Price, and Scott White.\n2021. “A Natively Flexible 32-Bit Arm Microprocessor.”\nNature 595 (7868): 532–36.\n\n\nBinkert, Nathan, Bradford Beckmann, Gabriel Black, Steven K Reinhardt,\nAli Saidi, Arkaprava Basu, Joel Hestness, et al. 2011. “The Gem5\nSimulator.” ACM SIGARCH Computer Architecture News 39\n(2): 1–7.\n\n\nBrown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan,\nPrafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language\nModels Are Few-Shot Learners.” Advances in Neural Information\nProcessing Systems 33: 1877–1901.\n\n\nBurr, Geoffrey W, Matthew J Brightsky, Abu Sebastian, Huai-Yu Cheng,\nJau-Yi Wu, Sangbum Kim, Norma E Sosa, et al. 2016. “Recent\nProgress in Phase-Change Memory Technology.” IEEE Journal on\nEmerging and Selected Topics in Circuits and Systems 6 (2): 146–62.\n\n\nCai, Han, Chuang Gan, Ligeng Zhu, and Song Han. 2020. “Tinytl:\nReduce Memory, Not Parameters for Efficient on-Device Learning.”\nAdvances in Neural Information Processing Systems 33: 11285–97.\n\n\nCai, Han, Ligeng Zhu, and Song Han. 2018. “Proxylessnas: Direct\nNeural Architecture Search on Target Task and Hardware.”\narXiv Preprint arXiv:1812.00332.\n\n\nChapelle, O., B. Scholkopf, and A. Zien Eds. 2009.\n“Semi-Supervised Learning (Chapelle, o. Et Al., Eds.; 2006) [Book\nReviews].” IEEE Transactions on Neural Networks 20 (3):\n542–42. https://doi.org/10.1109/tnn.2009.2015974.\n\n\nChen, Tianqi, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan,\nHaichen Shen, Meghan Cowan, et al. 2018. “{TVM}: An\nAutomated {End-to-End} Optimizing Compiler for Deep\nLearning.” In 13th USENIX Symposium on Operating Systems\nDesign and Implementation (OSDI 18), 578–94.\n\n\nChen, Tianqi, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016.\n“Training Deep Nets with Sublinear Memory Cost.” arXiv\nPreprint arXiv:1604.06174.\n\n\nChen, Zhiyong, and Shugong Xu. 2023. “Learning\nDomain-Heterogeneous Speaker Recognition Systems with Personalized\nContinual Federated Learning.” EURASIP Journal on Audio,\nSpeech, and Music Processing 2023 (1): 33.\n\n\nChen (陈新宇), Xinyu. 2022. “Inpainting Fluid\nDynamics with Tensor\nDecomposition (NumPy).”\nMedium. https://medium.com/@xinyu.chen/inpainting-fluid-dynamics-with-tensor-decomposition-numpy-d84065fead4d.\n\n\nCheng, Yu, Duo Wang, Pan Zhou, and Tao Zhang. 2017. “A Survey of\nModel Compression and Acceleration for Deep Neural Networks.”\narXiv Preprint arXiv:1710.09282.\n\n\nChi, Ping, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu,\nYu Wang, and Yuan Xie. 2016. “Prime: A Novel Processing-in-Memory\nArchitecture for Neural Network Computation in Reram-Based Main\nMemory.” ACM SIGARCH Computer Architecture News 44 (3):\n27–39.\n\n\nChollet, François. 2018. 
“Introduction to Keras.” March\n9th.\n\n\nChu, Grace, Okan Arikan, Gabriel Bender, Weijun Wang, Achille Brighton,\nPieter-Jan Kindermans, Hanxiao Liu, Berkin Akin, Suyog Gupta, and Andrew\nHoward. 2021. “Discovering Multi-Hardware Mobile Models via\nArchitecture Search.” In Proceedings of the IEEE/CVF\nConference on Computer Vision and Pattern Recognition, 3022–31. https://arxiv.org/abs/2008.08178.\n\n\nChua, Leon. 1971. “Memristor-the Missing Circuit Element.”\nIEEE Transactions on Circuit Theory 18 (5): 507–19.\n\n\nColeman, Cody, Edward Chou, Julian Katz-Samuels, Sean Culatana, Peter\nBailis, Alexander C Berg, Robert Nowak, Roshan Sumbaly, Matei Zaharia,\nand I Zeki Yalniz. 2022. “Similarity Search for Efficient Active\nLearning and Search of Rare Concepts.” In Proceedings of the\nAAAI Conference on Artificial Intelligence, 36:6402–10. 6.\n\n\nColeman, Cody, Deepak Narayanan, Daniel Kang, Tian Zhao, Jian Zhang,\nLuigi Nardi, Peter Bailis, Kunle Olukotun, Chris Ré, and Matei Zaharia.\n2017. “Dawnbench: An End-to-End Deep Learning Benchmark and\nCompetition.” Training 100 (101): 102.\n\n\nDavid, Robert, Jared Duke, Advait Jain, Vijay Janapa Reddi, Nat\nJeffries, Jian Li, Nick Kreeger, et al. 2021. “Tensorflow Lite\nMicro: Embedded Machine Learning for Tinyml Systems.”\nProceedings of Machine Learning and Systems 3: 800–811.\n\n\nDavies, Mike, Narayan Srinivasa, Tsung-Han Lin, Gautham Chinya,\nYongqiang Cao, Sri Harsha Choday, Georgios Dimou, et al. 2018.\n“Loihi: A Neuromorphic Manycore Processor with on-Chip\nLearning.” Ieee Micro 38 (1): 82–99.\n\n\nDavies, Mike, Andreas Wild, Garrick Orchard, Yulia Sandamirskaya,\nGabriel A Fonseca Guerra, Prasad Joshi, Philipp Plank, and Sumedh R\nRisbud. 2021. “Advancing Neuromorphic Computing with Loihi: A\nSurvey of Results and Outlook.” Proceedings of the IEEE\n109 (5): 911–34.\n\n\nDean, Jeffrey, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark\nMao, Marc’aurelio Ranzato, et al. 2012. “Large Scale Distributed\nDeep Networks.” Advances in Neural Information Processing\nSystems 25.\n\n\nDeng, Jia, R. Socher, Li Fei-Fei, Wei Dong, Kai Li, and Li-Jia Li. 2009.\n“ImageNet: A Large-Scale Hierarchical Image Database.” In\n2009 IEEE Conference on Computer Vision and Pattern\nRecognition(CVPR), 00:248–55. https://doi.org/10.1109/CVPR.2009.5206848.\n\n\nDesai, Tanvi, Felix Ritchie, Richard Welpton, et al. 2016. “Five\nSafes: Designing Data Access for Research.” Economics Working\nPaper Series 1601: 28.\n\n\nDevlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018.\n“Bert: Pre-Training of Deep Bidirectional Transformers for\nLanguage Understanding.” arXiv Preprint\narXiv:1810.04805.\n\n\nDhar, Sauptik, Junyao Guo, Jiayi Liu, Samarth Tripathi, Unmesh Kurup,\nand Mohak Shah. 2021. “A Survey of on-Device Machine Learning: An\nAlgorithms and Learning Theory Perspective.” ACM Transactions\non Internet of Things 2 (3): 1–49.\n\n\nDong, Xin, Barbara De Salvo, Meng Li, Chiao Liu, Zhongnan Qu, H. T.\nKung, and Ziyun Li. 2022. “SplitNets: Designing Neural\nArchitectures for Efficient Distributed Computing on Head-Mounted\nSystems.” https://arxiv.org/abs/2204.04705.\n\n\nDongarra, Jack J. 2009. “The Evolution of High Performance\nComputing on System z.” IBM Journal of Research and\nDevelopment 53: 3–4.\n\n\nDuarte, Javier, Nhan Tran, Ben Hawks, Christian Herwig, Jules Muhizi,\nShvetank Prakash, and Vijay Janapa Reddi. 2022. 
“FastML Science\nBenchmarks: Accelerating Real-Time Scientific Edge Machine\nLearning.” arXiv Preprint arXiv:2207.07958.\n\n\nDuisterhof, Bardienus P, Srivatsan Krishnan, Jonathan J Cruz, Colby R\nBanbury, William Fu, Aleksandra Faust, Guido CHE de Croon, and Vijay\nJanapa Reddi. 2019. “Learning to Seek: Autonomous Source Seeking\nwith Deep Reinforcement Learning Onboard a Nano Drone\nMicrocontroller.” arXiv Preprint arXiv:1909.11236.\n\n\nDuisterhof, Bardienus P, Shushuai Li, Javier Burgués, Vijay Janapa\nReddi, and Guido CHE de Croon. 2021. “Sniffy Bug: A Fully\nAutonomous Swarm of Gas-Seeking Nano Quadcopters in Cluttered\nEnvironments.” In 2021 IEEE/RSJ International Conference on\nIntelligent Robots and Systems (IROS), 9099–9106. IEEE.\n\n\nDwork, Cynthia, Aaron Roth, et al. 2014. “The Algorithmic\nFoundations of Differential Privacy.” Foundations and\nTrends in Theoretical Computer Science 9 (3–4):\n211–407.\n\n\nEshraghian, Jason K., Max Ward, Emre O. Neftci, Xinxin Wang, Gregor\nLenz, Girish Dwivedi, Mohammed Bennamoun, Doo Seok Jeong, and Wei D. Lu.\n2023. “Training Spiking Neural Networks Using Lessons from Deep\nLearning.” Proceedings of the IEEE 111 (9): 1016–54. https://doi.org/10.1109/JPROC.2023.3308088.\n\n\nEsteva, Andre, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M\nSwetter, Helen M Blau, and Sebastian Thrun. 2017.\n“Dermatologist-Level Classification of Skin Cancer with Deep\nNeural Networks.” Nature 542 (7639): 115–18.\n\n\nFahim, Farah, Benjamin Hawks, Christian Herwig, James Hirschauer, Sergo\nJindariani, Nhan Tran, Luca P. Carloni, et al. 2021. “Hls4ml: An\nOpen-Source Codesign Workflow to Empower Scientific Low-Power Machine\nLearning Devices.” https://arxiv.org/abs/2103.05579.\n\n\nFarah, Martha J. 2005. “Neuroethics: The Practical and the\nPhilosophical.” Trends in Cognitive Sciences 9 (1):\n34–40.\n\n\nFowers, Jeremy, Kalin Ovtcharov, Michael Papamichael, Todd Massengill,\nMing Liu, Daniel Lo, Shlomi Alkalay, et al. 2018. “A Configurable\nCloud-Scale DNN Processor for Real-Time AI.” In 2018 ACM/IEEE\n45th Annual International Symposium on Computer Architecture\n(ISCA), 1–14. IEEE.\n\n\nFrankle, Jonathan, and Michael Carbin. 2019. “The\nLottery Ticket Hypothesis:\nFinding Sparse, Trainable\nNeural Networks.” arXiv. https://doi.org/10.48550/arXiv.1803.03635.\n\n\nFurber, Steve. 2016. “Large-Scale Neuromorphic Computing\nSystems.” Journal of Neural Engineering 13 (5): 051001.\n\n\nGale, Trevor, Erich Elsen, and Sara Hooker. 2019. “The State of\nSparsity in Deep Neural Networks.” arXiv Preprint\narXiv:1902.09574.\n\n\nGannot, G., and M. Ligthart. 1994. “Verilog HDL Based FPGA\nDesign.” In International Verilog HDL Conference, 86–92.\nhttps://doi.org/10.1109/IVC.1994.323743.\n\n\nGates, Byron D. 2009. “Flexible Electronics.”\nScience 323 (5921): 1566–67.\n\n\nGaviria Rojas, William, Sudnya Diamos, Keertan Kini, David Kanter, Vijay\nJanapa Reddi, and Cody Coleman. 2022. “The Dollar Street Dataset:\nImages Representing the Geographic and Socioeconomic Diversity of the\nWorld.” Advances in Neural Information Processing\nSystems 35: 12979–90.\n\n\nGebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman\nVaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021.\n“Datasheets for Datasets.” Communications of the\nACM 64 (12): 86–92. https://doi.org/10.1145/3458723.\n\n\nGholami, Dong Kim, Mahoney Yao, and Keutzer. 2021. 
“A Survey of\nQuantization Methods for Efficient Neural Network Inference).” https://doi.org/10.48550/arXiv.2103.13630.\n\n\nGoodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David\nWarde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020.\n“Generative Adversarial Networks.” Communications of\nthe ACM 63 (11): 139–44.\n\n\nGoodyear, Victoria A. 2017. “Social Media, Apps and Wearable\nTechnologies: Navigating Ethical Dilemmas and Procedures.”\nQualitative Research in Sport, Exercise and Health 9 (3):\n285–302.\n\n\nGoogle. 2023. “Three Floating Point Formats.” https://storage.googleapis.com/gweb-cloudblog-publish/images/Three_floating-point_formats.max-624x261.png.\n\n\n———. n.d. “Information Quality & Content Moderation.”\nhttps://blog.google/documents/83/.\n\n\nGordon, Ariel, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang,\nand Edward Choi. 2018. “Morphnet: Fast & Simple\nResource-Constrained Structure Learning of Deep Networks.” In\nProceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, 1586–95.\n\n\nGruslys, Audrunas, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex\nGraves. 2016. “Memory-Efficient Backpropagation Through\nTime.” Advances in Neural Information Processing Systems\n29.\n\n\nGu, Ivy. 2023. “Deep Learning Model\nCompression (Ii) by Ivy\nGu Medium.” https://ivygdy.medium.com/deep-learning-model-compression-ii-546352ea9453.\n\n\nGwennap, Linley. n.d. “Certus-NX\nInnovates General-Purpose\nFPGAs.”\n\n\nHaensch, Wilfried, Tayfun Gokmen, and Ruchir Puri. 2018. “The Next\nGeneration of Deep Learning Hardware: Analog Computing.”\nProceedings of the IEEE 107 (1): 108–22.\n\n\nHan, Song, Huizi Mao, and William J. Dally. 2016. “Deep\nCompression: Compressing Deep Neural Networks with Pruning, Trained\nQuantization and Huffman Coding.” https://arxiv.org/abs/1510.00149.\n\n\nHan, Mao, and Dally. 2016. “Deep Compression: Compressing Deep\nNeural Networks with Pruning, Trained Quantization and Huffman\nCoding.” https://doi.org/10.48550/arXiv.1510.00149.\n\n\nHazan, Avi, and Elishai Ezra Tsur. 2021. “Neuromorphic Analog\nImplementation of Neural Engineering Framework-Inspired Spiking Neuron\nfor High-Dimensional Representation.” Frontiers in\nNeuroscience 15: 627221.\n\n\nHe, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016.\n“Deep Residual Learning for Image Recognition.” In\nProceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, 770–78.\n\n\nHegde, Sumant. 2023. “An Introduction to\nSeparable Convolutions -\nAnalytics Vidhya.” https://www.analyticsvidhya.com/blog/2021/11/an-introduction-to-separable-convolutions/.\n\n\nHendrycks, Dan, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn\nSong. 2021. “Natural Adversarial Examples.” In\nProceedings of the IEEE/CVF Conference on Computer Vision and\nPattern Recognition, 15262–71.\n\n\nHennessy, John L, and David A Patterson. 2019. “A New Golden Age\nfor Computer Architecture.” Commun. ACM 62 (2): 48–60.\n\n\nHinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. “Distilling\nthe Knowledge in a Neural Network.” https://arxiv.org/abs/1503.02531.\n\n\nHolland, Sarah, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia\nChmielinski. 2020. “The Dataset Nutrition Label.” Data\nProtection and Privacy. https://doi.org/10.5040/9781509932771.ch-001.\n\n\nHong, Sanghyun, Nicholas Carlini, and Alexey Kurakin. 
2023.\n“Publishing Efficient on-Device Models Increases Adversarial\nVulnerability.” In 2023 IEEE Conference on Secure and\nTrustworthy Machine Learning (SaTML), 271–90. IEEE.\n\n\nHoward, Andrew G., Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun\nWang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017a.\n“MobileNets: Efficient Convolutional Neural Networks for Mobile\nVision Applications.” arXiv Preprint arXiv:1704.04861.\nhttps://arxiv.org/abs/1704.04861.\n\n\n———. 2017b. “MobileNets: Efficient\nConvolutional Neural Networks for\nMobile Vision\nApplications.” arXiv. https://doi.org/10.48550/arXiv.1704.04861.\n\n\nHuang, Tsung-Ching, Kenjiro Fukuda, Chun-Ming Lo, Yung-Hui Yeh, Tsuyoshi\nSekitani, Takao Someya, and Kwang-Ting Cheng. 2010. “Pseudo-CMOS:\nA Design Style for Low-Cost and Robust Flexible Electronics.”\nIEEE Transactions on Electron Devices 58 (1): 141–50.\n\n\nIandola, Forrest N, Song Han, Matthew W Moskewicz, Khalid Ashraf,\nWilliam J Dally, and Kurt Keutzer. 2016. “SqueezeNet:\nAlexNet-Level Accuracy with 50x Fewer Parameters and< 0.5 MB Model\nSize.” arXiv Preprint arXiv:1602.07360.\n\n\nIgnatov, Andrey, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim\nHartley, and Luc Van Gool. 2018a. “AI Benchmark:\nRunning Deep Neural Networks on Android Smartphones.”\n\n\n———. 2018b. “Ai Benchmark: Running Deep Neural Networks on Android\nSmartphones.” In Proceedings of the European Conference on\nComputer Vision (ECCV) Workshops, 0–0.\n\n\nImani, Mohsen, Abbas Rahimi, and Tajana S Rosing. 2016. “Resistive\nConfigurable Associative Memory for Approximate Computing.” In\n2016 Design, Automation & Test in Europe Conference &\nExhibition (DATE), 1327–32. IEEE.\n\n\nIntelLabs. 2023. “Knowledge Distillation -\nNeural Network Distiller.”\nhttps://intellabs.github.io/distiller/knowledge_distillation.html.\n\n\nISSCC. 2014. “Computing’s Energy Problem (and What We Can Do about\nIt).” https://ieeexplore.ieee.org/document/6757323.\n\n\nJacob, Benoit, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang,\nAndrew Howard, Hartwig Adam, and Dmitry Kalenichenko. 2018.\n“Quantization and Training of Neural Networks for Efficient\nInteger-Arithmetic-Only Inference.” In Proceedings of the\nIEEE Conference on Computer Vision and Pattern Recognition,\n2704–13.\n\n\nJia, Yangqing, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan\nLong, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014.\n“Caffe: Convolutional Architecture for Fast Feature\nEmbedding.” In Proceedings of the 22nd ACM International\nConference on Multimedia, 675–78.\n\n\nJia, Zhe, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza.\n2018. “Dissecting the NVIDIA Volta\nGPU Architecture via\nMicrobenchmarking.” arXiv. http://arxiv.org/abs/1804.06826.\n\n\nJia, Zhenge, Dawei Li, Xiaowei Xu, Na Li, Feng Hong, Lichuan Ping, and\nYiyu Shi. 2023. “Life-Threatening Ventricular Arrhythmia Detection\nChallenge in Implantable Cardioverter–Defibrillators.” Nature\nMachine Intelligence 5 (5): 554–55.\n\n\nJia, Zhihao, Matei Zaharia, and Alex Aiken. 2019. “Beyond Data and\nModel Parallelism for Deep Neural Networks.” Proceedings of\nMachine Learning and Systems 1: 1–13.\n\n\nJiang, Weiwen, Xinyi Zhang, Edwin H. -M. Sha, Lei Yang, Qingfeng Zhuge,\nYiyu Shi, and Jingtong Hu. 2019. “Accuracy Vs. 
Efficiency:\nAchieving Both Through FPGA-Implementation Aware Neural Architecture\nSearch.” https://arxiv.org/abs/1901.11211.\n\n\nJohnson-Roberson, Matthew, Charles Barto, Rounak Mehta, Sharath Nittur\nSridhar, Karl Rosaen, and Ram Vasudevan. 2017. “Driving in the\nMatrix: Can Virtual Worlds Replace Human-Generated Annotations for Real\nWorld Tasks?” 2017 IEEE International Conference on Robotics\nand Automation (ICRA). https://doi.org/10.1109/icra.2017.7989092.\n\n\nJouppi, Norman P., Cliff Young, Nishant Patil, David Patterson, Gaurav\nAgrawal, Raminder Bajwa, Sarah Bates, et al. 2017a. “In-Datacenter\nPerformance Analysis of a Tensor Processing Unit.” In\nProceedings of the 44th Annual International Symposium on Computer\nArchitecture, 1–12. ISCA ’17. New York, NY, USA: Association for\nComputing Machinery. https://doi.org/10.1145/3079856.3080246.\n\n\nJouppi, Norman P, Cliff Young, Nishant Patil, David Patterson, Gaurav\nAgrawal, Raminder Bajwa, Sarah Bates, et al. 2017b. “In-Datacenter\nPerformance Analysis of a Tensor Processing Unit.” In\nProceedings of the 44th Annual International Symposium on Computer\nArchitecture, 1–12.\n\n\nJouppi, Norm, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng\nNai, Nishant Patil, et al. 2023. “TPU V4: An Optically\nReconfigurable Supercomputer for Machine Learning with Hardware Support\nfor Embeddings.” In Proceedings of the 50th Annual\nInternational Symposium on Computer Architecture. ISCA ’23. New\nYork, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3579371.3589350.\n\n\nKairouz, Peter, Sewoong Oh, and Pramod Viswanath. 2015. “Secure\nMulti-Party Differential Privacy.” Advances in Neural\nInformation Processing Systems 28.\n\n\nKarargyris, Alexandros, Renato Umeton, Micah J Sheller, Alejandro\nAristizabal, Johnu George, Anna Wuest, Sarthak Pati, et al. 2023.\n“Federated Benchmarking of Medical Artificial Intelligence with\nMedPerf.” Nature Machine Intelligence 5 (7): 799–810.\n\n\nKiela, Douwe, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger,\nZhengxuan Wu, Bertie Vidgen, et al. 2021. “Dynabench: Rethinking\nBenchmarking in NLP.” arXiv Preprint arXiv:2104.14337.\n\n\nKoh, Pang Wei, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin\nZhang, Akshay Balsubramani, Weihua Hu, et al. 2021. “Wilds: A\nBenchmark of in-the-Wild Distribution Shifts.” In\nInternational Conference on Machine Learning, 5637–64. PMLR.\n\n\nKrishna, Adithya, Srikanth Rohit Nudurupati, Chandana D G, Pritesh\nDwivedi, André van Schaik, Mahesh Mehendale, and Chetan Singh Thakur.\n2023. “RAMAN: A Re-Configurable and Sparse tinyML Accelerator for\nInference on Edge.” https://arxiv.org/abs/2306.06493.\n\n\nKrishnamoorthi. 2018. “Quantizing Deep Convolutional Networks for\nEfficient Inference: A Whitepaper.” arXiv. https://doi.org/10.48550/arXiv.1806.08342.\n\n\nKrishnan, Rayan, Pranav Rajpurkar, and Eric J. Topol. 2022.\n“Self-Supervised Learning in Medicine and Healthcare.”\nNature Biomedical Engineering 6 (12): 1346–52. https://doi.org/10.1038/s41551-022-00914-1.\n\n\nKrishnan, Srivatsan, Amir Yazdanbakhsh, Shvetank Prakash, Jason Jabbour,\nIkechukwu Uchendu, Susobhan Ghosh, Behzad Boroujerdian, et al. 2023.\n“ArchGym: An Open-Source Gymnasium for Machine Learning Assisted\nArchitecture Design.” In Proceedings of the 50th Annual\nInternational Symposium on Computer Architecture, 1–16.\n\n\nKrizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 
2012.\n“Imagenet Classification with Deep Convolutional Neural\nNetworks.” Advances in Neural Information Processing\nSystems 25.\n\n\nKung, H. T., Bradley McDanel, and Sai Qian Zhang. 2018. “Packing\nSparse Convolutional Neural Networks for Efficient Systolic Array\nImplementations: Column Combining Under Joint Optimization.” https://arxiv.org/abs/1811.04770.\n\n\nKung, Hsiang Tsung, and Charles E Leiserson. 1979. “Systolic\nArrays (for VLSI).” In Sparse Matrix Proceedings 1978,\n1:256–82. Society for Industrial and Applied Mathematics, Philadelphia, PA,\nUSA.\n\n\nKuzmin, Andrey, Mart Van Baalen, Yuwei Ren, Markus Nagel, Jorn Peters,\nand Tijmen Blankevoort. 2022. “FP8 Quantization: The Power of the\nExponent.” https://arxiv.org/abs/2208.09225.\n\n\nKwon, Jisu, and Daejin Park. 2021. “Hardware/Software Co-Design\nfor TinyML Voice-Recognition Application on Resource Frugal Edge\nDevices.” Applied Sciences 11 (22). https://doi.org/10.3390/app112211073.\n\n\nKwon, Sun Hwa, and Lin Dong. 2022. “Flexible Sensors and Machine\nLearning for Heart Monitoring.” Nano Energy, 107632.\n\n\nKwon, Young D, Rui Li, Stylianos I Venieris, Jagmohan Chauhan, Nicholas\nD Lane, and Cecilia Mascolo. 2023. “TinyTrain: Deep Neural Network\nTraining at the Extreme Edge.” arXiv Preprint\narXiv:2307.09988.\n\n\nLai, Liangzhen, Naveen Suda, and Vikas Chandra. 2018a. “CMSIS-NN:\nEfficient Neural Network Kernels for Arm Cortex-M CPUs.”\narXiv Preprint arXiv:1801.06601.\n\n\n———. 2018b. “CMSIS-NN: Efficient Neural Network Kernels for Arm\nCortex-M CPUs.” https://arxiv.org/abs/1801.06601.\n\n\nLeCun, Yann, John Denker, and Sara Solla. 1989. “Optimal Brain\nDamage.” Advances in Neural Information Processing\nSystems 2.\n\n\nLi, En, Liekang Zeng, Zhi Zhou, and Xu Chen. 2019. “Edge AI:\nOn-Demand Accelerating Deep Neural Network Inference via Edge\nComputing.” IEEE Transactions on Wireless Communications\n19 (1): 447–57.\n\n\nLi, Mu, David G Andersen, Alexander J Smola, and Kai Yu. 2014.\n“Communication Efficient Distributed Machine Learning with the\nParameter Server.” Advances in Neural Information Processing\nSystems 27.\n\n\nLi, Xiang, Tao Qin, Jian Yang, and Tie-Yan Liu. 2016. “LightRNN:\nMemory and Computation-Efficient Recurrent Neural Networks.”\nAdvances in Neural Information Processing Systems 29.\n\n\nLi, Yuhang, Xin Dong, and Wei Wang. 2020. “Additive Powers-of-Two\nQuantization: An Efficient Non-Uniform Discretization for Neural\nNetworks.” In International Conference on Learning\nRepresentations. https://openreview.net/forum?id=BkgXT24tDS.\n\n\nLi, Zhizhong, and Derek Hoiem. 2017. “Learning Without\nForgetting.” IEEE Transactions on Pattern Analysis and\nMachine Intelligence 40 (12): 2935–47.\n\n\nLin, Ji, Wei-Ming Chen, Yujun Lin, Chuang Gan, Song Han, et al. 2020.\n“MCUNet: Tiny Deep Learning on IoT Devices.” Advances\nin Neural Information Processing Systems 33: 11711–22. https://arxiv.org/abs/2007.10319.\n\n\nLin, Ji, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song\nHan. 2023. “AWQ: Activation-Aware Weight Quantization for LLM\nCompression and Acceleration.” arXiv.\n\n\nLin, Ji, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, and Song\nHan. 2022a. “On-Device Training Under 256KB Memory.”\nAdvances in Neural Information Processing Systems 35: 22941–54.\n\n\n———. 2022b. “On-Device Training Under 256KB Memory.” In\nArXiv.\n\n\nLin, Tsung-Yi, Michael Maire, Serge Belongie, James Hays, Pietro Perona,\nDeva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 
2014.\n“Microsoft COCO: Common Objects in Context.” In\nComputer Vision–ECCV 2014: 13th European Conference, Zurich,\nSwitzerland, September 6-12, 2014, Proceedings, Part V 13, 740–55.\nSpringer.\n\n\nLindholm, Erik, John Nickolls, Stuart Oberman, and John Montrym. 2008.\n“NVIDIA Tesla: A\nUnified Graphics and Computing\nArchitecture.” IEEE Micro 28 (2): 39–55. https://doi.org/10.1109/MM.2008.31.\n\n\nLin, Ji, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song\nHan. 2023. “AWQ:\nActivation-Aware Weight Quantization for LLM Compression and\nAcceleration.” https://doi.org/10.48550/arXiv.2306.00978.\n\n\nLoh, Gabriel H. 2008. “3D-Stacked Memory Architectures for\nMulti-Core Processors.” ACM SIGARCH Computer Architecture\nNews 36 (3): 453–64.\n\n\nLuebke, David. 2008. “CUDA: Scalable Parallel Programming for\nHigh-Performance Scientific Computing.” In 2008 5th IEEE\nInternational Symposium on Biomedical Imaging: From Nano to Macro,\n836–38. https://doi.org/10.1109/ISBI.2008.4541126.\n\n\nLundberg, Scott M, and Su-In Lee. 2017. “A Unified Approach to\nInterpreting Model Predictions.” Advances in Neural\nInformation Processing Systems 30.\n\n\nMaass, Wolfgang. 1997. “Networks of Spiking Neurons: The Third\nGeneration of Neural Network Models.” Neural Networks 10\n(9): 1659–71.\n\n\nMarković, Danijela, Alice Mizrahi, Damien Querlioz, and Julie Grollier.\n2020. “Physics for Neuromorphic Computing.” Nature\nReviews Physics 2 (9): 499–510.\n\n\nMattson, Peter, Christine Cheng, Gregory Diamos, Cody Coleman, Paulius\nMicikevicius, David Patterson, Hanlin Tang, et al. 2020. “MLPerf\nTraining Benchmark.” Proceedings of Machine Learning and\nSystems 2: 336–49.\n\n\nMcMahan, Brendan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise\nAguera y Arcas. 2017. “Communication-Efficient Learning of Deep\nNetworks from Decentralized Data.” In Artificial Intelligence\nand Statistics, 1273–82. PMLR.\n\n\nMiller, David AB. 2000. “Optical Interconnects to Silicon.”\nIEEE Journal of Selected Topics in Quantum Electronics 6 (6):\n1312–17.\n\n\nMittal, Sparsh, Gaurav Verma, Brajesh Kaushik, and Farooq A Khanday.\n2021. “A Survey of SRAM-Based in-Memory Computing Techniques and\nApplications.” Journal of Systems Architecture 119:\n102276.\n\n\nModha, Dharmendra S, Filipp Akopyan, Alexander Andreopoulos,\nRathinakumar Appuswamy, John V Arthur, Andrew S Cassidy, Pallab Datta,\net al. 2023. “Neural Inference at the Frontier of Energy, Space,\nand Time.” Science 382 (6668): 329–35.\n\n\nMoshawrab, Mohammad, Mehdi Adda, Abdenour Bouzouane, Hussein Ibrahim,\nand Ali Raad. 2023. “Reviewing Federated Learning Aggregation\nAlgorithms; Strategies, Contributions, Limitations and Future\nPerspectives.” Electronics 12 (10): 2287.\n\n\nMunshi, Aaftab. 2009. “The OpenCL Specification.” In\n2009 IEEE Hot Chips 21 Symposium (HCS), 1–314. https://doi.org/10.1109/HOTCHIPS.2009.7478342.\n\n\nMusk, Elon et al. 2019. “An Integrated Brain-Machine Interface\nPlatform with Thousands of Channels.” Journal of Medical\nInternet Research 21 (10): e16194.\n\n\nNguyen, Ngoc-Bao, Keshigeyan Chandrasegaran, Milad Abdollahzadeh, and\nNgai-Man Cheung. 2023. “Re-Thinking Model Inversion Attacks\nAgainst Deep Neural Networks.” In Proceedings of the IEEE/CVF\nConference on Computer Vision and Pattern Recognition, 16384–93.\n\n\nNorrie, Thomas, Nishant Patil, Doe Hyun Yoon, George Kurian, Sheng Li,\nJames Laudon, Cliff Young, Norman Jouppi, and David Patterson. 2021.\n“The Design Process for Google’s Training Chips: TPUv2 and\nTPUv3.” IEEE Micro 41 (2): 56–63. 
https://doi.org/10.1109/MM.2021.3058217.\n\n\nNorthcutt, Curtis G, Anish Athalye, and Jonas Mueller. 2021.\n“Pervasive Label Errors in Test Sets Destabilize Machine Learning\nBenchmarks.” arXiv, March. https://doi.org/10.48550/arXiv.2103.14749.\n\n\nOoko, Samson Otieno, Marvin Muyonga Ogore, Jimmy Nsenga, and Marco\nZennaro. 2021. “TinyML in Africa: Opportunities and\nChallenges.” In 2021 IEEE Globecom Workshops (GC\nWkshps), 1–6. IEEE.\n\n\nPan, Sinno Jialin, and Qiang Yang. 2009. “A Survey on Transfer\nLearning.” IEEE Transactions on Knowledge and Data\nEngineering 22 (10): 1345–59.\n\n\nPaszke, Adam, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury,\nGregory Chanan, Trevor Killeen, et al. 2019. “PyTorch: An\nImperative Style, High-Performance Deep Learning Library.”\nAdvances in Neural Information Processing Systems 32.\n\n\nPatterson, David A, and John L Hennessy. 2016. Computer Organization\nand Design ARM Edition: The Hardware Software Interface. Morgan\nKaufmann.\n\n\nPrakash, Shvetank, Tim Callahan, Joseph Bushagour, Colby Banbury, Alan\nV. Green, Pete Warden, Tim Ansell, and Vijay Janapa Reddi. 2023.\n“CFU Playground: Full-Stack Open-Source Framework for\nTiny Machine Learning (TinyML) Acceleration on\nFPGAs.” In 2023 IEEE International\nSymposium on Performance Analysis of Systems and Software\n(ISPASS). IEEE. https://doi.org/10.1109/ispass57527.2023.00024.\n\n\nPushkarna, Mahima, Andrew Zaldivar, and Oddur Kjartansson. 2022.\n“Data Cards: Purposeful and Transparent Dataset Documentation for\nResponsible AI.” 2022 ACM Conference on Fairness,\nAccountability, and Transparency. https://doi.org/10.1145/3531146.3533231.\n\n\nPutnam, Andrew, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros\nConstantinides, John Demme, Hadi Esmaeilzadeh, et al. 2014. “A\nReconfigurable Fabric for Accelerating Large-Scale Datacenter\nServices.” ACM SIGARCH Computer Architecture News 42\n(3): 13–24. https://doi.org/10.1145/2678373.2665678.\n\n\nQi, Chen, Shibo Shen, Rongpeng Li, Zhao Zhifeng, Qing Liu, Jing Liang,\nand Honggang Zhang. 2021. “An Efficient Pruning Scheme of Deep\nNeural Networks for Internet of Things\nApplications.” EURASIP Journal on Advances in Signal\nProcessing 2021 (June). https://doi.org/10.1186/s13634-021-00744-4.\n\n\nRaina, Rajat, Anand Madhavan, and Andrew Y. Ng. 2009. “Large-Scale\nDeep Unsupervised Learning Using Graphics Processors.” In\nProceedings of the 26th Annual\nInternational Conference on\nMachine Learning, 873–80. Montreal Quebec\nCanada: ACM. https://doi.org/10.1145/1553374.1553486.\n\n\nRamcharan, Amanda, Kelsee Baranowski, Peter McCloskey, Babuali Ahmed,\nJames Legg, and David P Hughes. 2017. “Deep Learning for\nImage-Based Cassava Disease Detection.” Frontiers in Plant\nScience 8: 1852.\n\n\nRanganathan, Parthasarathy. 2011. “From Microprocessors to\nNanostores: Rethinking Data-Centric Systems.” Computer (Long\nBeach Calif.) 44 (1): 39–48.\n\n\nRao, Ravi. 2021. www.wevolver.com. https://www.wevolver.com/article/tinyml-unlocks-new-possibilities-for-sustainable-development-technologies.\n\n\nRatner, Alex, Braden Hancock, Jared Dunnmon, Roger Goldman, and\nChristopher Ré. 2018. “Snorkel Metal: Weak Supervision for\nMulti-Task Learning.” Proceedings of the Second Workshop on\nData Management for End-To-End Machine Learning. https://doi.org/10.1145/3209889.3209898.\n\n\nReddi, Vijay Janapa, Christine Cheng, David Kanter, Peter Mattson,\nGuenther Schmuelling, Carole-Jean Wu, Brian Anderson, et al. 
2020.\n“MLPerf Inference Benchmark.” In 2020 ACM/IEEE 47th\nAnnual International Symposium on Computer Architecture (ISCA),\n446–59. IEEE.\n\n\nRibeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. “\"\nWhy Should I Trust You?\" Explaining the Predictions of Any\nClassifier.” In Proceedings of the 22nd ACM SIGKDD\nInternational Conference on Knowledge Discovery and Data Mining,\n1135–44.\n\n\nRosenblatt, Frank. 1957. The Perceptron, a Perceiving and\nRecognizing Automaton Project Para. Cornell Aeronautical\nLaboratory.\n\n\nRoskies, Adina. 2002. “Neuroethics for the New Millenium.”\nNeuron 35 (1): 21–23.\n\n\nRouhani, Bita, Azalia Mirhoseini, and Farinaz Koushanfar. 2017.\n“TinyDL: Just-in-Time Deep Learning Solution for Constrained\nEmbedded Systems.” In, 1–4. https://doi.org/10.1109/ISCAS.2017.8050343.\n\n\nRumelhart, David E, Geoffrey E Hinton, and Ronald J Williams. 1986.\n“Learning Representations by Back-Propagating Errors.”\nNature 323 (6088): 533–36.\n\n\nSamajdar, Ananda, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar\nKrishna. 2018. “SCALE-Sim: Systolic CNN Accelerator\nSimulator.” arXiv Preprint arXiv:1811.02883.\n\n\nSchuman, Catherine D, Shruti R Kulkarni, Maryam Parsa, J Parker\nMitchell, Prasanna Date, and Bill Kay. 2022. “Opportunities for\nNeuromorphic Computing Algorithms and Applications.” Nature\nComputational Science 2 (1): 10–19.\n\n\nSegal, Mark, and Kurt Akeley. 1999. “The OpenGL Graphics System: A\nSpecification (Version 1.1).”\n\n\nSegura Anaya, LH, Abeer Alsadoon, Nectar Costadopoulos, and PWC Prasad.\n2018. “Ethical Implications of User Perceptions of Wearable\nDevices.” Science and Engineering Ethics 24: 1–28.\n\n\nSeide, Frank, and Amit Agarwal. 2016. “CNTK: Microsoft’s\nOpen-Source Deep-Learning Toolkit.” In Proceedings of the\n22nd ACM SIGKDD International Conference on Knowledge Discovery and Data\nMining, 2135–35.\n\n\nSeyedzadeh, Saleh, Farzad Pour Rahimian, Ivan Glesk, and Marc Roper.\n2018. “Machine Learning for Estimation of Building Energy\nConsumption and Performance: A Review.” Visualization in\nEngineering 6: 1–20.\n\n\nShastri, Bhavin J, Alexander N Tait, Thomas Ferreira de Lima, Wolfram HP\nPernice, Harish Bhaskaran, C David Wright, and Paul R Prucnal. 2021.\n“Photonics for Artificial Intelligence and Neuromorphic\nComputing.” Nature Photonics 15 (2): 102–14.\n\n\nSheng, Victor S., and Jing Zhang. 2019. “Machine Learning with\nCrowdsourcing: A Brief Summary of the Past Research and Future\nDirections.” Proceedings of the AAAI Conference on Artificial\nIntelligence 33 (01): 9837–43. https://doi.org/10.1609/aaai.v33i01.33019837.\n\n\nShi, Hongrui, and Valentin Radu. 2022. “Data Selection for\nEfficient Model Update in Federated Learning.” In Proceedings\nof the 2nd European Workshop on Machine Learning and Systems,\n72–78.\n\n\nSuda, Naveen, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma,\nSarma Vrudhula, Jae-sun Seo, and Yu Cao. 2016.\n“Throughput-Optimized OpenCL-Based FPGA Accelerator for\nLarge-Scale Convolutional Neural Networks.” In Proceedings of\nthe 2016 ACM/SIGDA International Symposium on Field-Programmable Gate\nArrays, 16–25.\n\n\nSze, Vivienne, Yu-Hsin Chen, Tien-Ju Yang, and Joel Emer. 2017a.\n“Efficient Processing of Deep Neural Networks: A Tutorial and\nSurvey,” March. https://arxiv.org/abs/1703.09039.\n\n\nSze, Vivienne, Yu-Hsin Chen, Tien-Ju Yang, and Joel S Emer. 
2017b.\n“Efficient Processing of Deep Neural Networks: A Tutorial and\nSurvey.” Proceedings of the IEEE 105 (12): 2295–2329.\n\n\nTan, Mingxing, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler,\nAndrew Howard, and Quoc V Le. 2019. “Mnasnet: Platform-Aware\nNeural Architecture Search for Mobile.” In Proceedings of the\nIEEE/CVF Conference on Computer Vision and Pattern Recognition,\n2820–28.\n\n\nTan, Mingxing, and Quoc V. Le. 2020. “EfficientNet: Rethinking\nModel Scaling for Convolutional Neural Networks.” https://arxiv.org/abs/1905.11946.\n\n\nTang, Xin, Yichun He, and Jia Liu. 2022. “Soft Bioelectronics for\nCardiac Interfaces.” Biophysics Reviews 3 (1).\n\n\nTang, Xin, Hao Shen, Siyuan Zhao, Na Li, and Jia Liu. 2023.\n“Flexible Brain–Computer Interfaces.” Nature\nElectronics 6 (2): 109–18.\n\n\nTeam, The Theano Development, Rami Al-Rfou, Guillaume Alain, Amjad\nAlmahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, et\nal. 2016. “Theano: A Python Framework for Fast Computation of\nMathematical Expressions.” https://arxiv.org/abs/1605.02688.\n\n\n“The Ultimate Guide to Deep Learning Model Quantization and\nQuantization-Aware Training.” n.d. https://deci.ai/quantization-and-quantization-aware-training/.\n\n\nTirtalistyani, Rose, Murtiningrum Murtiningrum, and Rameshwar S Kanwar.\n2022. “Indonesia Rice Irrigation System: Time for\nInnovation.” Sustainability 14 (19): 12477.\n\n\nTokui, Seiya, Kenta Oono, Shohei Hido, and Justin Clayton. 2015.\n“Chainer: A Next-Generation Open Source Framework for Deep\nLearning.” In Proceedings of Workshop on Machine Learning\nSystems (LearningSys) in the Twenty-Ninth Annual Conference on Neural\nInformation Processing Systems (NIPS), 5:1–6.\n\n\nVaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion\nJones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017.\n“Attention Is All You Need.” Advances in Neural\nInformation Processing Systems 30.\n\n\n“Vector-Borne Diseases.” n.d. https://www.who.int/news-room/fact-sheets/detail/vector-borne-diseases.\n\n\nVerma, Naveen, Hongyang Jia, Hossein Valavi, Yinqi Tang, Murat Ozatay,\nLung-Yen Chen, Bonan Zhang, and Peter Deaville. 2019. “In-Memory\nComputing: Advances and Prospects.” IEEE Solid-State Circuits\nMagazine 11 (3): 43–55.\n\n\nVerma, Team Dual_Boot: Swapnil. 2022. “Elephant AI.”\nHackster.io. https://www.hackster.io/dual_boot/elephant-ai-ba71e9.\n\n\nVinuesa, Ricardo, Hossein Azizpour, Iolanda Leite, Madeline Balaam,\nVirginia Dignum, Sami Domisch, Anna Felländer, Simone Daniela Langhans,\nMax Tegmark, and Francesco Fuso Nerini. 2020. “The Role of\nArtificial Intelligence in Achieving the Sustainable Development\nGoals.” Nature Communications 11 (1): 1–10.\n\n\nVivet, Pascal, Eric Guthmuller, Yvain Thonnart, Gael Pillonnet, César\nFuguet, Ivan Miro-Panades, Guillaume Moritz, et al. 2021. “IntAct:\nA 96-Core Processor with Six Chiplets 3D-Stacked on an Active Interposer\nwith Distributed Interconnects and Integrated Power Management.”\nIEEE Journal of Solid-State Circuits 56 (1): 79–97. https://doi.org/10.1109/JSSC.2020.3036341.\n\n\nWang, Tianzhe, Kuan Wang, Han Cai, Ji Lin, Zhijian Liu, Hanrui Wang,\nYujun Lin, and Song Han. 2020. “APQ: Joint Search for Network\nArchitecture, Pruning and Quantization Policy.” In 2020\nIEEE/CVF Conference on Computer Vision and Pattern Recognition\n(CVPR), 2075–84. https://doi.org/10.1109/CVPR42600.2020.00215.\n\n\nWarden, Pete. 2018. 
“Speech Commands: A Dataset for\nLimited-Vocabulary Speech Recognition.” arXiv Preprint\narXiv:1804.03209.\n\n\nWarden, Pete, and Daniel Situnayake. 2019. TinyML: Machine Learning\nwith TensorFlow Lite on Arduino and Ultra-Low-Power\nMicrocontrollers. O’Reilly Media.\n\n\nWeik, Martin H. 1955. A Survey of Domestic\nElectronic Digital Computing\nSystems. Ballistic Research Laboratories.\n\n\nWong, H-S Philip, Heng-Yuan Lee, Shimeng Yu, Yu-Sheng Chen, Yi Wu,\nPang-Shiu Chen, Byoungil Lee, Frederick T Chen, and Ming-Jinn Tsai.\n2012. “Metal–Oxide RRAM.” Proceedings of the IEEE\n100 (6): 1951–70.\n\n\nWu, Bichen, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming\nWu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. 2019.\n“FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable\nNeural Architecture Search.” In Proceedings of the IEEE/CVF\nConference on Computer Vision and Pattern Recognition, 10734–42.\n\n\nWu, Carole-Jean, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha\nArdalani, Kiwan Maeng, Gloria Chang, et al. 2022. “Sustainable AI:\nEnvironmental Implications, Challenges and Opportunities.”\nProceedings of Machine Learning and Systems 4: 795–813.\n\n\nWu, Hao, Patrick Judd, Xiaojie Zhang, Mikhail Isaev, and Paulius\nMicikevicius. 2020. “Integer\nQuantization for Deep Learning Inference: Principles and Empirical\nEvaluation.” https://doi.org/10.48550/arXiv.2004.09602.\n\n\nXiao, Guangxuan, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and\nSong Han. 2023. “SmoothQuant:\nAccurate and Efficient Post-Training Quantization for Large Language\nModels.” https://doi.org/10.48550/arXiv.2211.10438.\n\n\nXie, Cihang, Mingxing Tan, Boqing Gong, Jiang Wang, Alan L Yuille, and\nQuoc V Le. 2020. “Adversarial Examples Improve Image\nRecognition.” In Proceedings of the IEEE/CVF Conference on\nComputer Vision and Pattern Recognition, 819–28.\n\n\nXiong, Siyu, Guoqing Wu, Xitian Fan, Xuan Feng, Zhongcheng Huang, Wei\nCao, Xuegong Zhou, et al. 2021. “MRI-Based Brain\nTumor Segmentation Using FPGA-Accelerated Neural\nNetwork.” BMC Bioinformatics 22 (1): 421. https://doi.org/10.1186/s12859-021-04347-6.\n\n\nXiu, Liming. 2019. “Time Moore: Exploiting Moore’s Law from the\nPerspective of Time.” IEEE Solid-State Circuits Magazine\n11 (1): 39–55.\n\n\nXu, Chen, Jianqiang Yao, Zhouchen Lin, Wenwu Ou, Yuanbin Cao, Zhirong\nWang, and Hongbin Zha. 2018. “Alternating Multi-Bit Quantization\nfor Recurrent Neural Networks.” arXiv Preprint\narXiv:1802.00150.\n\n\nXu, Hu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes,\nVasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph\nFeichtenhofer. 2023. “Demystifying CLIP Data.” arXiv\nPreprint arXiv:2309.16671.\n\n\nXu, Zheng, Yanxiang Zhang, Galen Andrew, Christopher A Choquette-Choo,\nPeter Kairouz, H Brendan McMahan, Jesse Rosenstock, and Yuanbo Zhang.\n2023. “Federated Learning of Gboard Language Models with\nDifferential Privacy.” arXiv Preprint arXiv:2305.18465.\n\n\nYang, Lei, Zheyu Yan, Meng Li, Hyoukjun Kwon, Liangzhen Lai, Tushar\nKrishna, Vikas Chandra, Weiwen Jiang, and Yiyu Shi. 2020.\n“Co-Exploration of Neural Architectures and Heterogeneous ASIC\nAccelerator Designs Targeting Multiple Tasks.” https://arxiv.org/abs/2002.04116.\n\n\nYang, Tien-Ju, Yonghui Xiao, Giovanni Motta, Françoise Beaufays, Rajiv\nMathews, and Mingqing Chen. 2023. “Online Model Compression for\nFederated Learning with Large Models.” In ICASSP 2023-2023\nIEEE International Conference on Acoustics, Speech and Signal Processing\n(ICASSP), 1–5. 
IEEE.\n\n\nYoung, Tom, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2018.\n“Recent Trends in Deep Learning Based Natural Language\nProcessing.” IEEE Computational Intelligence Magazine 13\n(3): 55–75.\n\n\nZennaro, Marco, Brian Plancher, and V Janapa Reddi. 2022. “TinyML:\nApplied AI for Development.” In The UN 7th Multi-Stakeholder\nForum on Science, Technology and Innovation for the Sustainable\nDevelopment Goals, 2022–05.\n\n\nZhang, Chen, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason\nCong. 2015. “Optimizing FPGA-Based Accelerator Design for Deep\nConvolutional Neural Networks.” In Proceedings of the 2015\nACM/SIGDA International Symposium on Field-Programmable Gate\nArrays (FPGA), 161–70.\n\n\nZhang, Li Lyna, Yuqing Yang, Yuhang Jiang, Wenwu Zhu, and Yunxin Liu.\n2020. “Fast Hardware-Aware Neural Architecture Search.” In\nProceedings of the IEEE/CVF Conference on Computer Vision and\nPattern Recognition (CVPR) Workshops.\n\n\nZhang, Tunhou, Hsin-Pai Cheng, Zhenwen Li, Feng Yan, Chengyu Huang, Hai\nLi, and Yiran Chen. 2019. “AutoShrink: A Topology-Aware NAS for\nDiscovering Efficient Neural Architecture.” https://arxiv.org/abs/1911.09251.\n\n\nZhao, Yue, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas\nChandra. 2018. “Federated Learning with Non-IID Data.”\narXiv Preprint arXiv:1806.00582.\n\n\nZhou, Chuteng, Fernando Garcia Redondo, Julian Büchel, Irem Boybat,\nXavier Timoneda Comas, S. R. Nandakumar, Shidhartha Das, Abu Sebastian,\nManuel Le Gallo, and Paul N. Whatmough. 2021. “AnalogNets: ML-HW\nCo-Design of Noise-Robust TinyML Models and Always-on Analog\nCompute-in-Memory Accelerator.” https://arxiv.org/abs/2111.06503.\n\n\nZhu, Hongyu, Mohamed Akrout, Bojian Zheng, Andrew Pelegris, Anand\nJayarajan, Amar Phanishayee, Bianca Schroeder, and Gennady Pekhimenko.\n2018. “Benchmarking and Analyzing Deep Neural Network\nTraining.” In 2018 IEEE International Symposium on Workload\nCharacterization (IISWC), 88–100. IEEE."
},
{
"objectID": "tools.html#hardware-kits",