diff --git a/contents/core/benchmarking/benchmarking.bib b/contents/core/benchmarking/benchmarking.bib index a804ddd8..863e8d38 100644 --- a/contents/core/benchmarking/benchmarking.bib +++ b/contents/core/benchmarking/benchmarking.bib @@ -86,7 +86,7 @@ @article{10.1145/3467017 abstract = {After decades of incentivizing the isolation of hardware, software, and algorithm development, the catalysts for closer collaboration are changing the paradigm.}, journal = {Commun. ACM}, month = nov, -pages = {58–65}, +pages = {58-65}, numpages = {8} } diff --git a/contents/core/benchmarking/benchmarking.qmd b/contents/core/benchmarking/benchmarking.qmd index d67f3399..6c9dc55d 100644 --- a/contents/core/benchmarking/benchmarking.qmd +++ b/contents/core/benchmarking/benchmarking.qmd @@ -36,7 +36,7 @@ This chapter will provide an overview of popular ML benchmarks, best practices f ::: -## Introduction {#sec-benchmarking-ai} +## Introduction Benchmarking provides the essential measurements needed to drive machine learning progress and truly understand system performance. As the physicist Lord Kelvin famously said, "To measure is to know." Benchmarks allow us to quantitatively know the capabilities of different models, software, and hardware. They allow ML developers to measure the inference time, memory usage, power consumption, and other metrics that characterize a system. Moreover, benchmarks create standardized processes for measurement, enabling fair comparisons across different solutions. @@ -62,7 +62,7 @@ This chapter will cover the 3 types of AI benchmarks, the standard metrics, tool ## Historical Context -### Standard Benchmarks +### Performance Benchmarks The evolution of benchmarks in computing vividly illustrates the industry's relentless pursuit of excellence and innovation. In the early days of computing during the 1960s and 1970s, benchmarks were rudimentary and designed for mainframe computers. For example, the [Whetstone benchmark](https://en.wikipedia.org/wiki/Whetstone_(benchmark)), named after the Whetstone ALGOL compiler, was one of the first standardized tests to measure the floating-point arithmetic performance of a CPU. These pioneering benchmarks prompted manufacturers to refine their architectures and algorithms to achieve better benchmark scores. @@ -72,7 +72,25 @@ The 1990s brought the era of graphics-intensive applications and video games. Th The 2000s saw a surge in mobile phones and portable devices like tablets. With portability came the challenge of balancing performance and power consumption. Benchmarks like [MobileMark](https://bapco.com/products/mobilemark-2014/) by BAPCo evaluated speed and battery life. This drove companies to develop more energy-efficient System-on-Chips (SOCs), leading to the emergence of architectures like ARM that prioritized power efficiency. -The focus of the recent decade has shifted towards cloud computing, big data, and artificial intelligence. Cloud service providers like Amazon Web Services and Google Cloud compete on performance, scalability, and cost-effectiveness. Tailored cloud benchmarks like [CloudSuite](http://cloudsuite.ch/) have become essential, driving providers to optimize their infrastructure for better services. Furthermore, benchmarks like [SPEC Power](https://www.spec.org/power/) and [Green500](https://top500.org/lists/green500/) that evaluate performance and power efficiency have grown in popularity to combat the rising carbon foorprint of datacenter computing. 
+The focus of the recent decade has shifted towards cloud computing, big data, and artificial intelligence. Cloud service providers like Amazon Web Services and Google Cloud compete on performance, scalability, and cost-effectiveness. Tailored cloud benchmarks like [CloudSuite](http://cloudsuite.ch/) have become essential, driving providers to optimize their infrastructure for better services.
+
+### Energy Benchmarks
+
+Energy consumption and environmental concerns have gained prominence in recent years, making power (more precisely, energy) benchmarking increasingly important in the industry. This shift began in the mid-2000s when processors and systems started hitting cooling limits, and scaling became a crucial aspect of building large-scale systems due to internet advancements. Since then, energy considerations have expanded to encompass all areas of computing, from personal devices to large-scale data centers.
+
+Power benchmarking aims to measure the energy efficiency of computing systems, evaluating performance in relation to power consumption. This is crucial for several reasons:
+
+* **Environmental impact:** With the growing carbon footprint of the tech industry, there's a pressing need to reduce energy consumption.
+* **Operational costs:** Energy expenses constitute a significant portion of data center operating costs.
+* **Device longevity:** For mobile devices, power efficiency directly impacts battery life and user experience.
+
+Several key benchmarks have emerged in this space:
+
+* **SPEC Power:** Introduced in 2007, [SPEC Power](https://www.spec.org/power/) was one of the first industry-standard benchmarks for evaluating the power and performance characteristics of computer servers.
+* **Green500:** The [Green500](https://top500.org/lists/green500/) list ranks supercomputers by energy efficiency, complementing the performance-focused TOP500 list.
+* **Energy Star:** While not a benchmark per se, the [ENERGY STAR for Computers](https://www.energystar.gov/products/computers) certification program has driven manufacturers to improve the energy efficiency of consumer electronics.
+
+Power benchmarking faces unique challenges, such as accounting for different workloads and system configurations, and measuring power consumption accurately across hardware whose draw ranges from microwatts to megawatts. As AI and edge computing continue to grow, power benchmarking is likely to become even more critical, driving the development of specialized energy-efficient AI hardware and software optimizations.

### Custom Benchmarks

@@ -305,9 +323,9 @@ It is important to carefully consider these factors when designing benchmarks to

Here are some original works that laid the fundamental groundwork for developing systematic benchmarks for training machine learning systems.

-*[MLPerf Training Benchmark](https://github.com/mlcommons/training)*
+* [MLPerf Training Benchmark](https://github.com/mlcommons/training)

-MLPerf is a suite of benchmarks designed to measure the performance of machine learning hardware, software, and services. The MLPerf Training benchmark [@mattson2020mlperf] focuses on the time it takes to train models to a target quality metric. It includes diverse workloads, such as image classification, object detection, translation, and reinforcement learning. @fig-perf-trend highlights the performance improvements in progressive versions of MLPerf Training benchmarks, which have all outpaced Moore's Law. Using standardized benchamrking trends enable us to rigorously showcase the rapid evolution of ML computing.
+MLPerf is a suite of benchmarks designed to measure the performance of machine learning hardware, software, and services. The MLPerf Training benchmark [@mattson2020mlperf] focuses on the time it takes to train models to a target quality metric. It includes diverse workloads, such as image classification, object detection, translation, and reinforcement learning. @fig-perf-trend highlights the performance improvements in progressive versions of MLPerf Training benchmarks, which have all outpaced Moore's Law. Tracking these standardized benchmark results over time enables us to rigorously demonstrate the rapid evolution of ML computing.

![MLPerf Training performance trends. Source: @mattson2020mlperf.](images/png/mlperf_perf_trend.png){#fig-perf-trend}

@@ -317,7 +335,7 @@ Metrics:

* Throughput (examples per second)
* Resource utilization (CPU, GPU, memory, disk I/O)

-*[DAWNBench](https://dawn.cs.stanford.edu/benchmark/)*
+* [DAWNBench](https://dawn.cs.stanford.edu/benchmark/)

DAWNBench [@coleman2017dawnbench] is a benchmark suite focusing on end-to-end deep learning training time and inference performance. It includes common tasks such as image classification and question answering.

@@ -327,7 +345,7 @@ Metrics:

* Inference latency
* Cost (in terms of cloud computing and storage resources)

-*[Fathom](https://github.com/rdadolf/fathom)*
+* [Fathom](https://github.com/rdadolf/fathom)

Fathom [@adolf2016fathom] is a benchmark from Harvard University that evaluates the performance of deep learning models using a diverse set of workloads. These include common tasks such as image classification, speech recognition, and language modeling.

@@ -463,6 +481,18 @@ Get ready to put your AI models to the ultimate test! MLPerf is like the Olympic

:::

+### Measuring Energy Efficiency
+
+As machine learning capabilities expand, both in training and inference, concerns about increased power consumption and its ecological footprint have intensified. Addressing the sustainability of ML systems, a topic explored in more depth in the [Sustainable AI](../sustainable_ai/sustainable_ai.qmd) chapter, has thus become a key priority. This focus on sustainability has led to the development of standardized benchmarks designed to accurately measure energy efficiency. However, standardizing these methodologies poses challenges due to the need to accommodate vastly different scales—from the microwatt consumption of TinyML devices to the megawatt demands of data center training systems. Moreover, ensuring that benchmarking is fair and reproducible requires accommodating the diverse range of hardware configurations and architectures in use today.
+
+One example is the MLPerf Power benchmarking methodology [@tschand2024mlperf], which tackles these challenges by tailoring the methodologies for datacenter, edge inference, and tiny inference systems while measuring power consumption as comprehensively as possible for each scale. This methodology adapts to a variety of hardware, from general-purpose CPUs to specialized AI accelerators, while maintaining uniform measurement principles to ensure that comparisons are both fair and accurate across different platforms.
+
+@fig-power-diagram illustrates the power measurement boundaries for different system scales, from TinyML devices to inference nodes and training racks. Each example highlights the components within the measurement boundary and those outside it. This setup ensures that the benchmark reflects the true energy cost of running ML workloads in real-world scenarios and captures the full spectrum of energy consumption.
+
+![MLPerf Power system measurement diagram. Source: @tschand2024mlperf.](images/png/power_component_diagram.png){#fig-power-diagram}
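+
+To make the derived metrics concrete, here is a minimal sketch (not MLPerf Power's actual implementation; the power readings, sampling interval, and query count are hypothetical) of how periodic power samples can be turned into energy and efficiency numbers:
+
+```python
+# Minimal sketch: derive energy and efficiency metrics from power samples
+# collected by an external meter at a fixed sampling interval.
+
+power_samples_w = [38.2, 41.7, 45.1, 44.8, 40.3]  # hypothetical readings in watts
+sample_interval_s = 1.0                           # seconds between readings
+queries_completed = 1200                          # work finished during the window
+
+# Energy is power integrated over time (here, a simple Riemann sum).
+energy_j = sum(p * sample_interval_s for p in power_samples_w)
+
+run_time_s = sample_interval_s * len(power_samples_w)
+avg_power_w = energy_j / run_time_s
+throughput_qps = queries_completed / run_time_s
+efficiency_qpj = queries_completed / energy_j     # queries per joule
+
+print(f"Energy: {energy_j:.1f} J, average power: {avg_power_w:.1f} W")
+print(f"Throughput: {throughput_qps:.1f} queries/s, efficiency: {efficiency_qpj:.2f} queries/J")
+```
+
+Which components feed `power_samples_w`, whether the wall outlet, the accelerator alone, or an entire rack, is exactly what the measurement boundaries in @fig-power-diagram pin down, and it changes the resulting efficiency numbers substantially.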
+
+It is important to note that optimizing a system for performance may not lead to the most energy-efficient execution. Oftentimes, sacrificing a small amount of performance or accuracy can lead to significant gains in energy efficiency, highlighting the importance of accurately benchmarking power metrics. Future insights from energy efficiency and sustainability benchmarking will enable us to optimize for more sustainable ML systems.
+
### Benchmark Example

To properly illustrate the components of a systems benchmark, we can look at the keyword spotting benchmark in MLPerf Tiny and explain the motivation behind each decision.

@@ -509,14 +539,12 @@ But of all these, the most important challenge is benchmark engineering.

#### Hardware Lottery

-The hardware lottery, first described by @10.1145/3467017, in benchmarking machine learning systems refers to the situation where the success or efficiency of a machine learning model is significantly influenced by the compatibility of the model with the underlying hardware [@chu2021discovering]. In other words, some models perform exceptionally well because they are a good fit for the particular characteristics or capabilities of the hardware they are run on rather than because they are intrinsically superior models.
+The hardware lottery, first described by @10.1145/3467017, refers to the situation where a machine learning model's success or efficiency is significantly influenced by its compatibility with the underlying hardware [@chu2021discovering]. Some models perform exceptionally well not because they are intrinsically superior, but because they are optimized for specific hardware characteristics, such as the parallel processing capabilities of Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs).

For instance, @fig-hardware-lottery compares the performance of models across different hardware platforms. The multi-hardware models show comparable results to "MobileNetV3 Large min" on both the CPU uint8 and GPU configurations. However, these multi-hardware models demonstrate significant performance improvements over the MobileNetV3 Large baseline when run on the EdgeTPU and DSP hardware. This emphasizes the variable efficiency of multi-hardware models in specialized computing environments.

![Accuracy-latency trade-offs of multiple ML models and how they perform on various hardware. Source: @chu2021discovering](images/png/hardware_lottery.png){#fig-hardware-lottery}

-For instance, certain machine learning models may be designed and optimized to take advantage of the parallel processing capabilities of specific hardware accelerators, such as Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs). As a result, these models might show superior performance when benchmarked on such hardware compared to other models that are not optimized for the hardware.
-
Hardware lottery can introduce challenges and biases in benchmarking machine learning systems, as the model's performance is not solely dependent on the model's architecture or algorithm but also on the compatibility and synergies with the underlying hardware.
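+To see how strongly the reported numbers depend on the backend, a minimal sketch along these lines (the model file and provider list are hypothetical, and it assumes the `onnxruntime` and `numpy` packages) times the same model on different execution providers and reports only latency:
+
+```python
+import time
+import numpy as np
+import onnxruntime as ort  # assumed runtime; any engine with selectable backends works
+
+MODEL_PATH = "mobilenet_v3.onnx"  # hypothetical exported model
+PROVIDERS = ["CPUExecutionProvider", "CUDAExecutionProvider"]  # hypothetical availability
+
+dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)
+
+for provider in PROVIDERS:
+    session = ort.InferenceSession(MODEL_PATH, providers=[provider])
+    input_name = session.get_inputs()[0].name
+    session.run(None, {input_name: dummy_input})  # warm-up, excluded from timing
+    start = time.perf_counter()
+    for _ in range(100):
+        session.run(None, {input_name: dummy_input})
+    latency_ms = (time.perf_counter() - start) / 100 * 1000
+    print(f"{provider}: {latency_ms:.2f} ms per inference")
+```
+
+The model is identical in every run; only the execution provider changes, yet the measured latency, and any ranking built on top of it, can shift dramatically.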
This can make it difficult to compare different models fairly and to identify the best model based on its intrinsic merits. It can also lead to a situation where the community converges on models that are a good fit for the popular hardware of the day, potentially overlooking other models that might be superior but incompatible with the current hardware trends. #### Benchmark Engineering @@ -559,15 +587,6 @@ Standardization of benchmarks is another important solution to mitigate benchmar Third-party verification of results can also be valuable in mitigating benchmark engineering. This involves having an independent third party verify the results of a benchmark test to ensure their credibility and reliability. Third-party verification can build confidence in the results and provide a valuable means of validating the performance and capabilities of AI systems. -### Measuring Energy Efficiency -The advancement of ML capabilites has contributed to a significant increase in concerns about the power consumption and resulting ecological impact of ML systems. The critical challenge of sustainable ML systems has driven the development of standardized benchmarking to measure their energy efficiency. However, this introduces the major challenge of providing a standardized methodology to benchmark from the microwatt scale of Tiny systems up to the megawatt scale of datacenter training systems. Furthermore, one must account for the heterogeneity of the hardware and comparability across different systems when designing a fair and reproducable benchmarking methodology. - -One example is the MLPerf Power benchmarking methodology [@tschand2024mlperf], which tackles these challenges by tailoring the methodologies for datacenter, edge inference, and tiny inference systems while measuring power consumption as comprehensively as possible for each scale. The methodology can adapt to different hardware configurations and architectures, from general purpose CPUs to specialized AI accelerators, while maintaining consistent measurement principles to ensure fair comparability. @fig-power-diagram illustrates the measurement considerations for the different scales of systems, which enable measurements to more closely reflect the true energy cost of running ML workloads in real-world scenarios. - -![MLPerf Power system measurement diagram. Source: @tschand2024mlperf.](images/png/power_component_diagram.png){#fig-power-diagram} - -It is important to note that optimizing a system for performance may not lead to the most energy efficient execution. Oftentimes, sacrificing a small amount of performance or accuracy can lead to significant gains in energy efficiency, highlighting the importance of accurately benchmarking power metrics. Future insights from energy efficiency and sustainability benchmarking will enable us to optimize for more sustainable ML systems. - ## Model Benchmarking Benchmarking machine learning models is important for determining the effectiveness and efficiency of various machine learning algorithms in solving specific tasks or problems. By analyzing the results obtained from benchmarking, developers and researchers can identify their models' strengths and weaknesses, leading to more informed decisions on model selection and further optimization. @@ -580,7 +599,7 @@ Machine learning datasets have a rich history and have evolved significantly ove #### MNIST (1998) -The [MNIST dataset](https://www.tensorflow.org/datasets/catalog/mnist), created by Yann LeCun, Corinna Cortes, and Christopher J.C. 
Burges in 1998, can be considered a cornerstone in the history of machine learning datasets. It comprises 70,000 labeled 28x28 pixel grayscale images of handwritten digits (0-9). MNIST has been widely used for benchmarking algorithms in image processing and machine learning as a starting point for many researchers and practitioners.
+The [MNIST dataset](https://www.tensorflow.org/datasets/catalog/mnist), created by Yann LeCun, Corinna Cortes, and Christopher J.C. Burges in 1998, can be considered a cornerstone in the history of machine learning datasets. It comprises 70,000 labeled 28x28 pixel grayscale images of handwritten digits (0-9). MNIST has been widely used for benchmarking algorithms in image processing and machine learning as a starting point for many researchers and practitioners. @fig-mnist shows some examples of handwritten digits.

![MNIST handwritten digits. Source: [Suvanjanprasai.](https://en.wikipedia.org/wiki/File:MnistExamplesModified.png)](images/png/mnist.png){#fig-mnist}

@@ -592,7 +611,7 @@ Fast forward to 2009, and we see the introduction of the [ImageNet dataset](http

The [Common Objects in Context (COCO) dataset](https://cocodataset.org/) [@lin2014microsoft], released in 2014, further expanded the landscape of machine learning datasets by introducing a richer set of annotations. COCO consists of images containing complex scenes with multiple objects, and each image is annotated with object bounding boxes, segmentation masks, and captions, as shown in @fig-coco. This dataset has been instrumental in advancing research in object detection, segmentation, and image captioning.

-![Example images from the COCO dataset. Source: [Coco](https://cocodataset.org/).](images/png/coco.png){#fig-coco}
+![COCO dataset examples. Source: [COCO](https://cocodataset.org/).](images/png/coco.png){#fig-coco}

#### GPT-3 (2020)

@@ -727,7 +746,6 @@ The [Speech Commands dataset](https://arxiv.org/pdf/1804.03209.pdf) and its succ

## Data Benchmarking

For the past several years, AI has focused on developing increasingly sophisticated machine learning models like large language models. The goal has been to create models capable of human-level or superhuman performance on a wide range of tasks by training them on massive datasets. This model-centric approach produced rapid progress, with models attaining state-of-the-art results on many established benchmarks. @fig-superhuman-perf shows the performance of AI systems relative to human performance (marked by the horizontal line at 0) across five applications: handwriting recognition, speech recognition, image recognition, reading comprehension, and language understanding. Over the past decade, the AI performance has surpassed that of humans.
-
![AI vs human performance. Source: @kiela2021dynabench.](images/png/dynabench.png){#fig-superhuman-perf}

However, growing concerns about issues like bias, safety, and robustness persist even in models that achieve high accuracy on standard benchmarks. Additionally, some popular datasets used for evaluating models are beginning to saturate, with models reaching near-perfect performance on existing test splits [@kiela2021dynabench]. As a simple example, there are test images in the classic MNIST handwritten digit dataset that may look indecipherable to most human evaluators but were assigned a label when the dataset was created - models that happen to agree with those labels may appear to exhibit superhuman performance but instead may only be capturing idiosyncrasies of the labeling and acquisition process from the dataset's creation in 1994.
In the same spirit, computer vision researchers now ask, "Are we done with ImageNet?" [@beyer2020we]. This highlights limitations in the conventional model-centric approach of optimizing accuracy on fixed datasets through architectural innovations.