diff --git a/contents/ai_for_good/ai_for_good.qmd b/contents/ai_for_good/ai_for_good.qmd index 9374f5ab..04db8594 100644 --- a/contents/ai_for_good/ai_for_good.qmd +++ b/contents/ai_for_good/ai_for_good.qmd @@ -34,12 +34,12 @@ By aligning AI progress with human values, goals, and ethics, the ultimate goal To give ourselves a framework around which to think about AI for social good, we will be following the UN Sustainable Development Goals (SDGs). The UN SDGs are a collection of 17 global goals, shown in @fig-sdg, adopted by the United Nations in 2015 as part of the 2030 Agenda for Sustainable Development. The SDGs address global challenges related to poverty, inequality, climate change, environmental degradation, prosperity, and peace and justice. +![United Nations Sustainable Development Goals (SDG). Source: [United Nations](https://sdgs.un.org/goals).](https://www.un.org/sustainabledevelopment/wp-content/uploads/2015/12/english_SDG_17goals_poster_all_languages_with_UN_emblem_1.png){#fig-sdg} + What is special about the SDGs is that they are a collection of interlinked objectives designed to serve as a "shared blueprint for peace and prosperity for people and the planet, now and into the future." The SDGs emphasize sustainable development's interconnected environmental, social, and economic aspects by putting sustainability at their center. A recent study [@vinuesa2020role] highlights the influence of AI on all aspects of sustainable development, particularly on the 17 Sustainable Development Goals (SDGs) and 169 targets internationally defined in the 2030 Agenda for Sustainable Development. The study shows that AI can act as an enabler for 134 targets through technological improvements, but it also highlights the challenges of AI on some targets. The study shows that AI can benefit 67 targets when considering AI and societal outcomes. Still, it also warns about the issues related to the implementation of AI in countries with different cultural values and wealth. -![United Nations Sustainable Development Goals (SDG). Source: [United Nations](https://sdgs.un.org/goals).](https://www.un.org/sustainabledevelopment/wp-content/uploads/2015/12/english_SDG_17goals_poster_all_languages_with_UN_emblem_1.png){#fig-sdg} - In our book's context, TinyML could help advance at least some of these SDG goals. * **Goal 1 - No Poverty:** TinyML could help provide low-cost solutions for crop monitoring to improve agricultural yields in developing countries. diff --git a/contents/benchmarking/benchmarking.bib b/contents/benchmarking/benchmarking.bib index 9b0cd0e6..678f09b2 100644 --- a/contents/benchmarking/benchmarking.bib +++ b/contents/benchmarking/benchmarking.bib @@ -71,6 +71,25 @@ @inproceedings{brown2020language year = {2020}, } +@article{10.1145/3467017, +author = {Hooker, Sara}, +title = {The hardware lottery}, +year = {2021}, +issue_date = {December 2021}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +volume = {64}, +number = {12}, +issn = {0001-0782}, +url = {https://doi.org/10.1145/3467017}, +doi = {10.1145/3467017}, +abstract = {After decades of incentivizing the isolation of hardware, software, and algorithm development, the catalysts for closer collaboration are changing the paradigm.}, +journal = {Commun. 
ACM}, +month = nov, +pages = {58–65}, +numpages = {8} +} + @inproceedings{chu2021discovering, author = {Chu, Grace and Arikan, Okan and Bender, Gabriel and Wang, Weijun and Brighton, Achille and Kindermans, Pieter-Jan and Liu, Hanxiao and Akin, Berkin and Gupta, Suyog and Howard, Andrew}, bibsource = {dblp computer science bibliography, https://dblp.org}, diff --git a/contents/benchmarking/benchmarking.qmd b/contents/benchmarking/benchmarking.qmd index febebceb..bb78ea52 100644 --- a/contents/benchmarking/benchmarking.qmd +++ b/contents/benchmarking/benchmarking.qmd @@ -505,15 +505,15 @@ But of all these, the most important challenge is benchmark engineering. #### Hardware Lottery -The ["hardware lottery"](https://arxiv.org/abs/2009.06489) in benchmarking machine learning systems refers to the situation where the success or efficiency of a machine learning model is significantly influenced by the compatibility of the model with the underlying hardware [@chu2021discovering]. In other words, some models perform exceptionally well because they are a good fit for the particular characteristics or capabilities of the hardware they are run on rather than because they are intrinsically superior models. +The hardware lottery, first described by @10.1145/3467017, in benchmarking machine learning systems refers to the situation where the success or efficiency of a machine learning model is significantly influenced by the compatibility of the model with the underlying hardware [@chu2021discovering]. In other words, some models perform exceptionally well because they are a good fit for the particular characteristics or capabilities of the hardware they are run on rather than because they are intrinsically superior models. -![Hardware Lottery.](images/png/hardware_lottery.png){#fig-hardware-lottery} +For instance, @fig-hardware-lottery compares the performance of models across different hardware platforms. The multi-hardware models show comparable results to "MobileNetV3 Large min" on both the CPU uint8 and GPU configurations. However, these multi-hardware models demonstrate significant performance improvements over the MobileNetV3 Large baseline when run on the EdgeTPU and DSP hardware. This emphasizes the variable efficiency of multi-hardware models in specialized computing environments. -For instance, certain machine learning models may be designed and optimized to take advantage of the parallel processing capabilities of specific hardware accelerators, such as Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs). As a result, these models might show superior performance when benchmarked on such hardware compared to other models that are not optimized for the hardware. +![Accuracy-latency trade-offs of multiple ML models and how they perform on various hardware. Source: @chu2021discovering](images/png/hardware_lottery.png){#fig-hardware-lottery} -For example, a 2018 paper introduced a new convolutional neural network architecture for image classification that achieved state-of-the-art accuracy on ImageNet. However, the paper only mentioned that the model was trained on 8 GPUs without specifying the model, memory size, or other relevant details. A follow-up study tried to reproduce the results but found that training the same model on commonly available GPUs achieved 10% lower accuracy, even after hyperparameter tuning. The original hardware likely had far higher memory bandwidth and compute power. 
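To see how strongly reported results can depend on the platform, a quick latency comparison like the sketch below is often revealing. This is an illustrative script, not a result from the studies cited here; the model choice, batch size, and run count are arbitrary assumptions.

```python
import time
import numpy as np
import tensorflow as tf

# An off-the-shelf vision model; pretrained weights are irrelevant for a latency check.
model = tf.keras.applications.MobileNetV2(weights=None)
batch = np.random.rand(8, 224, 224, 3).astype("float32")

def mean_latency_ms(device: str, runs: int = 20) -> float:
    """Average forward-pass time for the same model pinned to one device."""
    with tf.device(device):
        model(batch)  # warm-up run to exclude one-time graph-building overhead
        start = time.perf_counter()
        for _ in range(runs):
            model(batch)
    return (time.perf_counter() - start) / runs * 1000

devices = ["/CPU:0"] + (["/GPU:0"] if tf.config.list_physical_devices("GPU") else [])
for device in devices:
    print(f"{device}: {mean_latency_ms(device):.1f} ms per batch of 8 images")
```

The absolute numbers matter less than the gap between devices: the same architecture can look slow or fast depending entirely on the hardware it happens to be benchmarked on.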
As another example, training times for large language models can vary drastically based on the GPUs used. +For instance, certain machine learning models may be designed and optimized to take advantage of the parallel processing capabilities of specific hardware accelerators, such as Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs). As a result, these models might show superior performance when benchmarked on such hardware compared to other models that are not optimized for the hardware. -The "hardware lottery" can introduce challenges and biases in benchmarking machine learning systems, as the model's performance is not solely dependent on the model's architecture or algorithm but also on the compatibility and synergies with the underlying hardware. This can make it difficult to compare different models fairly and to identify the best model based on its intrinsic merits. It can also lead to a situation where the community converges on models that are a good fit for the popular hardware of the day, potentially overlooking other models that might be superior but incompatible with the current hardware trends. +Hardware lottery can introduce challenges and biases in benchmarking machine learning systems, as the model's performance is not solely dependent on the model's architecture or algorithm but also on the compatibility and synergies with the underlying hardware. This can make it difficult to compare different models fairly and to identify the best model based on its intrinsic merits. It can also lead to a situation where the community converges on models that are a good fit for the popular hardware of the day, potentially overlooking other models that might be superior but incompatible with the current hardware trends. #### Benchmark Engineering @@ -567,19 +567,19 @@ Machine learning datasets have a rich history and have evolved significantly ove #### MNIST (1998) -The [MNIST dataset](https://www.tensorflow.org/datasets/catalog/mnist), created by Yann LeCun, Corinna Cortes, and Christopher J.C. Burges in 1998, can be considered a cornerstone in the history of machine learning datasets. It comprises 70,000 labeled 28x28 pixel grayscale images of handwritten digits (0-9). MNIST has been widely used for benchmarking algorithms in image processing and machine learning as a starting point for many researchers and practitioners. @fig-mnist shows some examples of handwritten digits. +The [MNIST dataset](https://www.tensorflow.org/datasets/catalog/mnist), created by Yann LeCun, Corinna Cortes, and Christopher J.C. Burges in 1998, can be considered a cornerstone in the history of machine learning datasets. It comprises 70,000 labeled 28x28 pixel grayscale images of handwritten digits (0-9). MNIST has been widely used for benchmarking algorithms in image processing and machine learning as a starting point for many researchers and practitioners. ![MNIST handwritten digits. Source: [Suvanjanprasai.](https://en.wikipedia.org/wiki/File:MnistExamplesModified.png)](images/png/mnist.png){#fig-mnist} #### ImageNet (2009) -Fast forward to 2009, and we see the introduction of the [ImageNet dataset](https://www.tensorflow.org/datasets/catalog/imagenet2012), which marked a significant leap in the scale and complexity of datasets. ImageNet consists of over 14 million labeled images spanning more than 20,000 categories. Fei-Fei Li and her team developed it to advance object recognition and computer vision research. 
The dataset became synonymous with the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an annual competition crucial in developing deep learning models, including the famous AlexNet in 2012. +Fast forward to 2009, and we see the introduction of the [ImageNet dataset](https://www.tensorflow.org/datasets/catalog/imagenet2012), which marked a significant leap in the scale and complexity of datasets. ImageNet consists of over 14 million labeled images spanning more than 20,000 categories. Fei-Fei Li and her team developed it to advance object recognition and computer vision research. The dataset became synonymous with the ImageNet [Large Scale Visual Recognition Challenge (LSVRC)](https://www.image-net.org/challenges/LSVRC/), an annual competition crucial in developing deep learning models, including the famous AlexNet in 2012. #### COCO (2014) -The [Common Objects in Context (COCO) dataset](https://cocodataset.org/) [@lin2014microsoft], released in 2014, further expanded the landscape of machine learning datasets by introducing a richer set of annotations. COCO consists of images containing complex scenes with multiple objects, and each image is annotated with object bounding boxes, segmentation masks, and captions. This dataset has been instrumental in advancing research in object detection, segmentation, and image captioning. +The [Common Objects in Context (COCO) dataset](https://cocodataset.org/) [@lin2014microsoft], released in 2014, further expanded the landscape of machine learning datasets by introducing a richer set of annotations. COCO consists of images containing complex scenes with multiple objects, and each image is annotated with object bounding boxes, segmentation masks, and captions, as shown in @fig-coco. This dataset has been instrumental in advancing research in object detection, segmentation, and image captioning. -![Coco dataset. Source: Coco.](images/png/coco.png){#fig-coco} +![Example images from the COCO dataset. Source: [Coco](https://cocodataset.org/).](images/png/coco.png){#fig-coco} #### GPT-3 (2020) @@ -637,7 +637,7 @@ Ensuring fairness in machine learning models, particularly in applications that #### Complexity -##### Parameters* +##### Parameters In the initial stages of machine learning, model benchmarking often relied on parameter counts as a proxy for model complexity. The rationale was that more parameters typically lead to a more complex model, which should, in turn, deliver better performance. However, this approach has proven inadequate as it needs to account for the computational cost associated with processing many parameters. @@ -651,13 +651,11 @@ In light of these limitations, the field has moved towards a more holistic appro The size of a machine learning model is an essential aspect that directly impacts its usability in practical scenarios, especially when computational resources are limited. Traditionally, the number of parameters in a model was often used as a proxy for its size, with the underlying assumption being that more parameters would translate to better performance. However, this simplistic view does not consider the computational cost of processing these parameters. This is where the concept of floating-point operations per second (FLOPs) comes into play, providing a more accurate representation of the computational load a model imposes. -FLOPs measure the number of floating-point operations a model performs to generate a prediction. 
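As a back-of-the-envelope illustration of how parameter count and FLOPs are related but distinct, the sketch below counts both for a small fully connected network. The layer sizes are arbitrary and chosen only for illustration.

```python
def dense_layer_cost(n_in: int, n_out: int) -> tuple[int, int]:
    """Parameter count and approximate FLOPs for one fully connected layer."""
    params = n_in * n_out + n_out   # weights plus biases
    flops = 2 * n_in * n_out        # one multiply and one add per weight
    return params, flops

# A small MLP: 784 -> 256 -> 10 (an MNIST-sized classifier).
layers = [(784, 256), (256, 10)]
total_params = total_flops = 0
for n_in, n_out in layers:
    p, f = dense_layer_cost(n_in, n_out)
    total_params += p
    total_flops += f

print(f"Parameters: {total_params:,}")          # roughly 203K
print(f"FLOPs per inference: {total_flops:,}")  # roughly 406K
```

Real profilers account for convolutions, activations, and memory movement as well, but even this toy calculation shows why two models with similar parameter counts can impose very different computational loads.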
A model with many FLOPs requires substantial computational resources to process the vast number of operations, which may render it impractical for certain applications. Conversely, a model with a lower FLOP count is more lightweight and can be easily deployed in scenarios where computational resources are limited.
-
-@fig-flops, from [@bianco2018benchmark], shows the relationship between Top-1 Accuracy on ImageNet (y-axis), the model's G-FLOPs (x-axis), and the model's parameter count (circle-size).
+FLOPs measure the number of floating-point operations a model performs to generate a prediction. A model with many FLOPs requires substantial computational resources to process the vast number of operations, which may render it impractical for certain applications. Conversely, a model with a lower FLOP count is more lightweight and can be easily deployed in scenarios where computational resources are limited. @fig-flops, from [@bianco2018benchmark], shows the relationship between Top-1 Accuracy on ImageNet (_y_-axis), the model's G-FLOPs (_x_-axis), and the model's parameter count (circle-size).

 ![A graph that depicts the top-1 ImageNet accuracy vs. the FLOP count of a model along with the model's parameter count. The figure shows an overall tradeoff between model complexity and accuracy, although some model architectures are more efficient than others. Source: @bianco2018benchmark.](images/png/model_FLOPS_VS_TOP_1.png){#fig-flops}

-Let's consider an example. BERT [Bidirectional Encoder Representations from Transformers] [@devlin2018bert], a popular natural language processing model, has over 340 million parameters, making it a large model with high accuracy and impressive performance across various tasks. However, the sheer size of BERT, coupled with its high FLOP count, makes it a computationally intensive model that may not be suitable for real-time applications or deployment on edge devices with limited computational capabilities.
+Let's consider an example. BERT---Bidirectional Encoder Representations from Transformers [@devlin2018bert]---is a popular natural language processing model with over 340 million parameters, making it a large model with high accuracy and impressive performance across various tasks. However, the sheer size of BERT, coupled with its high FLOP count, makes it a computationally intensive model that may not be suitable for real-time applications or deployment on edge devices with limited computational capabilities.

 In light of this, there has been a growing interest in developing smaller models that can achieve similar performance levels as their larger counterparts while being more efficient in computational load. DistilBERT, for instance, is a smaller version of BERT that retains 97% of its performance while being 40% smaller in terms of parameter count. The size reduction also translates to a lower FLOP count, making DistilBERT a more practical choice for resource-constrained scenarios.

@@ -717,10 +715,10 @@ The [Speech Commands dataset](https://arxiv.org/pdf/1804.03209.pdf) and its succ

For the past several years, AI has focused on developing increasingly sophisticated machine learning models like large language models. The goal has been to create models capable of human-level or superhuman performance on a wide range of tasks by training them on massive datasets. This model-centric approach produced rapid progress, with models attaining state-of-the-art results on many established benchmarks. 
@fig-superhuman-perf shows the performance of AI systems relative to human performance (marked by the horizontal line at 0) across five applications: handwriting recognition, speech recognition, image recognition, reading comprehension, and language understanding. Over the past decade, the AI performance has surpassed that of humans. -However, growing concerns about issues like bias, safety, and robustness persist even in models that achieve high accuracy on standard benchmarks. Additionally, some popular datasets used for evaluating models are beginning to saturate, with models reaching near-perfect performance on existing test splits [@kiela2021dynabench]. As a simple example, there are test images in the classic MNIST handwritten digit dataset that may look indecipherable to most human evaluators but were assigned a label when the dataset was created - models that happen to agree with those labels may appear to exhibit superhuman performance but instead may only be capturing idiosyncrasies of the labeling and acquisition process from the dataset's creation in 1994. In the same spirit, computer vision researchers now ask, "Are we done with ImageNet?" [@beyer2020we]. This highlights limitations in the conventional model-centric approach of optimizing accuracy on fixed datasets through architectural innovations. - ![AI vs human performane. Source: @kiela2021dynabench.](images/png/dynabench.png){#fig-superhuman-perf} +However, growing concerns about issues like bias, safety, and robustness persist even in models that achieve high accuracy on standard benchmarks. Additionally, some popular datasets used for evaluating models are beginning to saturate, with models reaching near-perfect performance on existing test splits [@kiela2021dynabench]. As a simple example, there are test images in the classic MNIST handwritten digit dataset that may look indecipherable to most human evaluators but were assigned a label when the dataset was created - models that happen to agree with those labels may appear to exhibit superhuman performance but instead may only be capturing idiosyncrasies of the labeling and acquisition process from the dataset's creation in 1994. In the same spirit, computer vision researchers now ask, "Are we done with ImageNet?" [@beyer2020we]. This highlights limitations in the conventional model-centric approach of optimizing accuracy on fixed datasets through architectural innovations. + An alternative paradigm is emerging called data-centric AI. Rather than treating data as static and focusing narrowly on model performance, this approach recognizes that models are only as good as their training data. So, the emphasis shifts to curating high-quality datasets that better reflect real-world complexity, developing more informative evaluation benchmarks, and carefully considering how data is sampled, preprocessed, and augmented. The goal is to optimize model behavior by improving the data rather than just optimizing metrics on flawed datasets. Data-centric AI critically examines and enhances the data itself to produce beneficial AI. This reflects an important evolution in mindset as the field addresses the shortcomings of narrow benchmarking. This section will explore the key differences between model-centric and data-centric approaches to AI. This distinction has important implications for how we benchmark AI systems. 
Specifically, we will see how focusing on data quality and Efficiency can directly improve machine learning performance as an alternative to optimizing model architectures solely. The data-centric approach recognizes that models are only as good as their training data. So, enhancing data curation, evaluation benchmarks, and data handling processes can produce AI systems that are safer, fairer, and more robust. Rethinking benchmarking to prioritize data alongside models represents an important evolution as the field strives to deliver trustworthy real-world impact. diff --git a/contents/data_engineering/data_engineering.qmd b/contents/data_engineering/data_engineering.qmd index 2d88b7de..4613a084 100644 --- a/contents/data_engineering/data_engineering.qmd +++ b/contents/data_engineering/data_engineering.qmd @@ -45,18 +45,20 @@ We begin by discussing data collection: Where do we source data, and how do we g ## Problem Definition -In many machine learning domains, sophisticated algorithms take center stage, while the fundamental importance of data quality is often overlooked. This neglect gives rise to ["Data Cascades"](https://research.google/pubs/pub49953/) by @sambasivan2021everyone (see @fig-cascades)—events where lapses in data quality compound, leading to negative downstream consequences such as flawed predictions, project terminations, and even potential harm to communities. In @fig-cascades, we have an illustration of potential data pitfalls at every stage and how they influence the entire process down the line. The influence of data collection errors is especially pronounced. Any lapses in this stage will become apparent at later stages (in model evaluation and deployment) and might lead to costly consequences, such as abandoning the entire model and restarting anew. Therefore, investing in data engineering techniques from the onset will help us detect errors early. +In many machine learning domains, sophisticated algorithms take center stage, while the fundamental importance of data quality is often overlooked. This neglect gives rise to ["Data Cascades"](https://research.google/pubs/pub49953/) by @sambasivan2021everyone—events where lapses in data quality compound, leading to negative downstream consequences such as flawed predictions, project terminations, and even potential harm to communities. + +@fig-cascades illustrates these potential data pitfalls at every stage and how they influence the entire process down the line. The influence of data collection errors is especially pronounced. As depicted in the figure, any lapses in this initial stage will become apparent at later stages (in model evaluation and deployment) and might lead to costly consequences, such as abandoning the entire model and restarting anew. Therefore, investing in data engineering techniques from the onset will help us detect errors early, mitigating the cascading effects illustrated in the figure. ![Data cascades: compounded costs. Source: @sambasivan2021everyone.](images/png/data_engineering_cascades.png){#fig-cascades} Despite many ML professionals recognizing the importance of data, numerous practitioners report facing these cascades. This highlights a systemic issue: while the allure of developing advanced models remains, data often needs to be more appreciated. -Take, for example, Keyword Spotting (KWS) (see @fig-keywords). KWS is a prime example of TinyML in action and is a critical technology behind voice-enabled interfaces on endpoint devices such as smartphones. 
Typically functioning as lightweight wake-word engines, these systems are consistently active, listening for a specific phrase to trigger further actions. When we say "OK, Google" or "Alexa," this initiates a process on a microcontroller embedded within the device. Despite their limited resources, these microcontrollers play an important role in enabling seamless voice interactions with devices, often operating in environments with high ambient noise. The uniqueness of the wake word helps minimize false positives, ensuring that the system is not triggered inadvertently. - -It is important to appreciate that these keyword-spotting technologies are not isolated; they integrate seamlessly into larger systems, processing signals continuously while managing low power consumption. These systems extend beyond simple keyword recognition, evolving to facilitate diverse sound detections, such as glass breaking. This evolution is geared towards creating intelligent devices capable of understanding and responding to vocal commands, heralding a future where even household appliances can be controlled through voice interactions. +Keyword Spotting (KWS) provides an excellent example of TinyML in action, as illustrated in @fig-keywords. This technology is critical for voice-enabled interfaces on endpoint devices such as smartphones. Typically functioning as lightweight wake-word engines, KWS systems are consistently active, listening for a specific phrase to trigger further actions. As depicted in the figure, when we say "OK, Google" or "Alexa," this initiates a process on a microcontroller embedded within the device. Despite their limited resources, these microcontrollers play an important role in enabling seamless voice interactions with devices, often operating in environments with high ambient noise. The uniqueness of the wake word, as shown in the figure, helps minimize false positives, ensuring that the system is not triggered inadvertently. ![Keyword Spotting example: interacting with Alexa. Source: Amazon.](images/png/data_engineering_kws.png){#fig-keywords} +It is important to appreciate that these keyword-spotting technologies are not isolated; they integrate seamlessly into larger systems, processing signals continuously while managing low power consumption. These systems extend beyond simple keyword recognition, evolving to facilitate diverse sound detections, such as glass breaking. This evolution is geared towards creating intelligent devices capable of understanding and responding to vocal commands, heralding a future where even household appliances can be controlled through voice interactions. + Building a reliable KWS model is a complex task. It demands a deep understanding of the deployment scenario, encompassing where and how these devices will operate. For instance, a KWS model's effectiveness is not just about recognizing a word; it's about discerning it among various accents and background noises, whether in a bustling cafe or amid the blaring sound of a television in a living room or a kitchen where these devices are commonly found. It's about ensuring that a whispered "Alexa" in the dead of night or a shouted "OK Google" in a noisy marketplace are recognized with equal precision. Moreover, many current KWS voice assistants support a limited number of languages, leaving a substantial portion of the world's linguistic diversity unrepresented. This limitation is partly due to the difficulty in gathering and monetizing data for languages spoken by smaller populations. 
The long-tail distribution of languages implies that many languages have limited data, making the development of supportive technologies challenging. @@ -125,12 +127,12 @@ In this context, using KWS as an example, we can break each of the steps out as ### Keyword Spotting with TensorFlow Lite Micro -Explore a hands-on guide for building and deploying Keyword Spotting (KWS) systems using TensorFlow Lite Micro. Follow steps from data collection to model training and deployment to microcontrollers. Learn to create efficient KWS models that recognize specific keywords amidst background noise. Perfect for those interested in machine learning on embedded systems. Unlock the potential of voice-enabled devices with TensorFlow Lite Micro! +Explore a hands-on guide for building and deploying Keyword Spotting systems using TensorFlow Lite Micro. Follow steps from data collection to model training and deployment to microcontrollers. Learn to create efficient KWS models that recognize specific keywords amidst background noise. Perfect for those interested in machine learning on embedded systems. Unlock the potential of voice-enabled devices with TensorFlow Lite Micro! [![](https://colab.research.google.com/assets/colab-badge.png)](https://colab.research.google.com/drive/17I7GL8WTieGzXYKRtQM2FrFi3eLQIrOM) ::: -The current chapter underscores the essential role of data quality in ML, using Keyword Spotting (KWS) systems as an example. It outlines key steps, from problem definition to stakeholder engagement, emphasizing iterative feedback. The forthcoming chapter will dig deeper into data quality management, discussing its consequences and future trends, focusing on the importance of high-quality, diverse data in AI system development, addressing ethical considerations and data sourcing methods. +The current chapter underscores the essential role of data quality in ML, using Keyword Spotting systems as an example. It outlines key steps, from problem definition to stakeholder engagement, emphasizing iterative feedback. The forthcoming chapter will dig deeper into data quality management, discussing its consequences and future trends, focusing on the importance of high-quality, diverse data in AI system development, addressing ethical considerations and data sourcing methods. ## Data Sourcing @@ -144,10 +146,12 @@ The quality assurance that comes with popular pre-existing datasets is important While platforms like Kaggle and UCI Machine Learning Repository are invaluable resources, it's essential to understand the context in which the data was collected. Researchers should be wary of potential overfitting when using popular datasets, as multiple models might have been trained on them, leading to inflated performance metrics. Sometimes, these [datasets do not reflect the real-world data](https://venturebeat.com/uncategorized/3-big-problems-with-datasets-in-ai-and-machine-learning/). -In addition, bias, validity, and reproducibility issues may exist in these datasets, and there has been a growing awareness of these issues in recent years. Furthermore, using the same dataset to train different models as shown in @fig-misalignment can sometimes create misalignment: training multiple models using the same dataset results in a 'misalignment' between the models and the world, in which an entire ecosystem of models reflects only a narrow subset of the real-world data. +In recent years, there has been growing awareness of bias, validity, and reproducibility issues that may exist in machine learning datasets. 
@fig-misalignment illustrates another critical concern: the potential for misalignment when using the same dataset to train different models. ![Training different models on the same dataset. Source: (icons from left to right: Becris; Freepik; Freepik; Paul J; SBTS2018).](images/png/dataset_myopia.png){#fig-misalignment} +As shown in @fig-misalignment, training multiple models using the same dataset can result in a 'misalignment' between the models and the world. This misalignment creates an entire ecosystem of models that reflects only a narrow subset of the real-world data. Such a scenario can lead to limited generalization and potentially biased outcomes across various applications using these models. + ### Web Scraping Web scraping refers to automated techniques for extracting data from websites. It typically involves sending HTTP requests to web servers, retrieving HTML content, and parsing that content to extract relevant information. Popular tools and frameworks for web scraping include Beautiful Soup, Scrapy, and Selenium. These tools offer different functionalities, from parsing HTML content to automating web browser interactions, especially for websites that load content dynamically using JavaScript. @@ -201,7 +205,11 @@ Thus, while crowdsourcing can work well in many cases, the specialized needs of ### Synthetic Data -Synthetic data generation can be useful for addressing some of the data collection limitations. It involves creating data that wasn't originally captured or observed but is generated using algorithms, simulations, or other techniques to resemble real-world data. As shown in @fig-synthetic-data, synthetic data is merged with historical data and then used as input for model training. It has become a valuable tool in various fields, particularly when real-world data is scarce, expensive, or ethically challenging (e.g., TinyML). Various techniques, such as Generative Adversarial Networks (GANs), can produce high-quality synthetic data almost indistinguishable from real data. These techniques have advanced significantly, making synthetic data generation increasingly realistic and reliable. +Synthetic data generation can be a valuable solution for addressing data collection limitations. @fig-synthetic-data illustrates how this process works: synthetic data is merged with historical data to create a larger, more diverse dataset for model training. + +![Increasing training data size with synthetic data generation. Source: [AnyLogic](https://www.anylogic.com/features/artificial-intelligence/synthetic-data/).](images/jpg/synthetic_data.jpg){#fig-synthetic-data} + +As shown in the figure, synthetic data involves creating information that wasn't originally captured or observed but is generated using algorithms, simulations, or other techniques to resemble real-world data. This approach has become particularly valuable in fields where real-world data is scarce, expensive, or ethically challenging to obtain, such as in TinyML applications. Various techniques, including Generative Adversarial Networks (GANs), can produce high-quality synthetic data almost indistinguishable from real data. These methods have advanced significantly, making synthetic data generation increasingly realistic and reliable. More real-world data may need to be available for analysis or training machine learning models in many domains, especially emerging ones. Synthetic data can fill this gap by producing large volumes of data that mimic real-world scenarios. 
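One simple flavor of synthetic data generation is sketched below, under the assumption that a handful of real recordings of a rare target sound and a pool of background-noise clips are already available as NumPy arrays; new training examples are synthesized by overlaying the two at random positions and signal-to-noise ratios.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def synthesize_example(event: np.ndarray, background: np.ndarray,
                       snr_db: float) -> np.ndarray:
    """Overlay a short event clip onto background audio at a target SNR."""
    start = rng.integers(0, len(background) - len(event))
    mix = background.copy()
    # Scale the event so the signal-to-noise ratio matches snr_db.
    event_power = np.mean(event ** 2)
    noise_power = np.mean(background ** 2) + 1e-12
    gain = np.sqrt(noise_power / (event_power + 1e-12) * 10 ** (snr_db / 10))
    mix[start:start + len(event)] += gain * event
    return np.clip(mix, -1.0, 1.0)

# Placeholder 1-second and 4-second clips at 16 kHz; real recordings would be loaded from disk.
event = rng.normal(0, 0.1, 16000).astype("float32")
background = rng.normal(0, 0.05, 16000 * 4).astype("float32")

synthetic_dataset = [synthesize_example(event, background, snr_db=rng.uniform(0, 20))
                     for _ in range(100)]
```

A few real clips can thus be expanded into hundreds of varied examples, which is exactly the gap-filling role described here.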
For instance, detecting the sound of breaking glass might be challenging in security applications where a TinyML device is trying to identify break-ins. Collecting real-world data would require breaking numerous windows, which is impractical and costly. @@ -215,8 +223,6 @@ Many embedded use cases deal with unique situations, such as manufacturing plant While synthetic data offers numerous advantages, it is essential to use it judiciously. Care must be taken to ensure that the generated data accurately represents the underlying real-world distributions and does not introduce unintended biases. -![Increasing training data size with synthetic data generation. Source: [AnyLogic](https://www.anylogic.com/features/artificial-intelligence/synthetic-data/).](images/jpg/synthetic_data.jpg){#fig-synthetic-data} - :::{#exr-sd .callout-caution collapse="true"} ### Synthetic Data @@ -247,12 +253,16 @@ Data sourcing and data storage go hand in hand, and data must be stored in a for : Comparative overview of the database, data warehouse, and data lake. {#tbl-storage .striped .hover} -The stored data is often accompanied by metadata, defined as 'data about data.' It provides detailed contextual information about the data, such as means of data creation, time of creation, attached data use license, etc. For example, [Hugging Face](https://huggingface.co/) has [Dataset Cards](https://huggingface.co/docs/hub/datasets-cards). To promote responsible data use, dataset creators should disclose potential biases through the dataset cards. These cards can educate users about a dataset's contents and limitations. The cards also give vital context on appropriate dataset usage by highlighting biases and other important details. Having this type of metadata can also allow fast retrieval if structured properly. Once the model is developed and deployed to edge devices, the storage systems can continue to store incoming data, model updates, or analytical results. @fig-data-collection showcases the pillars of data collection and their collection methods. +The stored data is often accompanied by metadata, defined as 'data about data. It provides detailed contextual information about the data, such as means of data creation, time of creation, attached data use license, etc. @fig-data-collection illustrates the key pillars of data collection and their associated methods, highlighting the importance of structured data management. For example, [Hugging Face](https://huggingface.co/) has implemented [Dataset Cards](https://huggingface.co/docs/hub/datasets-cards) to promote responsible data use. These cards, which align with the documentation pillar shown in @fig-data-collection, allow dataset creators to disclose potential biases and educate users about a dataset's contents and limitations. + +The dataset cards provide important context on appropriate dataset usage by highlighting biases and other important details. Having this type of structured metadata can also allow for fast retrieval, aligning with the efficient data management principles illustrated in the figure. Once the model is developed and deployed to edge devices, the storage systems can continue to store incoming data, model updates, or analytical results, potentially utilizing methods from multiple pillars shown in @fig-data-collection. This ongoing data collection and management process ensures that the model remains up-to-date and relevant in its operational environment. ![Pillars of data collection. 
Source: [Alexsoft](https://www.altexsoft.com/blog/data-collection-machine-learning/)](images/png/datacollection.png){#fig-data-collection} **Data Governance:** With a large amount of data storage, it is also imperative to have policies and practices (i.e., data governance) that help manage data during its life cycle, from acquisition to disposal. Data governance outlines how data is managed and includes making key decisions about data access and control. @fig-governance illustrates the different domains involved in data governance. It involves exercising authority and making decisions concerning data to uphold its quality, ensure compliance, maintain security, and derive value. Data governance is operationalized by developing policies, incentives, and penalties, cultivating a culture that perceives data as a valuable asset. Specific procedures and assigned authorities are implemented to safeguard data quality and monitor its utilization and related risks. +![An overview of the data governance framework. Source: [StarCIO.](https://www.groundwatergovernance.org/the-importance-of-governance-for-all-stakeholders/).](images/jpg/data_governance.jpg){#fig-governance} + Data governance utilizes three integrative approaches: planning and control, organizational, and risk-based. * **The planning and control approach**, common in IT, aligns business and technology through annual cycles and continuous adjustments, focusing on policy-driven, auditable governance. @@ -261,8 +271,6 @@ Data governance utilizes three integrative approaches: planning and control, org * **The risk-based approach**, intensified by AI advancements, focuses on identifying and managing inherent risks in data and algorithms. It especially addresses AI-specific issues through regular assessments and proactive risk management strategies, allowing for incidental and preventive actions to mitigate undesired algorithm impacts. -![An overview of the data governance framework. Source: [StarCIO.](https://www.groundwatergovernance.org/the-importance-of-governance-for-all-stakeholders/).](images/jpg/data_governance.jpg){#fig-governance} - Some examples of data governance across different sectors include: * **Medicine:** [Health Information Exchanges(HIEs)](https://www.healthit.gov/topic/health-it-and-health-information-exchange-basics/what-hie) enable the sharing of health information across different healthcare providers to improve patient care. They implement strict data governance practices to maintain data accuracy, integrity, privacy, and security, complying with regulations such as the [Health Insurance Portability and Accountability Act (HIPAA)](https://www.cdc.gov/phlp/publications/topic/hipaa.html). Governance policies ensure that patient data is only shared with authorized entities and that patients can control access to their information. @@ -302,7 +310,10 @@ Data often comes from diverse sources and can be unstructured or semi-structured * Using techniques like dimensionality reduction Data validation serves a broader role than ensuring adherence to certain standards, like preventing temperature values from falling below absolute zero. These issues arise in TinyML because sensors may malfunction or temporarily produce incorrect readings; such transients are not uncommon. Therefore, it is imperative to catch data errors early before propagating through the data pipeline. 
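A few lines of pandas are enough to sketch what such early checks can look like; the column names, thresholds, and values below are made up for illustration and are not a production validation suite.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings; column names are illustrative.
df = pd.DataFrame({
    "temperature_c": [21.4, 22.0, np.nan, -300.0, 23.1, 95.0],
    "humidity_pct": [40.1, 41.0, 39.5, 38.8, np.nan, 42.3],
})

# 1. Range validation: physically impossible values become missing.
df.loc[df["temperature_c"] < -273.15, "temperature_c"] = np.nan

# 2. Outlier detection: flag readings far from the column mean.
z = (df["temperature_c"] - df["temperature_c"].mean()) / df["temperature_c"].std()
df["temperature_outlier"] = z.abs() > 3

# 3. Missing-value handling: simple mean imputation.
for col in ["temperature_c", "humidity_pct"]:
    df[col] = df[col].fillna(df[col].mean())

print(df)
```

Catching the impossible -300 °C reading at this stage is far cheaper than discovering it after a model has been trained and deployed.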
Rigorous validation processes, including verifying the initial annotation practices, detecting outliers, and handling missing values through techniques like mean imputation, contribute directly to the quality of datasets. This, in turn, impacts the performance, fairness, and safety of the models trained on them.
-Let's take a look at @fig-data-engineering-kws2 for an example of a data processing pipeline. In the context of TinyML, the Multilingual Spoken Words Corpus (MSWC) is an example of data processing pipelines—systematic and automated workflows for data transformation, storage, and processing. The input data (which's a collection of short recordings) goes through several phases of processing, such as audio-word alignement and keyword extraction. By streamlining the data flow, from raw data to usable datasets, data pipelines improve productivity and facilitate the rapid development of machine learning models. The MSWC is an expansive and expanding collection of audio recordings of spoken words in 50 different languages, which are collectively used by over 5 billion people. This dataset is intended for academic study and business uses in areas like keyword identification and speech-based search. It is openly licensed under Creative Commons Attribution 4.0 for broad usage.
+
+Let's take a look at @fig-data-engineering-kws2 for an example of a data processing pipeline. In the context of TinyML, the Multilingual Spoken Words Corpus (MSWC) is an example of data processing pipelines—systematic and automated workflows for data transformation, storage, and processing. The input data (which is a collection of short recordings) goes through several phases of processing, such as audio-word alignment and keyword extraction.
+
+By streamlining the data flow from raw data to usable datasets, pipelines like MSWC improve productivity and facilitate the rapid development of machine learning models. The MSWC is an expansive and expanding collection of audio recordings of spoken words in 50 different languages, which are collectively used by over 5 billion people. This dataset is intended for academic study and business uses in areas like keyword identification and speech-based search. It is openly licensed under Creative Commons Attribution 4.0 for broad usage.

 ![An overview of the Multilingual Spoken Words Corpus (MSWC) data processing pipeline. Source: @mazumder2021multilingual.](images/png/data_engineering_kws2.png){#fig-data-engineering-kws2}

@@ -333,7 +344,8 @@ Labels capture information about key tasks or concepts. @fig-labels includes som

 Unless focused on self-supervised learning, a dataset will likely provide labels addressing one or more tasks of interest. Given their unique resource constraints, dataset creators must consider what information labels should capture and how they can practically obtain the necessary labels. Creators must first decide what type(s) of content labels should capture. For example, a creator interested in car detection would want to label cars in their dataset. Still, they might also consider whether to simultaneously collect labels for other tasks that the dataset could potentially be used for, such as pedestrian detection.

-Additionally, annotators can provide metadata that provides insight into how the dataset represents different characteristics of interest (see @sec-data-transparency). 
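In code, such metadata is often just a small structured record attached to each dataset split. The sketch below is a hypothetical, simplified example; every field name and number is invented for illustration and does not reproduce any real dataset's schema.

```python
# A hypothetical, simplified metadata record for one language split.
dataset_metadata = {
    "language": "sw",                      # ISO 639-1 code (Swahili)
    "total_hours_validated": 152.3,
    "avg_clip_duration_sec": 4.7,
    "pct_clips_validated": 88.5,
    "demographics": {
        "age_range": {"19-29": 0.41, "30-39": 0.27, "40+": 0.18, "unknown": 0.14},
        "gender": {"male": 0.55, "female": 0.31, "unreported": 0.14},
    },
    "license": "CC-0",
    "known_limitations": "Urban speakers over-represented; few child voices.",
}

# Downstream tooling can then audit or filter datasets programmatically.
if dataset_metadata["pct_clips_validated"] < 90:
    print(f"Warning: only {dataset_metadata['pct_clips_validated']}% of clips validated.")
```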
The Common Voice dataset, for example, includes various types of metadata that provide information about the speakers, recordings, and dataset quality for each language represented [@ardila2020common]. They include demographic splits showing the number of recordings by speaker age range and gender. This allows us to see who contributed recordings for each language. They also include statistics like average recording duration and total hours of validated recordings. These give insights into the nature and size of the datasets for each language. +Additionally, annotators can provide metadata that provides insight into how the dataset represents different characteristics of interest (see @sec-data-transparency). The Common Voice dataset, for example, includes various types of metadata that provide information about the speakers, recordings, and dataset quality for each language represented [@ardila2020common]. They include demographic splits showing the number of recordings by speaker age range and gender. This allows us to see who contributed recordings for each language. They also include statistics like average recording duration and total hours of validated recordings. These give insights into the nature and size of the datasets for each language. + Additionally, quality control metrics like the percentage of recordings that have been validated are useful to know how complete and clean the datasets are. The metadata also includes normalized demographic splits scaled to 100% for comparison across languages. This highlights representation differences between higher and lower resource languages. Next, creators must determine the format of those labels. For example, a creator interested in car detection might choose between binary classification labels that say whether a car is present, bounding boxes that show the general locations of any cars, or pixel-wise segmentation labels that show the exact location of each car. Their choice of label format may depend on their use case and resource constraints, as finer-grained labels are typically more expensive and time-consuming to acquire. @@ -376,14 +388,14 @@ ML has an insatiable demand for data. Therefore, more data is needed. This raise * **Active learning:** AI models can identify the most informative data points in a dataset, which can then be prioritized for human annotation. This can help improve the labeled dataset's quality while reducing the overall annotation time. * **Quality control:** AI models can identify and flag potential errors in human annotations, helping to ensure the accuracy and consistency of the labeled dataset. +![Strategies for acquiring additional labeled training data. Source: [Standford AI Lab.](https://ai.stanford.edu/blog/weak-supervision/)](https://ai.stanford.edu/blog//assets/img/posts/2019-03-03-weak_supervision/WS_mapping.png){#fig-weak-supervision} + Here are some examples of how AI-assisted annotation has been proposed to be useful: * **Medical imaging:** AI-assisted annotation labels medical images, such as MRI scans and X-rays [@krishnan2022selfsupervised]. Carefully annotating medical datasets is extremely challenging, especially at scale, since domain experts are scarce and become costly. This can help to train AI models to diagnose diseases and other medical conditions more accurately and efficiently. * **Self-driving cars:** AI-assisted annotation is being used to label images and videos from self-driving cars. 
This can help to train AI models to identify objects on the road, such as other vehicles, pedestrians, and traffic signs. * **Social media:** AI-assisted annotation labels social media posts like images and videos. This can help to train AI models to identify and classify different types of content, such as news, advertising, and personal posts. -![Strategies for acquiring additional labeled training data. Source: [Standford AI Lab.](https://ai.stanford.edu/blog/weak-supervision/)](https://ai.stanford.edu/blog//assets/img/posts/2019-03-03-weak_supervision/WS_mapping.png){#fig-weak-supervision} - ## Data Version Control Production systems are perpetually inundated with fluctuating and escalating volumes of data, prompting the rapid emergence of numerous data replicas. This increasing data serves as the foundation for training machine learning models. For instance, a global sales company engaged in sales forecasting continuously receives consumer behavior data. Similarly, healthcare systems formulating predictive models for disease diagnosis are consistently acquiring new patient data. TinyML applications, such as keyword spotting, are highly data-hungry regarding the amount of data generated. Consequently, meticulous tracking of data versions and the corresponding model performance is imperative. @@ -394,8 +406,7 @@ Data Version Control offers a structured methodology to handle alterations and v **Collaboration and Efficiency:** Easy access to different dataset versions in one place can improve data sharing of specific checkpoints and enable efficient collaboration. -**Reproducibility:** Data version control allows for tracking the performance of models concerning different versions of the data, -and therefore enabling reproducibility. +**Reproducibility:** Data version control allows for tracking the performance of models concerning different versions of the data, and therefore enabling reproducibility. **Key Concepts** @@ -411,7 +422,7 @@ With data version control in place, we can track the changes shown in @fig-data- **Popular Data Version Control Systems** -[**[DVC]{.underline}**](https://dvc.org/doc): It stands for Data Version Control in short and is an open-source, lightweight tool that works on top of Git Hub and supports all kinds of data formats. It can seamlessly integrate into the workflow if Git is used to manage code. It captures the versions of data and models in the Git commits while storing them on-premises or on the cloud (e.g., AWS, Google Cloud, Azure). These data and models (e.g., ML artifacts) are defined in the metadata files, which get updated in every commit. It can allow metrics tracking of models on different versions of the data. +[**[DVC]**](https://dvc.org/doc): It stands for Data Version Control in short and is an open-source, lightweight tool that works on top of Git Hub and supports all kinds of data formats. It can seamlessly integrate into the workflow if Git is used to manage code. It captures the versions of data and models in the Git commits while storing them on-premises or on the cloud (e.g., AWS, Google Cloud, Azure). These data and models (e.g., ML artifacts) are defined in the metadata files, which get updated in every commit. It can allow metrics tracking of models on different versions of the data. **[lakeFS](https://docs.lakefs.io/):** It is an open-source tool that supports the data version control on data lakes. It supports many git-like operations, such as branching and merging of data, as well as reverting to previous versions of the data. 
It also has a unique UI feature, making exploring and managing data much easier. diff --git a/contents/dl_primer/dl_primer.qmd b/contents/dl_primer/dl_primer.qmd index 8b7c748f..0d3b4dd7 100644 --- a/contents/dl_primer/dl_primer.qmd +++ b/contents/dl_primer/dl_primer.qmd @@ -34,19 +34,21 @@ The primer explores major deep learning architectures from a systems perspective Deep learning, a specialized area within machine learning and artificial intelligence (AI), utilizes algorithms modeled after the structure and function of the human brain, known as artificial neural networks. This field is a foundational element in AI, driving progress in diverse sectors such as computer vision, natural language processing, and self-driving vehicles. Its significance in embedded AI systems is highlighted by its capability to handle intricate calculations and predictions, optimizing the limited resources in embedded settings. -@fig-ai-ml-dl provides a visual representation of how deep learning fits within the broader context of AI and machine learning. The diagram illustrates the chronological development and relative segmentation of these three interconnected fields, showcasing deep learning as a specialized subset of machine learning, which in turn is a subset of AI. - -As depicted in the figure, AI represents the overarching field, encompassing all computational methods that mimic human cognitive functions. Machine learning, shown as a subset of AI, includes algorithms capable of learning from data. Deep learning, the smallest subset in the diagram, specifically involves neural networks that are able to learn more complex patterns from large volumes of data. +@fig-ai-ml-dl provides a visual representation of how deep learning fits within the broader context of AI and machine learning. The diagram illustrates the chronological development and relative segmentation of these three interconnected fields, showcasing deep learning as a specialized subset of machine learning, which in turn is a subset of AI. ![The diagram illustrates artificial intelligence as the overarching field encompassing all computational methods that mimic human cognitive functions. Machine learning is a subset of AI that includes algorithms capable of learning from data. Deep learning, a further subset of ML, specifically involves neural networks that are able to learn more complex patterns in large volumes of data. Source: NVIDIA.](images/png/ai_dl_progress_nvidia.png){#fig-ai-ml-dl} +As shown in the figure, AI represents the overarching field, encompassing all computational methods that mimic human cognitive functions. Machine learning, shown as a subset of AI, includes algorithms capable of learning from data. Deep learning, the smallest subset in the diagram, specifically involves neural networks that are able to learn more complex patterns from large volumes of data. + ### Brief History of Deep Learning The idea of deep learning has origins in early artificial neural networks. It has experienced several cycles of interest, starting with the introduction of the Perceptron in the 1950s [@rosenblatt1957perceptron], followed by the invention of backpropagation algorithms in the 1980s [@rumelhart1986learning]. The term "deep learning" became prominent in the 2000s, characterized by advances in computational power and data accessibility. 
Important milestones include the successful training of deep networks like AlexNet [@krizhevsky2012imagenet] by [Geoffrey Hinton](https://amturing.acm.org/award_winners/hinton_4791679.cfm), a leading figure in AI, and the renewed focus on neural networks as effective tools for data analysis and modeling. -Deep learning has recently seen exponential growth, transforming various industries. @fig-trends illustrates this remarkable progression, highlighting two key trends in the field. First, the graph shows that computational growth followed an 18-month doubling pattern from 1952 to 2010. This trend then dramatically accelerated to a 6-month doubling cycle from 2010 to 2022, indicating a significant leap in computational capabilities. Second, the figure depicts the emergence of large-scale models between 2015 and 2022. These models appeared 2 to 3 orders of magnitude faster than the general trend, following an even more aggressive 10-month doubling cycle. This rapid scaling of model sizes represents a paradigm shift in deep learning capabilities. +Deep learning has recently seen exponential growth, transforming various industries. @fig-trends illustrates this remarkable progression, highlighting two key trends in the field. First, the graph shows that computational growth followed an 18-month doubling pattern from 1952 to 2010. This trend then dramatically accelerated to a 6-month doubling cycle from 2010 to 2022, indicating a significant leap in computational capabilities. + +Second, the figure depicts the emergence of large-scale models between 2015 and 2022. These models appeared 2 to 3 orders of magnitude faster than the general trend, following an even more aggressive 10-month doubling cycle. This rapid scaling of model sizes represents a paradigm shift in deep learning capabilities. ![Growth of deep learning models.](https://epochai.org/assets/images/posts/2022/compute-trends.png){#fig-trends} @@ -80,7 +82,11 @@ Below, we examine the primary components and structures in neural networks. ### Perceptrons -The Perceptron is the basic unit or node that forms the foundation for more complex structures. It functions by taking multiple inputs, each representing a feature of the object under analysis, such as the characteristics of a home for predicting its price or the attributes of a song to forecast its popularity in music streaming services. These inputs are denoted as $x_1, x_2, ..., x_n$. +The Perceptron is the basic unit or node that forms the foundation for more complex structures. It functions by taking multiple inputs, each representing a feature of the object under analysis, such as the characteristics of a home for predicting its price or the attributes of a song to forecast its popularity in music streaming services. These inputs are denoted as $x_1, x_2, ..., x_n$. A perceptron can be configured to perform either regression or classification tasks. For regression, the actual numerical output $\hat{y}$ is used. For classification, the output depends on whether $\hat{y}$ crosses a certain threshold. If $\hat{y}$ exceeds this threshold, the perceptron might output one class (e.g., 'yes'), and if it does not, another class (e.g., 'no'). + +@fig-perceptron illustrates the fundamental building blocks of a perceptron, which serves as the foundation for more complex neural networks. A perceptron can be thought of as a miniature decision-maker, utilizing its weights, bias, and activation function to process inputs and generate outputs based on learned parameters. 
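A minimal NumPy sketch of this decision-maker is shown below; the weights, bias, and sigmoid activation are arbitrary choices for illustration, not a trained model. The inputs are combined into a weighted sum plus a bias, and an activation function turns that sum into the output.

```python
import numpy as np

def perceptron(x: np.ndarray, w: np.ndarray, b: float) -> float:
    """Weighted sum of inputs plus bias, passed through a sigmoid activation."""
    z = np.dot(w, x) + b
    y_hat = 1.0 / (1.0 + np.exp(-z))   # sigmoid squashes z into (0, 1)
    return y_hat

# Example: three input features with hand-picked (untrained) weights.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
b = 0.2

y_hat = perceptron(x, w, b)
print(f"output = {y_hat:.3f}")          # for classification, threshold at 0.5
```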
This concept forms the basis for understanding more intricate neural network architectures, such as multilayer perceptrons. In these advanced structures, layers of perceptrons work in concert, with each layer's output serving as the input for the subsequent layer. This hierarchical arrangement creates a deep learning model capable of comprehending and modeling complex, abstract patterns within data. By stacking these simple units, neural networks gain the ability to tackle increasingly sophisticated tasks, from image recognition to natural language processing. + +![Perceptron. Conceived in the 1950s, perceptrons paved the way for developing more intricate neural networks and have been a fundamental building block in deep learning. Source: Wikimedia - Chrislb.](images/png/Rosenblattperceptron.png){#fig-perceptron} Each input $x_i$ has a corresponding weight $w_{ij}$, and the perceptron simply multiplies each input by its matching weight. This operation is similar to linear regression, where the intermediate output, $z$, is computed as the sum of the products of inputs and their weights: @@ -104,12 +110,6 @@ $$ ![Activation functions enable the modeling of complex non-linear relationships. Source: Medium - Sachin Kaushik.](images/png/nonlinear_patterns.png){#fig-nonlinear} -A perceptron can be configured to perform either regression or classification tasks. For regression, the actual numerical output $\hat{y}$ is used. For classification, the output depends on whether $\hat{y}$ crosses a certain threshold. If $\hat{y}$ exceeds this threshold, the perceptron might output one class (e.g., 'yes'), and if it does not, another class (e.g., 'no'). - -![Perceptron. Conceived in the 1950s, perceptrons paved the way for developing more intricate neural networks and have been a fundamental building block in deep learning. Source: Wikimedia - Chrislb.](images/png/Rosenblattperceptron.png){#fig-perceptron} - -@fig-perceptron illustrates the fundamental building blocks of a perceptron, which serves as the foundation for more complex neural networks. A perceptron can be thought of as a miniature decision-maker, utilizing its weights, bias, and activation function to process inputs and generate outputs based on learned parameters. This concept forms the basis for understanding more intricate neural network architectures, such as multilayer perceptrons. In these advanced structures, layers of perceptrons work in concert, with each layer's output serving as the input for the subsequent layer. This hierarchical arrangement creates a deep learning model capable of comprehending and modeling complex, abstract patterns within data. By stacking these simple units, neural networks gain the ability to tackle increasingly sophisticated tasks, from image recognition to natural language processing. - ### Multilayer Perceptrons Multilayer perceptrons (MLPs) are an evolution of the single-layer perceptron model, featuring multiple layers of nodes connected in a feedforward manner. @fig-mlp provides a visual representation of this structure. As illustrated in the figure, information in a feedforward network moves in only one direction - from the input layer on the left, through the hidden layers in the middle, to the output layer on the right, without any cycles or loops. 
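To make the weighted-sum-plus-activation idea and the feedforward flow concrete, here is a minimal NumPy sketch of a single perceptron and a toy two-layer network; the sigmoid activation, layer sizes, and random weights are illustrative assumptions rather than values from the chapter.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the (0, 1) range.
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b):
    # Weighted sum of inputs plus bias, followed by an activation.
    z = np.dot(w, x) + b
    return sigmoid(z)

# A toy input with three features (e.g., attributes of a home or a song).
x = np.array([0.5, -1.2, 3.0])

# Single perceptron: one weight per input feature, plus a bias.
w = np.array([0.4, 0.1, -0.7])
b = 0.2
print("single perceptron output:", perceptron(x, w, b))

# A tiny feedforward network: 3 inputs -> 4 hidden units -> 1 output.
# Each layer is many perceptrons evaluated at once (a matrix multiply).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

h = sigmoid(W1 @ x + b1)    # hidden layer activations
out = sigmoid(W2 @ h + b2)  # network prediction
print("two-layer network output:", out)
```

Real frameworks perform exactly this computation, only with learned weights, far wider layers, and batches of inputs processed at once.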
@@ -118,7 +118,7 @@ Multilayer perceptrons (MLPs) are an evolution of the single-layer perceptron mo While a single perceptron is limited in its capacity to model complex patterns, the real strength of neural networks emerges from the assembly of multiple layers. Each layer consists of numerous perceptrons working together, allowing the network to capture intricate and non-linear relationships within the data. With sufficient depth and breadth, these networks can approximate virtually any function, no matter how complex. -### Training Process +### Training Process A neural network receives an input, performs a calculation, and produces a prediction. The prediction is determined by the calculations performed within the sets of perceptrons found between the input and output layers. These calculations depend primarily on the input and the weights. Since you do not have control over the input, the objective during training is to adjust the weights in such a way that the output of the network provides the most accurate prediction. @@ -126,9 +126,7 @@ The training process involves several key steps, beginning with the forward pass #### Forward Pass -The forward pass is the initial phase where data moves through the network from the input to the output layer. At the start of training, the network's weights are randomly initialized, setting the initial conditions for learning. During the forward pass, each layer performs specific computations on the input data using these weights and biases, and the results are then passed to the subsequent layer. The final output of this phase is the network's prediction. This prediction is compared to the actual target values present in the dataset to calculate the loss, which can be thought of as the difference between the predicted outputs and the target values. The loss quantifies the network's performance at this stage, providing a crucial metric for the subsequent adjustment of weights during the backward pass. - -@fig-forward-propagation explains the concept of forward pass using an illustration. +The forward pass is the initial phase where data moves through the network from the input to the output layer, as illustrated in @fig-forward-propagation. At the start of training, the network's weights are randomly initialized, setting the initial conditions for learning. During the forward pass, each layer performs specific computations on the input data using these weights and biases, and the results are then passed to the subsequent layer. The final output of this phase is the network's prediction. This prediction is compared to the actual target values present in the dataset to calculate the loss, which can be thought of as the difference between the predicted outputs and the target values. The loss quantifies the network's performance at this stage, providing a crucial metric for the subsequent adjustment of weights during the backward pass. ![Neural networks - forward and backward propagation. Source: [Linkedin](https://www.linkedin.com/pulse/lecture2-unveiling-theoretical-foundations-ai-machine-underdown-phd-oqsuc/)](images/png/forwardpropagation.png){#fig-forward-propagation} @@ -205,8 +203,8 @@ CNNs are crucial for image and video recognition tasks, where real-time processi ### Convolutional Neural Networks (CNNs) -We discussed that CNNs excel at identifying image features, making them ideal for tasks like object classification. Now, you'll get to put this knowledge into action! 
This Colab notebook focuses on building a CNN to classify images from the CIFAR-10 dataset, which includes objects like airplanes, cars, and animals. You'll learn about the key differences between CIFAR-10 and the MNIST dataset we explored earlier and how these differences influence model choice. By the end of this notebook, you'll have a grasp of CNNs for image recognition and be well on your way to becoming a TinyML expert!   -   +We discussed that CNNs excel at identifying image features, making them ideal for tasks like object classification. Now, you'll get to put this knowledge into action! This Colab notebook focuses on building a CNN to classify images from the CIFAR-10 dataset, which includes objects like airplanes, cars, and animals. You'll learn about the key differences between CIFAR-10 and the MNIST dataset we explored earlier and how these differences influence model choice. By the end of this notebook, you'll have a grasp of CNNs for image recognition. + [![](https://colab.research.google.com/assets/colab-badge.png)](https://colab.research.google.com/github/Mjrovai/UNIFEI-IESTI01-TinyML-2022.1/blob/main/00_Curse_Folder/1_Fundamentals/Class_11/CNN_Cifar_10.ipynb) ::: diff --git a/contents/efficient_ai/efficient_ai.qmd b/contents/efficient_ai/efficient_ai.qmd index 9ebd1b27..395188c7 100644 --- a/contents/efficient_ai/efficient_ai.qmd +++ b/contents/efficient_ai/efficient_ai.qmd @@ -10,7 +10,7 @@ Resources: [Slides](#sec-efficient-ai-resource), [Videos](#sec-efficient-ai-reso ![_DALL·E 3 Prompt: A conceptual illustration depicting efficiency in artificial intelligence using a shipyard analogy. The scene shows a bustling shipyard where containers represent bits or bytes of data. These containers are being moved around efficiently by cranes and vehicles, symbolizing the streamlined and rapid information processing in AI systems. The shipyard is meticulously organized, illustrating the concept of optimal performance within the constraints of limited resources. In the background, ships are docked, representing different platforms and scenarios where AI is applied. The atmosphere should convey advanced technology with an underlying theme of sustainability and wide applicability._](images/png/cover_efficient_ai.png) -Efficiency in artificial intelligence (AI) is not simply a luxury but a necessity. In this chapter, we dive into the key concepts underpinning AI systems' efficiency. The computational demands on neural networks can be daunting, even for minimal systems. For AI to be seamlessly integrated into everyday devices and essential systems, it must perform optimally within the constraints of limited resources while maintaining its efficacy. The pursuit of efficiency guarantees that AI models are streamlined, rapid, and sustainable, thereby widening their applicability across various platforms and scenarios. +Efficiency in artificial intelligence is not simply a luxury but a necessity. In this chapter, we dive into the key concepts underpinning AI systems' efficiency. The computational demands on neural networks can be daunting, even for minimal systems. For AI to be seamlessly integrated into everyday devices and essential systems, it must perform optimally within the constraints of limited resources while maintaining its efficacy. The pursuit of efficiency guarantees that AI models are streamlined, rapid, and sustainable, thereby widening their applicability across various platforms and scenarios. 
::: {.callout-tip} @@ -103,7 +103,7 @@ Machine learning, and especially deep learning, involves enormous amounts of com ### Numerical Formats {#sec-numerical-formats} -There are many different types of numerics. Numerics have a long history in computing systems. +There are many different types of numerics. Numerics have a long history in computing systems. **Floating point:** Known as a single-precision floating point, FP32 utilizes 32 bits to represent a number, incorporating its sign, exponent, and mantissa. Understanding how floating point numbers are represented under the hood is crucial for grasping the various optimizations possible in numerical computations. The sign bit determines whether the number is positive or negative, the exponent controls the range of values that can be represented, and the mantissa determines the precision of the number. The combination of these components allows floating point numbers to represent a vast range of values with varying degrees of precision. @@ -117,18 +117,16 @@ There are many different types of numerics. Numerics have a long history in comp ::: +FP32 is widely adopted in many deep learning frameworks and balances accuracy and computational requirements. It is prevalent in the training phase for many neural networks due to its sufficient precision in capturing minute details during weight updates. Also known as half-precision floating point, FP16 uses 16 bits to represent a number, including its sign, exponent, and fraction. It offers a good balance between precision and memory savings. FP16 is particularly popular in deep learning training on GPUs that support mixed-precision arithmetic, combining the speed benefits of FP16 with the precision of FP32 where needed. -FP32 is widely adopted in many deep learning frameworks and balances accuracy and computational requirements. It is prevalent in the training phase for many neural networks due to its sufficient precision in capturing minute details during weight updates. -Also known as half-precision floating point, FP16 uses 16 bits to represent a number, including its sign, exponent, and fraction. It offers a good balance between precision and memory savings. FP16 is particularly popular in deep learning training on GPUs that support mixed-precision arithmetic, combining the speed benefits of FP16 with the precision of FP32 where needed. +@fig-float-point-formats shows three different floating-point formats: Float32, Float16, and BFloat16. + +![Three floating-point formats.](images/png/three_float_types.png){#fig-float-point-formats width=90%} Several other numerical formats fall into an exotic class. An exotic example is BF16 or Brain Floating Point. It is a 16-bit numerical format designed explicitly for deep learning applications. It is a compromise between FP32 and FP16, retaining the 8-bit exponent from FP32 while reducing the mantissa to 7 bits (as compared to FP32's 23-bit mantissa). This structure prioritizes range over precision. BF16 has achieved training results comparable in accuracy to FP32 while using significantly less memory and computational resources [@kalamkar2019study]. This makes it suitable not just for inference but also for training deep neural networks. By retaining the 8-bit exponent of FP32, BF16 offers a similar range, which is crucial for deep learning tasks where certain operations can result in very large or very small numbers. At the same time, by truncating precision, BF16 allows for reduced memory and computational requirements compared to FP32. 
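As a rough, illustrative check of these trade-offs, the NumPy sketch below compares the storage and precision of FP32 and FP16 for the same values; the specific numbers are arbitrary, and bfloat16 is only noted in a comment because plain NumPy does not ship that type (frameworks such as TensorFlow and PyTorch, or the ml_dtypes package, provide it).

```python
import numpy as np

# The same values stored at two different precisions.
x32 = np.array([0.1, 1e-4, 3.14159265358979], dtype=np.float32)
x16 = x32.astype(np.float16)

print("FP32 bytes per element:", x32.itemsize)  # 4 bytes
print("FP16 bytes per element:", x16.itemsize)  # 2 bytes

# Precision loss from the smaller mantissa: FP16 keeps far fewer
# significant digits than FP32.
for v32, v16 in zip(x32, x16):
    print(f"fp32={v32:.8f}  fp16={float(v16):.8f}  "
          f"abs err={abs(float(v32) - float(v16)):.2e}")

# FP16 also has a much smaller exponent range than FP32, so large
# values overflow to infinity (FP16 max is roughly 65504).
print("70000 as FP16:", np.float16(np.float32(70000.0)))

# BF16 keeps FP32's 8-bit exponent (similar range) but only a 7-bit
# mantissa (less precision); plain NumPy has no bfloat16, so frameworks
# or the ml_dtypes package are typically used to experiment with it.
```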
BF16 has emerged as a promising middle ground in the landscape of numerical formats for deep learning, providing an efficient and effective alternative to the more traditional FP32 and FP16 formats. -@fig-float-point-formats shows three different floating-point formats: Float32, Float16, and BFloat16. - -![Three floating-point formats.](images/png/three_float_types.png){#fig-float-point-formats width=90%} - **Integer:** These are integer representations using 8, 4, and 2 bits. They are often used during the inference phase of neural networks, where the weights and activations of the model are quantized to these lower precisions. Integer representations are deterministic and offer significant speed and memory advantages over floating-point representations. For many inference tasks, especially on edge devices, the slight loss in accuracy due to quantization is often acceptable, given the efficiency gains. An extreme form of integer numerics is for binary neural networks (BNNs), where weights and activations are constrained to one of two values: +1 or -1. **Variable bit widths:** Beyond the standard widths, research is ongoing into extremely low bit-width numerics, even down to binary or ternary representations. Extremely low bit-width operations can offer significant speedups and further reduce power consumption. While challenges remain in maintaining model accuracy with such drastic quantization, advances continue to be made in this area. diff --git a/contents/frameworks/frameworks.qmd b/contents/frameworks/frameworks.qmd index c369eb7d..d35bf4d8 100644 --- a/contents/frameworks/frameworks.qmd +++ b/contents/frameworks/frameworks.qmd @@ -74,7 +74,7 @@ Each generation of frameworks unlocked new capabilities that powered advancement * TensorFlow Graphics (2020) added 3D data structures to handle point clouds and meshes. -In recent years, the frameworks have converged. @fig-ml-framework shows that TensorFlow and PyTorch have become the overwhelmingly dominant ML frameworks, representing more than 95% of ML frameworks used in research and production. @fig-tensorflow-pytorch draws a contrast between the attributes of TensorFlow and PyTorch. Keras was integrated into TensorFlow in 2019; Preferred Networks transitioned Chainer to PyTorch in 2019; and Microsoft stopped actively developing CNTK in 2022 to support PyTorch on Windows. +In recent years, the landscape of machine learning frameworks has significantly consolidated. @fig-ml-framework illustrates this convergence, showing that TensorFlow and PyTorch have become the overwhelmingly dominant ML frameworks, collectively representing more than 95% of ML frameworks used in research and production. While both frameworks have risen to prominence, they have distinct characteristics. @fig-tensorflow-pytorch draws a contrast between the attributes of TensorFlow and PyTorch, helping to explain their complementary dominance in the field. ![PyTorch vs. TensorFlow: Features and Functions. Source: [K&C](https://www.google.com/url?sa=i&url=https%3A%2F%2Fkruschecompany.com%2Fpytorch-vs-tensorflow%2F&psig=AOvVaw1-DSFxXYprQmYH7Z4Nk6Tk&ust=1722533288351000&source=images&cd=vfe&opi=89978449&ved=0CBEQjRxqFwoTCPDhst7m0YcDFQAAAAAdAAAAABAg)](images/png/tensorflowpytorch.png){#fig-tensorflow-pytorch} @@ -190,9 +190,7 @@ PyTorch and TensorFlow have established themselves as frontrunners in the indust **Performance:** Both frameworks offer efficient hardware acceleration for their operations. 
However, TensorFlow has a slightly more robust optimization workflow, such as the XLA (Accelerated Linear Algebra) compiler, which can further boost performance. Its static computational graph was also advantageous for certain optimizations in the early versions. -**Ecosystem:** PyTorch has a growing ecosystem with tools like TorchServe for serving models and libraries like TorchVision, TorchText, and TorchAudio for specific domains. As we mentioned earlier, TensorFlow has a broad and mature ecosystem. TensorFlow Extended (TFX) provides an end-to-end platform for deploying production machine learning pipelines. Other tools and libraries include TensorFlow Lite, TensorFlow Lite Micro, TensorFlow.js, TensorFlow Hub, and TensorFlow Serving. - -@tbl-pytorch_vs_tf provides a comparative analysis: +**Ecosystem:** PyTorch has a growing ecosystem with tools like TorchServe for serving models and libraries like TorchVision, TorchText, and TorchAudio for specific domains. As we mentioned earlier, TensorFlow has a broad and mature ecosystem. TensorFlow Extended (TFX) provides an end-to-end platform for deploying production machine learning pipelines. Other tools and libraries include TensorFlow Lite, TensorFlow Lite Micro, TensorFlow.js, TensorFlow Hub, and TensorFlow Serving. @tbl-pytorch_vs_tf provides a comparative analysis: +-------------------------------+--------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+ | Aspect | Pytorch | TensorFlow | @@ -216,14 +214,13 @@ Having introduced the popular machine learning frameworks and provided a high-le ### Tensor data structures {#sec-tensor-data-structures} -To understand tensors, let us start from the familiar concepts in linear algebra. As demonstrated in @fig-tensor-data-structure, vectors can be represented as a stack of numbers in a 1-dimensional array. Matrices follow the same idea, and one can think of them as many vectors stacked on each other, making them 2 dimensional. Higher dimensional tensors work the same way. A 3-dimensional tensor is simply a set of matrices stacked on each other in another direction. Therefore, vectors and matrices can be considered special cases of tensors with 1D and 2D dimensions, respectively. - -![Visualization of Tensor Data Structure.](images/png/image2.png){#fig-tensor-data-structure} +As shown in the figure, vectors can be represented as a stack of numbers in a 1-dimensional array. Matrices follow the same idea, and one can think of them as many vectors stacked on each other, making them 2 dimensional. Higher dimensional tensors work the same way. A 3-dimensional tensor, as illustrated in @fig-tensor-data-structure-a, is simply a set of matrices stacked on each other in another direction. Therefore, vectors and matrices can be considered special cases of tensors with 1D and 2D dimensions, respectively. -Tensors offer a flexible structure that can represent data in higher dimensions. For instance, to represent image data, the pixels at each position of an image are structured as matrices. However, images are not represented by just one matrix of pixel values; they typically have three channels where each channel is a matrix containing pixel values that represent the intensity of red, green, or blue. Together, these channels create a colored image. Without tensors, storing all this information from multiple matrices can be complex. 
With tensors, it is easy to contain image data in a single 3-dimensional tensor, with each number representing a certain color value at a specific location in the image. +![Visualization of Tensor Data Structure.](images/png/image2.png){#fig-tensor-data-structure-a} -![Visualization of colored image structure that can be easily stored as a 3D Tensor. Credit: [Niklas Lang](https://towardsdatascience.com/what-are-tensors-in-machine-learning-5671814646ff)](images/png/color_channels_of_image.png){#fig-tensor-data-structure} +Tensors offer a flexible structure that can represent data in higher dimensions. @fig-tensor-data-structure-b illustrates how this concept applies to image data. As shown in the figure, images are not represented by just one matrix of pixel values. Instead, they typically have three channels, where each channel is a matrix containing pixel values that represent the intensity of red, green, or blue. Together, these channels create a colored image. Without tensors, storing all this information from multiple matrices can be complex. However, as @fig-tensor-data-structure-b illustrates, tensors make it easy to contain image data in a single 3-dimensional structure, with each number representing a certain color value at a specific location in the image. +![Visualization of colored image structure that can be easily stored as a 3D Tensor. Credit: [Niklas Lang](https://towardsdatascience.com/what-are-tensors-in-machine-learning-5671814646ff)](images/png/color_channels_of_image.png){#fig-tensor-data-structure-b} You don't have to stop there. If we wanted to store a series of images, we could use a 4-dimensional tensor, where the new dimension represents different images. This means you are storing multiple images, each having three matrices that represent the three color channels. This gives you an idea of the usefulness of tensors when dealing with multi-dimensional data efficiently. @@ -302,13 +299,13 @@ This automatic differentiation is a powerful feature of tensors in frameworks li #### Graph Definition -Computational graphs are a key component of deep learning frameworks like TensorFlow and PyTorch. They allow us to express complex neural network architectures efficiently and differently. A computational graph consists of a directed acyclic graph (DAG) where each node represents an operation or variable, and edges represent data dependencies between them. +Computational graphs are a key component of deep learning frameworks like TensorFlow and PyTorch. They allow us to express complex neural network architectures efficiently and differently. A computational graph consists of a directed acyclic graph (DAG) where each node represents an operation or variable, and edges represent data dependencies between them. -It's important to differentiate computational graphs from neural network diagrams, such as those for multilayer perceptrons (MLPs), which depict nodes and layers. Neural network diagrams, as depicted in [Chapter 3](../dl_primer/dl_primer.qmd), visualize the architecture and flow of data through nodes and layers, providing an intuitive understanding of the model's structure. In contrast, computational graphs provide a low-level representation of the underlying mathematical operations and data dependencies required to implement and train these networks. +It is important to differentiate computational graphs from neural network diagrams, such as those for multilayer perceptrons (MLPs), which depict nodes and layers. 
Neural network diagrams, as depicted in [Chapter 3](../dl_primer/dl_primer.qmd), visualize the architecture and flow of data through nodes and layers, providing an intuitive understanding of the model's structure. In contrast, computational graphs provide a low-level representation of the underlying mathematical operations and data dependencies required to implement and train these networks. -For example, a node might represent a matrix multiplication operation, taking two input matrices (or tensors) and producing an output matrix (or tensor). To visualize this, consider the simple example in @fig-computational-graph. The directed acyclic graph above computes $z = x \times y$, where each variable is just numbers. +For example, a node might represent a matrix multiplication operation, taking two input matrices (or tensors) and producing an output matrix (or tensor). To visualize this, consider the simple example in @fig-comp-graph. The directed acyclic graph computes $z = x \times y$, where each variable is just numbers. -![Basic example of a computational graph.](images/png/image1.png){#fig-computational-graph width="50%" height="auto" align="center"} +![Basic example of a computational graph.](images/png/image1.png){#fig-comp-graph width="50%" height="auto" align="center"} Frameworks like TensorFlow and PyTorch create computational graphs to implement the architectures of neural networks that we typically represent with diagrams. When you define a neural network layer in code (e.g., a dense layer in TensorFlow), the framework constructs a computational graph that includes all the necessary operations (such as matrix multiplication, addition, and activation functions) and their data dependencies. This graph enables the framework to efficiently manage the flow of data, optimize the execution of operations, and automatically compute gradients for training. Underneath the hood, the computational graphs represent abstractions for common layers like convolutional, pooling, recurrent, and dense layers, with data including activations, weights, and biases represented in tensors. This representation allows for efficient computation, leveraging the structure of the graph to parallelize operations and apply optimizations. @@ -405,15 +402,15 @@ Computational graphs can only be as good as the data they learn from and work on At the core of these pipelines are data loaders, which handle reading training examples from sources like files, databases, and object storage. Data loaders facilitate efficient data loading and preprocessing, crucial for deep learning models. For instance, TensorFlow's [tf.data](https://www.tensorflow.org/guide/data) dataloading pipeline is designed to manage this process. Depending on the application, deep learning models require diverse data formats such as CSV files or image folders. Some popular formats include: -* CSV, a versatile, simple format often used for tabular data. +* **CSV**: A versatile, simple format often used for tabular data. -* TFRecord: TensorFlow's proprietary format, optimized for performance. +* **TFRecord**: TensorFlow's proprietary format, optimized for performance. -* Parquet: Columnar storage, offering efficient data compression and retrieval. +* **Parquet**: Columnar storage, offering efficient data compression and retrieval. -* JPEG/PNG: Commonly used for image data. +* **JPEG/PNG**: Commonly used for image data. -* WAV/MP3: Prevalent formats for audio data. +* **WAV/MP3**: Prevalent formats for audio data. 
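To illustrate how these pieces fit together, here is a hypothetical `tf.data` sketch that builds a small pipeline from in-memory arrays and applies the shuffling, batching, and prefetching discussed below; the stand-in data, batch size of 32, and preprocessing step are illustrative assumptions rather than recommendations from the text.

```python
import numpy as np
import tensorflow as tf

# Stand-in data: 1,000 fake 32x32 RGB "images" (uint8) with integer labels.
images = np.random.randint(0, 256, size=(1000, 32, 32, 3), dtype="uint8")
labels = np.random.randint(0, 10, size=(1000,))

# Build a dataset from in-memory tensors. For on-disk formats such as
# TFRecord, tf.data provides dedicated readers (e.g., tf.data.TFRecordDataset).
ds = tf.data.Dataset.from_tensor_slices((images, labels))

def preprocess(image, label):
    # Illustrative preprocessing: cast to float and scale pixels to [0, 1].
    return tf.cast(image, tf.float32) / 255.0, label

ds = (
    ds.shuffle(buffer_size=1000)                             # randomize example order
      .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)  # parallel preprocessing
      .batch(32)                                             # group examples for vectorized hardware
      .prefetch(tf.data.AUTOTUNE)                            # overlap loading with training
)

for batch_images, batch_labels in ds.take(1):
    print(batch_images.shape, batch_labels.shape)  # (32, 32, 32, 3) (32,)
```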
Data loaders batch examples to leverage vectorization support in hardware. Batching refers to grouping multiple data points for simultaneous processing, leveraging the vectorized computation capabilities of hardware like GPUs. While typical batch sizes range from 32 to 512 examples, the optimal size often depends on the data's memory footprint and the specific hardware constraints. Advanced loaders can stream virtually unlimited datasets from disk and cloud storage. They stream large datasets from disks or networks instead of fully loading them into memory, enabling unlimited dataset sizes. @@ -549,9 +546,11 @@ These steps to remove barriers to entry continue to democratize machine learning Transfer learning is the practice of using knowledge gained from a pre-trained model to train and improve the performance of a model for a different task. For example, models such as MobileNet and ResNet are trained on the ImageNet dataset. To do so, one may freeze the pre-trained model, utilizing it as a feature extractor to train a much smaller model built on top of the feature extraction. One can also fine-tune the entire model to fit the new task. Machine learning frameworks make it easy to load pre-trained models, freeze specific layers, and train custom layers on top. They simplify this process by providing intuitive APIs and easy access to large repositories of [pre-trained models](https://keras.io/api/applications/). -Transfer learning has challenges, such as the modified model's inability to conduct its original tasks after transfer learning. Papers such as ["Learning without Forgetting"](https://browse.arxiv.org/pdf/1606.09282.pdf) by @li2017learning try to address these challenges and have been implemented in modern machine learning platforms. @fig-transfer-learning simplifies the concept of transfer learning through an example. +Transfer learning, while powerful, comes with challenges. One significant issue is the modified model's potential inability to conduct its original tasks after transfer learning. To address these challenges, researchers have proposed various solutions. For example, @li2017learning introduced the concept of "Learning without Forgetting" in their paper ["Learning without Forgetting"](https://browse.arxiv.org/pdf/1606.09282.pdf), which has since been implemented in modern machine learning platforms. @fig-tl provides a simplified illustration of the transfer learning concept: + +![Transfer learning. Source: [Tech Target](https://www.google.com/url?sa=i&url=https%3A%2F%2Fanalyticsindiamag.com%2Fdevelopers-corner%2Fcomplete-guide-to-understanding-precision-and-recall-curves%2F&psig=AOvVaw3MosZItazJt2eermLTArjj&ust=1722534897757000&source=images&cd=vfe&opi=89978449&ved=0CBEQjRxqFwoTCIi389bs0YcDFQAAAAAdAAAAABAw)](images/png/transferlearning.png){#fig-tl} -![Transfer learning. Source: [Tech Target](https://www.google.com/url?sa=i&url=https%3A%2F%2Fanalyticsindiamag.com%2Fdevelopers-corner%2Fcomplete-guide-to-understanding-precision-and-recall-curves%2F&psig=AOvVaw3MosZItazJt2eermLTArjj&ust=1722534897757000&source=images&cd=vfe&opi=89978449&ved=0CBEQjRxqFwoTCIi389bs0YcDFQAAAAAdAAAAABAw)](images/png/transferlearning.png){#fig-transfer-learning} +As shown in @fig-tl, transfer learning involves taking a model trained on one task (the source task) and adapting it to perform a new, related task (the target task). This process allows the model to leverage knowledge gained from the source task, potentially improving performance and reducing training time on the target task. 
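As a sketch of how frameworks streamline this workflow, the hypothetical Keras snippet below freezes a pre-trained MobileNetV2 feature extractor and stacks a small trainable head on top; the input size, ten-class head, and commented-out dataset are placeholder assumptions, not details from the text.

```python
import tensorflow as tf

# Load a model pre-trained on ImageNet, without its classification head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet"
)
base.trainable = False  # freeze the pre-trained feature extractor

# Add a small trainable head for the new task.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),  # e.g., 10 new classes
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# model.fit(new_task_dataset, epochs=5)  # new_task_dataset is hypothetical
# Optionally unfreeze some layers afterwards (base.trainable = True) and
# fine-tune with a much lower learning rate.
```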
However, as mentioned earlier, care must be taken to ensure that the model doesn't "forget" its ability to perform the original task during this process. #### Federated Learning @@ -763,29 +762,43 @@ Through various custom techniques, such as static compilation, model-based sched ## Choosing the Right Framework -Choosing the right machine learning framework for a given application requires carefully evaluating models, hardware, and software considerations. By analyzing these three aspects—models, hardware, and software—ML engineers can select the optimal framework and customize it as needed for efficient and performant on-device ML applications. The goal is to balance model complexity, hardware limitations, and software integration to design a tailored ML pipeline for embedded and edge devices. +Choosing the right machine learning framework for a given application requires carefully evaluating models, hardware, and software considerations. @fig-tf-comparison provides a comparison of different TensorFlow frameworks, which we'll discuss in more detail: ![TensorFlow Framework Comparison - General. Source: TensorFlow.](images/png/image4.png){#fig-tf-comparison width="100%" height="auto" align="center" caption="TensorFlow Framework Comparison - General"} +Analyzing these three aspects—models, hardware, and software—as depicted in @fig-tf-comparison, ML engineers can select the optimal framework and customize it as needed for efficient and performant on-device ML applications. The goal is to balance model complexity, hardware limitations, and software integration to design a tailored ML pipeline for embedded and edge devices. As we examine the differences shown in @fig-tf-comparison, we'll gain insights into how to pick the right framework and understand what causes the variations between frameworks. + ### Model -TensorFlow supports significantly more operations (ops) than TensorFlow Lite and TensorFlow Lite Micro as it is typically used for research or cloud deployment, which require a large number of and more flexibility with operators (see @fig-tf-comparison). TensorFlow Lite supports select ops for on-device training, whereas TensorFlow Micro does not. TensorFlow Lite also supports dynamic shapes and quantization-aware training, but TensorFlow Micro does not. In contrast, TensorFlow Lite and TensorFlow Micro offer native quantization tooling and support, where quantization refers to transforming an ML program into an approximated representation with available lower precision operations. +@fig-tf-comparison illustrates the key differences between TensorFlow variants, particularly in terms of supported operations (ops) and features. TensorFlow supports significantly more operations than TensorFlow Lite and TensorFlow Lite Micro, as it is typically used for research or cloud deployment, which require a large number of and more flexibility with operators. + +The figure clearly demonstrates this difference in op support across the frameworks. TensorFlow Lite supports select ops for on-device training, whereas TensorFlow Micro does not. Additionally, the figure shows that TensorFlow Lite supports dynamic shapes and quantization-aware training, features that are absent in TensorFlow Micro. In contrast, both TensorFlow Lite and TensorFlow Micro offer native quantization tooling and support. 
Here, quantization refers to transforming an ML program into an approximated representation with available lower precision operations, a crucial feature for embedded and edge devices with limited computational resources. ### Software +As shown in @fig-tf-sw-comparison, TensorFlow Lite Micro does not have OS support, while TensorFlow and TensorFlow Lite do. This design choice for TensorFlow Lite Micro helps reduce memory overhead, make startup times faster, and consume less energy. Instead, TensorFlow Lite Micro can be used in conjunction with real-time operating systems (RTOS) like FreeRTOS, Zephyr, and Mbed OS. + +The figure also highlights an important memory management feature: TensorFlow Lite and TensorFlow Lite Micro support model memory mapping, allowing models to be directly accessed from flash storage rather than loaded into RAM. In contrast, TensorFlow does not offer this capability. + ![TensorFlow Framework Comparison - Software. Source: TensorFlow.](images/png/image5.png){#fig-tf-sw-comparison width="100%" height="auto" align="center" caption="TensorFlow Framework Comparison - Model"} -TensorFlow Lite Micro does not have OS support, while TensorFlow and TensorFlow Lite do, to reduce memory overhead, make startup times faster, and consume less energy (see @fig-tf-sw-comparison). TensorFlow Lite Micro can be used in conjunction with real-time operating systems (RTOS) like FreeRTOS, Zephyr, and Mbed OS. TensorFlow Lite and TensorFlow Lite Micro support model memory mapping, allowing models to be directly accessed from flash storage rather than loaded into RAM, whereas TensorFlow does not. TensorFlow and TensorFlow Lite support accelerator delegation to schedule code to different accelerators, whereas TensorFlow Lite Micro does not, as embedded systems tend to have a limited array of specialized accelerators. +Another key difference is accelerator delegation. TensorFlow and TensorFlow Lite support this feature, allowing them to schedule code to different accelerators. However, TensorFlow Lite Micro does not offer accelerator delegation, as embedded systems tend to have a limited array of specialized accelerators. + +These differences demonstrate how each TensorFlow variant is optimized for its target deployment environment, from powerful cloud servers to resource-constrained embedded devices. ### Hardware +TensorFlow Lite and TensorFlow Lite Micro have significantly smaller base binary sizes and memory footprints than TensorFlow (see @fig-tf-hw-comparison). For example, a typical TensorFlow Lite Micro binary is less than 200KB, whereas TensorFlow is much larger. This is due to the resource-constrained environments of embedded systems. TensorFlow supports x86, TPUs, and GPUs like NVIDIA, AMD, and Intel. + ![TensorFlow Framework Comparison - Hardware. Source: TensorFlow.](images/png/image3.png){#fig-tf-hw-comparison width="100%" height="auto" align="center" caption="TensorFlow Framework Comparison - Hardware"} -TensorFlow Lite and TensorFlow Lite Micro have significantly smaller base binary sizes and memory footprints than TensorFlow (see @fig-tf-hw-comparison). For example, a typical TensorFlow Lite Micro binary is less than 200KB, whereas TensorFlow is much larger. This is due to the resource-constrained environments of embedded systems. TensorFlow supports x86, TPUs, and GPUs like NVIDIA, AMD, and Intel. TensorFlow Lite supports Arm Cortex-A and x86 processors commonly used on mobile phones and tablets. 
The latter is stripped of all the unnecessary training logic for on-device deployment. TensorFlow Lite Micro provides support for microcontroller-focused Arm Cortex M cores like M0, M3, M4, and M7, as well as DSPs like Hexagon and SHARC and MCUs like STM32, NXP Kinetis, Microchip AVR. +TensorFlow Lite supports Arm Cortex-A and x86 processors commonly used on mobile phones and tablets. TensorFlow Lite itself is stripped of all the unnecessary training logic for on-device deployment. TensorFlow Lite Micro provides support for microcontroller-focused Arm Cortex M cores like M0, M3, M4, and M7, as well as DSPs like Hexagon and SHARC and MCUs like STM32, NXP Kinetis, and Microchip AVR. ### Other Factors -Selecting the appropriate AI framework is essential to ensure that embedded systems can efficiently execute AI models. Several key factors beyond models, hardware, and software should be considered when evaluating AI frameworks for embedded systems. Other key factors to consider when choosing a machine learning framework are performance, scalability, ease of use, integration with data engineering tools, integration with model optimization tools, and community support. By understanding these factors, you can make informed decisions and maximize the potential of your machine-learning initiatives. +Selecting the appropriate AI framework is essential to ensure that embedded systems can efficiently execute AI models. Several key factors beyond models, hardware, and software should be considered when evaluating AI frameworks for embedded systems. + +Other key factors to consider when choosing a machine learning framework are performance, scalability, ease of use, integration with data engineering tools, integration with model optimization tools, and community support. Developers can make informed decisions and maximize the potential of their machine-learning initiatives by understanding these various factors. #### Performance @@ -843,7 +856,7 @@ We first introduced the necessity of machine learning frameworks like TensorFlow Advanced features further improve these frameworks' usability, enabling tasks like fine-tuning large pre-trained models and facilitating federated learning. These capabilities are critical for developing sophisticated machine learning models efficiently. -Embedded AI frameworks, such as TensorFlow Lite Micro, provide specialized tools for deploying models on resource-constrained platforms. TensorFlow Lite Micro, for instance, offers comprehensive optimization tooling, including quantization mapping and kernel optimizations, to ensure high performance on microcontroller-based platforms like Arm Cortex-M and RISC-V processors. Frameworks specifically built for specialized hardware like CMSIS-NN on Cortex-M processors can further maximize performance but sacrifice portability. +Embedded AI or TinyML frameworks, such as TensorFlow Lite Micro, provide specialized tools for deploying models on resource-constrained platforms. TensorFlow Lite Micro, for instance, offers comprehensive optimization tooling, including quantization mapping and kernel optimizations, to ensure high performance on microcontroller-based platforms like Arm Cortex-M and RISC-V processors. Frameworks specifically built for specialized hardware like CMSIS-NN on Cortex-M processors can further maximize performance but sacrifice portability.
Integrated frameworks from processor vendors tailor the stack to their architectures, unlocking the full potential of their chips but locking you into their ecosystem. Ultimately, choosing the right framework involves finding the best match between its capabilities and the requirements of the target platform. This requires balancing trade-offs between performance needs, hardware constraints, model complexity, and other factors. Thoroughly assessing the intended models and use cases and evaluating options against key metrics will guide developers in selecting the ideal framework for their machine learning applications. diff --git a/contents/hw_acceleration/hw_acceleration.qmd b/contents/hw_acceleration/hw_acceleration.qmd index 590bbb6f..057efb8a 100644 --- a/contents/hw_acceleration/hw_acceleration.qmd +++ b/contents/hw_acceleration/hw_acceleration.qmd @@ -58,14 +58,14 @@ This evolution demonstrates how hardware acceleration has focused on solving com The evolution of hardware acceleration is closely tied to the broader history of computing. Central to this history is the role of transistors, the fundamental building blocks of modern electronics. Transistors act as tiny switches that can turn on or off, enabling the complex computations that drive everything from simple calculators to advanced machine learning models. In the early decades, chip design was governed by Moore's Law, which predicted that the number of transistors on an integrated circuit would double approximately every two years, and Dennard Scaling, which observed that as transistors became smaller, their performance (speed) increased, while power density (power per unit area) remained constant. These two laws were held through the single-core era. @fig-moore-dennard shows the trends of different microprocessor metrics. As the figure denotes, Dennard Scaling fails around the mid-2000s; notice how the clock speed (frequency) remains almost constant even as the number of transistors keeps increasing. +![Microprocessor trends. Source: [Karl Rupp](https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/).](images/png/hwai_40yearsmicrotrenddata.png){#fig-moore-dennard} + However, as @patterson2016computer describes, technological constraints eventually forced a transition to the multicore era, with chips containing multiple processing cores to deliver performance gains. Power limitations prevented further scaling, which led to "dark silicon" ([Dark Silicon](https://en.wikipedia.org/wiki/Dark_silicon)), where not all chip areas could be simultaneously active [@xiu2019time]. "Dark silicon" refers to portions of the chip that cannot be powered simultaneously due to thermal and power limitations. Essentially, as the density of transistors increased, the proportion of the chip that could be actively used without overheating or exceeding power budgets shrank. This phenomenon meant that while chips had more transistors, not all could be operational simultaneously, limiting potential performance gains. This power crisis necessitated a shift to the accelerator era, with specialized hardware units tailored for specific tasks to maximize efficiency. The explosion in AI workloads further drove demand for customized accelerators. Enabling factors included new programming languages, software tools, and manufacturing advances. -![Microprocessor trends. 
Source: [Karl Rupp](https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/).](images/png/hwai_40yearsmicrotrenddata.png){#fig-moore-dennard} - Fundamentally, hardware accelerators are evaluated on performance, power, and silicon area (PPA)—the nature of the target application—whether memory-bound or compute-bound—heavily influences the design. For example, memory-bound workloads demand high bandwidth and low latency access, while compute-bound applications require maximal computational throughput. ### General Principles @@ -140,10 +140,10 @@ By structuring the analysis along this spectrum, we aim to illustrate the fundam @fig-design-tradeoffs illustrates the complex interplay between flexibility, performance, functional diversity, and area of architecture design. Notice how the ASIC is on the bottom-right corner, with minimal area, flexibility, and power consumption and maximal performance, due to its highly specialized application-specific nature. A key tradeoff is functional diversity vs performance: general purpose architectures can serve diverse applications but their application performance is degraded as compared to more customized architectures. -The progression begins with the most specialized option, ASICs purpose-built for AI, to ground our understanding in the maximum possible optimizations before expanding to more generalizable architectures. This structured approach elucidates the accelerator design space. - ![Design tradeoffs. Source: @rayis2014.](images/png/tradeoffs.png){#fig-design-tradeoffs} +The progression begins with the most specialized option, ASICs purpose-built for AI, to ground our understanding in the maximum possible optimizations before expanding to more generalizable architectures. This structured approach elucidates the accelerator design space. + ### Application-Specific Integrated Circuits (ASICs) An Application-Specific Integrated Circuit (ASIC) is a type of [integrated circuit](https://en.wikipedia.org/wiki/Integrated_circuit) (IC) that is custom-designed for a specific application or workload rather than for general-purpose use. Unlike CPUs and GPUs, ASICs do not support multiple applications or workloads. Rather, they are optimized to perform a single task extremely efficiently. The Google TPU is an example of an ASIC. @@ -267,7 +267,7 @@ While FPGAs may not achieve the utmost performance and efficiency of workload-sp ##### Customized Parallelism and Pipelining -FPGA architectures can leverage spatial parallelism and pipelining by tailoring the hardware design to mirror the parallelism in ML models. For example, Intel's HARPv2 FPGA platform splits the layers of an MNIST convolutional network across separate processing elements to maximize throughput. Unique parallel patterns like tree ensemble evaluations are also possible on FPGAs. Deep pipelines with optimized buffering and dataflow can be customized to each model's structure and datatypes. This level of tailored parallelism and pipelining is not feasible on GPUs. +FPGA architectures can leverage spatial parallelism and pipelining by tailoring the hardware design to mirror the parallelism in ML models. For example, on Intel's HARPv2 FPGA platform, one can split the layers of a convolutional network across separate processing elements to maximize throughput. Unique parallel patterns like tree ensemble evaluations are also possible on FPGAs. Deep pipelines with optimized buffering and dataflow can be customized to each model's structure and datatypes.
This level of tailored parallelism and pipelining is not feasible on GPUs. ##### Low Latency On-Chip Memory @@ -275,15 +275,14 @@ Large amounts of high-bandwidth on-chip memory enable localized storage for weig ##### Native Support for Low Precision -A key advantage of FPGAs is the ability to natively implement any bit width for arithmetic units, such as INT4 or bfloat16, used in quantized ML models. For example, Intel's Stratix 10 NX FPGAs have dedicated INT8 cores that can achieve up to 143 INT8 TOPS (Tera Operations Per Second) at ~1 TOPS/W (Tera Operations Per Second per Watt) [Intel Stratix 10 NX FPGA -](https://www.intel.com/content/www/us/en/products/details/fpga/stratix/10/nx.html). TOPS is a measure of performance similar to FLOPS, but while FLOPS measures floating-point calculations, TOPS measures the number of integer operations a system can perform per second. Lower bit widths, like INT8 or INT4, increase arithmetic density and performance. FPGAs can even support mixed precision or dynamic precision tuning at runtime. +A key advantage of FPGAs is the ability to natively implement any bit width for arithmetic units, such as INT4 or bfloat16, used in quantized ML models. For example, [Intel Stratix 10 NX FPGA +](https://www.intel.com/content/www/us/en/products/details/fpga/stratix/10/nx.html) has dedicated INT8 cores that can achieve up to 143 INT8 TOPS (Tera Operations Per Second) at ~1 TOPS/W (Tera Operations Per Second per Watt). TOPS is a measure of performance similar to FLOPS, but while FLOPS measures floating-point calculations, TOPS measures the number of integer operations a system can perform per second. Lower bit widths, like INT8 or INT4, increase arithmetic density and performance. FPGAs can even support mixed precision or dynamic precision tuning at runtime. #### Disadvantages ##### Lower Peak Throughput than ASICs -FPGAs cannot match the raw throughput numbers of ASICs customized for a specific model and precision. The overheads of the reconfigurable fabric compared to fixed function hardware result in lower peak performance. For example, the TPU v5e pods allow up to 256 chips to be connected with more than 100 petaOps (Peta Operations Per Second) of INT8 performance, while FPGAs can offer up to 143 INT8 TOPS or 286 INT4 TOPS [Intel Stratix 10 NX FPGA -](https://www.intel.com/content/www/us/en/products/details/fpga/stratix/10/nx.html). PetaOps represents quadrillions of operations per second, whereas TOPS measures trillions, highlighting the much greater throughput capability of TPU pods compared to FPGAs. +FPGAs cannot match the raw throughput numbers of ASICs customized for a specific model and precision. The overheads of the reconfigurable fabric compared to fixed function hardware result in lower peak performance. For example, the TPU v5e pods allow up to 256 chips to be connected with more than 100 PetaOps (Peta Operations Per Second) of INT8 performance, while FPGAs can offer up to 143 INT8 TOPS or 286 INT4 TOPS such as on the Intel Stratix 10 NX FPGA; PetaOps represents quadrillions of operations per second, whereas TOPS measures trillions, highlighting the much greater throughput capability of TPU pods compared to FPGAs. This is because FPGAs comprise basic building blocks—configurable logic blocks, RAM blocks, and interconnects. Vendors provide a set amount of these resources. To program FPGAs, engineers write HDL code and compile it into bitstreams that rearrange the fabric, which has inherent overheads versus an ASIC purpose-built for one computation. 
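To put figures like TOPS and PetaOps in perspective, the back-of-envelope Python calculation below converts a model's multiply-accumulate count and an accelerator's advertised peak throughput into a rough latency estimate; the model size and the 30% utilization factor are illustrative assumptions, with only the 143 INT8 TOPS figure taken from the discussion above.

```python
# Back-of-envelope latency estimate from peak throughput figures.
# Assumptions (illustrative, not measured): a model with 300 million MACs
# per inference, an accelerator advertising 143 INT8 TOPS, and 30% of the
# peak actually sustained on a real workload.

macs_per_inference = 300e6                   # multiply-accumulates per inference
ops_per_inference = 2 * macs_per_inference   # 1 MAC = 1 multiply + 1 add

peak_tops = 143        # advertised peak, in tera-operations per second
utilization = 0.30     # assumed fraction of peak achieved in practice

achieved_ops_per_s = peak_tops * 1e12 * utilization
latency_s = ops_per_inference / achieved_ops_per_s

print(f"Estimated latency: {latency_s * 1e6:.1f} microseconds per inference")
print(f"Estimated throughput: {1 / latency_s:,.0f} inferences per second")
```

Sketches like this only bound what is possible: real designs rarely hit their advertised peak, which is exactly why the achievable utilization differs so much between reconfigurable fabrics and fixed-function ASICs.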
@@ -319,7 +318,7 @@ DSPs integrate large amounts of fast on-chip SRAM memory to hold data locally fo ##### Power Efficiency -DSPs are engineered to provide high performance per watt on digital signal workloads. Efficient data paths, parallelism, and memory architectures enable trillions of math operations per second within tight mobile power budgets. For example, [Qualcomm's Hexagon DSP](https://developer.qualcomm.com/software/hexagon-dsp-sdk/dsp-processor) can deliver 4 trillion operations per second (TOPS) while consuming minimal watts. +DSPs are engineered to provide high performance per watt on digital signal workloads. Efficient data paths, parallelism, and memory architectures enable trillions of math operations per second within tight mobile power budgets. For example, [Qualcomm's Hexagon DSP](https://developer.qualcomm.com/software/hexagon-dsp-sdk/dsp-processor) can deliver 4 TOPS while consuming minimal watts. ##### Support for Integer and Floating Point Math @@ -696,7 +695,7 @@ The [benchmarking chapter](../benchmarking/benchmarking.qmd) explores this topic Benchmarking suites such as MLPerf, Fathom, and AI Benchmark offer a set of standardized tests that can be used across different hardware platforms. These suites measure AI accelerator performance across various neural networks and machine learning tasks, from basic image classification to complex language processing. Providing a common ground for Comparison, they help ensure that performance claims are consistent and verifiable. These "tools" are applied not only to guide the development of hardware but also to ensure that the software stack leverages the full potential of the underlying architecture. -* **MLPerf:** Includes a broad set of benchmarks covering both training [@mattson2020mlperf] and inference [@reddi2020mlperf] for a range of machine learning tasks. @fig-ml-perf showcases the uses of MLperf. +* **MLPerf:** Includes a broad set of benchmarks covering both training [@mattson2020mlperf] and inference [@reddi2020mlperf] for a range of machine learning tasks. @fig-ml-perf showcases the diversity of AI use cases covered by MLPerf. * **Fathom:** Focuses on core operations in deep learning models, emphasizing their execution on different architectures [@adolf2016fathom]. * **AI Benchmark:** Targets mobile and consumer devices, assessing AI performance in end-user applications [@ignatov2018ai]. @@ -799,10 +798,10 @@ In response, new manufacturing techniques like wafer-scale fabrication and advan Wafer-scale AI takes an extremely integrated approach, manufacturing an entire silicon wafer as one gigantic chip. This differs drastically from conventional CPUs and GPUs, which cut each wafer into many smaller individual chips. @fig-wafer-scale shows a comparison between Cerebras Wafer Scale Engine 2, which is the largest chip ever built, and the largest GPU. While some GPUs may contain billions of transistors, they still pale in Comparison to the scale of a wafer-size chip with over a trillion transistors. -The wafer-scale approach also diverges from more modular system-on-chip designs that still have discrete components communicating by bus. Instead, wafer-scale AI enables full customization and tight integration of computation, memory, and interconnects across the entire die. - ![Wafer-scale vs. GPU. 
Source: [Cerebras](https://www.cerebras.net/product-chip/).](images/png/aimage1.png){#fig-wafer-scale} +The wafer-scale approach also diverges from more modular system-on-chip designs that still have discrete components communicating by bus. Instead, wafer-scale AI enables full customization and tight integration of computation, memory, and interconnects across the entire die. + By designing the wafer as one integrated logic unit, data transfer between elements is minimized. This provides lower latency and power consumption than discrete system-on-chip or chiplet designs. While chiplets can offer flexibility by mixing and matching components, communication between chiplets is challenging. The monolithic nature of wafer-scale integration eliminates these inter-chip communication bottlenecks. However, the ultra-large-scale also poses difficulties for manufacturability and yield with wafer-scale designs. Defects in any region of the wafer can make (certain parts of) the chip unusable. Specialized lithography techniques are required to produce such large dies. So, wafer-scale integration pursues the maximum performance gains from integration but requires overcoming substantial fabrication challenges. @@ -819,7 +818,9 @@ However, the ultra-large-scale also poses difficulties for manufacturability and #### Chiplets for AI -Chiplet design refers to a semiconductor architecture in which a single integrated circuit (IC) is constructed from multiple smaller, individual components known as chiplets. Each chiplet is a self-contained functional block, typically specialized for a specific task or functionality. These chiplets are then interconnected on a larger substrate or package to create a cohesive system. @fig-chiplet illustrates this concept. For AI hardware, chiplets enable the mixing of different types of chips optimized for tasks like matrix multiplication, data movement, analog I/O, and specialized memories. This heterogeneous integration differs greatly from wafer-scale integration, where all logic is manufactured as one monolithic chip. Companies like Intel and AMD have adopted chiplet designs for their CPUs. +Chiplet design refers to a semiconductor architecture in which a single integrated circuit (IC) is constructed from multiple smaller, individual components known as chiplets. Each chiplet is a self-contained functional block, typically specialized for a specific task or functionality. These chiplets are then interconnected on a larger substrate or package to create a cohesive system. + +@fig-chiplet illustrates this concept. For AI hardware, chiplets enable the mixing of different types of chips optimized for tasks like matrix multiplication, data movement, analog I/O, and specialized memories. This heterogeneous integration differs greatly from wafer-scale integration, where all logic is manufactured as one monolithic chip. Companies like Intel and AMD have adopted chiplet designs for their CPUs. Chiplets are interconnected using advanced packaging techniques like high-density substrate interposers, 2.5D/3D stacking, and wafer-level packaging. This allows combining chiplets fabricated with different process nodes, specialized memories, and various optimized AI engines. @@ -848,7 +849,9 @@ Neuromorphic computing is an emerging field aiming to emulate the efficiency and Intel and IBM are leading commercial efforts in neuromorphic hardware. Intel's Loihi and Loihi 2 chips [@davies2018loihi; @davies2021advancing] offer programmable neuromorphic cores with on-chip learning. 
IBM's Northpole [@modha2023neural] device comprises over 100 million magnetic tunnel junction synapses and 68 billion transistors. These specialized chips deliver benefits like low power consumption for edge inference.

-Spiking neural networks (SNNs) [@maass1997networks] are computational models for neuromorphic hardware. Unlike deep neural networks communicating via continuous values, SNNs use discrete spikes that are more akin to biological neurons. This allows efficient event-based computation rather than constant processing. Additionally, SNNs consider the temporal and spatial characteristics of input data. This better mimics biological neural networks, where the timing of neuronal spikes plays an important role. However, training SNNs remains challenging due to the added temporal complexity. @fig-spiking provides an overview of the spiking methodology: (a) Diagram of a neuron; (b) Measuring an action potential propagated along the axon of a neuron. Only the action potential is detectable along the axon; (c) The neuron's spike is approximated with a binary representation; (d) Event-Driven Processing; (e) Active Pixel Sensor and Dynamic Vision Sensor.
+Spiking neural networks (SNNs) [@maass1997networks] are computational models for neuromorphic hardware. Unlike deep neural networks communicating via continuous values, SNNs use discrete spikes that are more akin to biological neurons. This allows efficient event-based computation rather than constant processing. Additionally, SNNs consider the temporal and spatial characteristics of input data. This better mimics biological neural networks, where the timing of neuronal spikes plays an important role.
+
+However, training SNNs remains challenging due to the added temporal complexity. @fig-spiking provides an overview of the spiking methodology: (a) Illustration of a neuron; (b) Measuring an action potential propagated along the axon of a neuron. Only the action potential is detectable along the axon; (c) The neuron's spike is approximated with a binary representation; (d) Event-Driven Processing; (e) Active Pixel Sensor and Dynamic Vision Sensor.

![Neuromorphic spiking. Source: @eshraghian2023training.](images/png/aimage4.png){#fig-spiking}

@@ -924,7 +927,7 @@ While in-memory computing technologies like ReRAM and PIM offer exciting prospec

### Optical Computing

-In AI acceleration, a burgeoning area of interest lies in novel technologies that deviate from traditional paradigms. Some emerging technologies mentioned above, such as flexible electronics, in-memory computing, or even neuromorphic computing, are close to becoming a reality, given their ground-breaking innovations and applications. One of the promising and leading next-gen frontiers is optical computing technologies [@miller2000optical,@zhou2022photonic ]. Companies like [[LightMatter]](https://lightmatter.co/) are pioneering the use of light photonics for calculations, thereby utilizing photons instead of electrons for data transmission and computation.
+In AI acceleration, a burgeoning area of interest lies in novel technologies that deviate from traditional paradigms. Some emerging technologies mentioned above, such as flexible electronics, in-memory computing, or even neuromorphic computing, are close to becoming a reality, given their ground-breaking innovations and applications. One of the promising and leading next-gen frontiers is optical computing technologies [@miller2000optical; @zhou2022photonic]. 
Companies like [LightMatter](https://lightmatter.co/) are pioneering the use of photonics for calculations, thereby utilizing photons instead of electrons for data transmission and computation.

Optical computing utilizes photons and photonic devices rather than traditional electronic circuits for computing and data processing. It takes inspiration from fiber optic communication links that rely on light for fast, efficient data transfer [@shastri2021photonics]. Light can propagate with much less loss than semiconductors' electrons, enabling inherent speed and efficiency benefits.

@@ -949,7 +952,7 @@ As a result, optical computing is still in the very early research stage despite

Quantum computers leverage unique phenomena of quantum physics, like superposition and entanglement, to represent and process information in ways not possible classically. Instead of binary bits, the fundamental unit is the quantum bit or qubit. Unlike classical bits, which are limited to 0 or 1, qubits can exist simultaneously in a superposition of both states due to quantum effects.

-Multiple qubits can also be entangled, leading to exponential information density but introducing probabilistic results. Superposition enables parallel computation on all possible states, while entanglement allows nonlocal correlations between qubits. @fig-qubit simulates the structure of a qubit.
+Multiple qubits can also be entangled, leading to exponential information density but introducing probabilistic results. Superposition enables parallel computation on all possible states, while entanglement allows nonlocal correlations between qubits. @fig-qubit illustrates the differences between classical bits and quantum bits (qubits).

![Qubits, the building blocks of quantum computing. Source: [Microsoft](https://azure.microsoft.com/en-gb/resources/cloud-computing-dictionary/what-is-a-qubit)](images/png/qubit.png){#fig-qubit}

@@ -968,7 +971,7 @@ However, quantum states are fragile and prone to errors that require error-correcti

While meaningful quantum advantage for ML remains far off, active research at companies like [D-Wave](https://www.dwavesys.com/company/about-d-wave/), [Rigetti](https://www.rigetti.com/), and [IonQ](https://ionq.com/) is advancing quantum computer engineering and quantum algorithms. Major technology companies like Google, [IBM](https://www.ibm.com/quantum?utm_content=SRCWW&p1=Search&p4C700050385964705&p5=e&gclid=Cj0KCQjw-pyqBhDmARIsAKd9XIPD9U1Sjez_S0z5jeDDE4nRyd6X_gtVDUKJ-HIolx2vOc599KgW8gAaAv8gEALw_wcB&gclsrc=aw.ds), and Microsoft are actively exploring quantum computing. Google recently announced a 72-qubit quantum processor called [Bristlecone](https://blog.research.google/2018/03/a-preview-of-bristlecone-googles-new.html) and plans to build a 49-qubit commercial quantum system. Microsoft also has an active research program in topological quantum computing and collaborates with quantum startup [IonQ](https://ionq.com/)

-Quantum techniques may first make inroads into optimization before more generalized ML adoption. Realizing quantum ML's full potential awaits major milestones in quantum hardware development and ecosystem maturity. @fig-q-computing illustrates a comparison between quantum computing and classical computing.
+Quantum techniques may first make inroads into optimization before more generalized ML adoption. Realizing quantum ML's full potential awaits major milestones in quantum hardware development and ecosystem maturity. 
@fig-q-computing illustratively compares quantum computing and classical computing. ![Comparing quantum computing with classical computing. Source: [Devopedia](​​https://devopedia.org/quantum-computing)](images/png/qcomputing.png){#fig-q-computing} diff --git a/contents/introduction/introduction.qmd b/contents/introduction/introduction.qmd index 7bbcfab7..b152f99b 100644 --- a/contents/introduction/introduction.qmd +++ b/contents/introduction/introduction.qmd @@ -8,9 +8,11 @@ bibliography: introduction.bib ## Overview -In the early 1990s, [Mark Weiser](https://en.wikipedia.org/wiki/Mark_Weiser), a pioneering computer scientist, introduced the world to a revolutionary concept that would forever change how we interact with technology. This was succintly captured in the paper he wrote on "The Computer for the 21st Century" (@fig-ubiqutous). He envisioned a future where computing would be seamlessly integrated into our environments, becoming an invisible, integral part of daily life. This vision, which he termed "ubiquitous computing," promised a world where technology would serve us without demanding our constant attention or interaction. Fast forward to today, and we find ourselves on the cusp of realizing Weiser's vision, thanks to the advent and proliferation of machine learning systems. +In the early 1990s, [Mark Weiser](https://en.wikipedia.org/wiki/Mark_Weiser), a pioneering computer scientist, introduced the world to a revolutionary concept that would forever change how we interact with technology. This vision was succinctly captured in his seminal paper, "The Computer for the 21st Century" (see @fig-ubiquitous). Weiser envisioned a future where computing would be seamlessly integrated into our environments, becoming an invisible, integral part of daily life. -![Ubiqutous computing.](images/png/21st_computer.png){#fig-ubiqutous width=50%} +![Ubiquitous computing as envisioned by Mark Weiser.](images/png/21st_computer.png){#fig-ubiquitous width=50%} + +He termed this concept "ubiquitous computing," promising a world where technology would serve us without demanding our constant attention or interaction. Fast forward to today, and we find ourselves on the cusp of realizing Weiser's vision, thanks to the advent and proliferation of machine learning systems. In the vision of ubiquitous computing [@weiser1991computer], the integration of processors into everyday objects is just one aspect of a larger paradigm shift. The true essence of this vision lies in creating an intelligent environment that can anticipate our needs and act on our behalf, enhancing our experiences without requiring explicit commands. To achieve this level of pervasive intelligence, it is crucial to develop and deploy machine learning systems that span the entire ecosystem, from the cloud to the edge and even to the tiniest IoT devices. @@ -24,20 +26,24 @@ However, deploying machine learning systems across the computing continuum prese Furthermore, the varying computational capabilities and energy constraints of devices at different layers of the computing continuum necessitate the development of efficient and adaptable machine learning models. Techniques such as model compression, federated learning, and transfer learning can help address these challenges, enabling the deployment of intelligence across a wide range of devices. -As we move towards the realization of Weiser's vision of ubiquitous computing, the development and deployment of machine learning systems across the entire ecosystem will be critical. 
By leveraging the strengths of each layer of the computing continuum, we can create an intelligent environment that seamlessly integrates with our daily lives, anticipating our needs and enhancing our experiences in ways that were once unimaginable. As we continue to push the boundaries of what's possible with distributed machine learning, we inch closer to a future where technology becomes an invisible but integral part of our world. @fig-applications-of-ml illustrates some common applications of AI around us. +As we move towards the realization of Weiser's vision of ubiquitous computing, the development and deployment of machine learning systems across the entire ecosystem will be critical. By leveraging the strengths of each layer of the computing continuum, we can create an intelligent environment that seamlessly integrates with our daily lives, anticipating our needs and enhancing our experiences in ways that were once unimaginable. As we continue to push the boundaries of what's possible with distributed machine learning, we inch closer to a future where technology becomes an invisible but integral part of our world. ![Common applications of Machine Learning. Source: [EDUCBA](https://www.educba.com/applications-of-machine-learning/)](images/png/mlapplications.png){#fig-applications-of-ml} +This vision is already beginning to take shape, as illustrated by the common applications of AI surrounding us in our daily lives (see @fig-applications-of-ml). From healthcare and finance to transportation and entertainment, machine learning is transforming various sectors, making our interactions with technology more intuitive and personalized. + ## What's Inside the Book In this book, we will explore the technical foundations of ubiquitous machine learning systems, the challenges of building and deploying these systems across the computing continuum, and the vast array of applications they enable. A unique aspect of this book is its function as a conduit to seminal scholarly works and academic research papers, aimed at enriching the reader's understanding and encouraging deeper exploration of the subject. This approach seeks to bridge the gap between pedagogical materials and cutting-edge research trends, offering a comprehensive guide that is in step with the evolving field of applied machine learning. To improve the learning experience, we have included a variety of supplementary materials. Throughout the book, you will find slides that summarize key concepts, videos that provide in-depth explanations and demonstrations, exercises that reinforce your understanding, and labs that offer hands-on experience with the tools and techniques discussed. These additional resources are designed to cater to different learning styles and help you gain a deeper, more practical understanding of the subject matter. -We begin with the fundamentals, introducing key concepts in systems and machine learning, and providing a deep learning primer. We then guide you through the AI workflow, from data engineering to selecting the right AI frameworks. The training section covers efficient AI training techniques, model optimizations, and AI acceleration using specialized hardware. Deployment is addressed next, with chapters on benchmarking AI, distributed learning, and ML operations. Advanced topics like security, privacy, responsible AI, sustainable AI, robust AI, and generative AI are then explored in depth. The book concludes by highlighting the positive impact of AI and its potential for good. 
@fig-ml-lifecycle outlines the lifecycle of a machine learning project. +We begin with the fundamentals, introducing key concepts in systems and machine learning, and providing a deep learning primer. We then guide you through the AI workflow, from data engineering to selecting the right AI frameworks. This workflow closely follows the lifecycle of a typical machine learning project, as illustrated in @fig-ml-lifecycle. ![Machine Learning project life cycle. Source:[Medium](https://ihsanulpro.medium.com/complete-machine-learning-project-flowchart-explained-0f55e52b9381)](images/png/mlprojectlifecycle.png){#fig-ml-lifecycle} +The training section covers efficient AI training techniques, model optimizations, and AI acceleration using specialized hardware. Deployment is addressed next, with chapters on benchmarking AI, distributed learning, and ML operations. Advanced topics like security, privacy, responsible AI, sustainable AI, robust AI, and generative AI are then explored in depth. The book concludes by highlighting the positive impact of AI and its potential for good. + ## How to Navigate This Book To get the most out of this book, we recommend a structured learning approach that leverages the various resources provided. Each chapter includes slides, videos, exercises, and labs to cater to different learning styles and reinforce your understanding. Additionally, an AI tutor bot (SocratiQ AI) is readily available to guide you through the content and provide personalized assistance. diff --git a/contents/ml_systems/ml_systems.qmd b/contents/ml_systems/ml_systems.qmd index 6a80338b..fade11b2 100644 --- a/contents/ml_systems/ml_systems.qmd +++ b/contents/ml_systems/ml_systems.qmd @@ -36,7 +36,7 @@ ML is rapidly evolving, with new paradigms reshaping how models are developed, t Modern machine learning systems span a spectrum of deployment options, each with its own set of characteristics and use cases. At one end, we have cloud-based ML, which leverages powerful centralized computing resources for complex, data-intensive tasks. Moving along the spectrum, we encounter edge ML, which brings computation closer to the data source for reduced latency and improved privacy. At the far end, we find TinyML, which enables machine learning on extremely low-power devices with severe memory and processing constraints. -This chapter explores the landscape of contemporary machine learning systems, covering the key approaches of Cloud ML, Edge ML, and TinyML (@fig-cloud-edge-tinyml-comparison). We'll examine the unique characteristics, advantages, and challenges of each approach, as well as the emerging trends and technologies that are shaping the future of machine learning deployment. +This chapter explores the landscape of contemporary machine learning systems, covering three key approaches: Cloud ML, Edge ML, and TinyML. @fig-cloud-edge-tinyml-comparison illustrates the spectrum of distributed intelligence across these approaches, providing a visual comparison of their characteristics. We will examine the unique characteristics, advantages, and challenges of each approach, as depicted in the figure. Additionally, we will discuss the emerging trends and technologies that are shaping the future of machine learning deployment, considering how they might influence the balance between these three paradigms. ![Cloud vs. Edge vs. TinyML: The Spectrum of Distributed Intelligence. 
Source: ABI Research -- TinyML.](images/png/cloud-edge-tiny.png){#fig-cloud-edge-tinyml-comparison} @@ -56,15 +56,15 @@ Each of these paradigms has its own strengths and is suited to different use cas The progression from Cloud to Edge to TinyML reflects a broader trend in computing towards more distributed, localized processing. This evolution is driven by the need for faster response times, improved privacy, reduced bandwidth usage, and the ability to operate in environments with limited or no connectivity. -@fig-vMLsizes illustrates the key differences between Cloud ML, Edge ML, and TinyML in terms of hardware, latency, connectivity, power requirements, and model complexity. As we move from Cloud to Edge to TinyML, we see a dramatic reduction in available resources, which presents significant challenges for deploying sophisticated machine learning models. - -This resource disparity becomes particularly apparent when attempting to deploy deep learning models on microcontrollers, the primary hardware platform for TinyML. These tiny devices have severely constrained memory and storage capacities, which are often insufficient for conventional deep learning models. We will learn to put these things into perspective in this chapter. +@fig-vMLsizes illustrates the key differences between Cloud ML, Edge ML, and TinyML in terms of hardware, latency, connectivity, power requirements, and model complexity. As we move from Cloud to Edge to TinyML, we see a dramatic reduction in available resources, which presents significant challenges for deploying sophisticated machine learning models. This resource disparity becomes particularly apparent when attempting to deploy deep learning models on microcontrollers, the primary hardware platform for TinyML. These tiny devices have severely constrained memory and storage capacities, which are often insufficient for conventional deep learning models. We will learn to put these things into perspective in this chapter. ![From cloud GPUs to microcontrollers: Navigating the memory and storage landscape across computing devices. Source: [@lin2023tiny]](./images/jpg/cloud_mobile_tiny_sizes.jpg){#fig-vMLsizes} ## Cloud ML -Cloud ML leverages powerful servers in the cloud for training and running large, complex ML models, and relies on internet connectivity. +Cloud ML leverages powerful servers in the cloud for training and running large, complex ML models and relies on internet connectivity. @fig-cloud-ml provides an overview of Cloud ML's capabilities which we will discuss in greater detail throughout this section. + +![Section overview for Cloud ML.](images/png/cloudml.png){#fig-cloud-ml} ### Characteristics @@ -74,7 +74,9 @@ Cloud Machine Learning (Cloud ML) is a subfield of machine learning that leverag **Centralized Infrastructure** -One of the key characteristics of Cloud ML is its centralized infrastructure. Cloud service providers offer a virtual platform that consists of high-capacity servers, expansive storage solutions, and robust networking architectures, all housed in data centers distributed across the globe (@fig-cloudml-example). This centralized setup allows for the pooling and efficient management of computational resources, making it easier to scale machine learning projects as needed. +One of the key characteristics of Cloud ML is its centralized infrastructure. @fig-cloudml-example illustrates this concept with an example from Google's Cloud TPU data center. 
Cloud service providers offer a virtual platform that consists of high-capacity servers, expansive storage solutions, and robust networking architectures, all housed in data centers distributed across the globe. As shown in the figure, these centralized facilities can be massive in scale, housing rows upon rows of specialized hardware. This centralized setup allows for the pooling and efficient management of computational resources, making it easier to scale machine learning projects as needed. + +![Cloud TPU data center at Google. Source: [Google.](https://blog.google/technology/ai/google-gemini-ai/#scalable-efficient)](images/png/cloud_ml_tpu.png){#fig-cloudml-example} **Scalable Data Processing and Model Training** @@ -94,8 +96,6 @@ By leveraging the pay-as-you-go pricing model offered by cloud service providers Cloud ML has revolutionized the way machine learning is approached, making it more accessible, scalable, and efficient. It has opened up new possibilities for organizations to harness the power of machine learning without the need for significant investments in hardware and infrastructure. -![Cloud TPU data center at Google. Source: [Google.](https://blog.google/technology/ai/google-gemini-ai/#scalable-efficient)](images/png/cloud_ml_tpu.png){#fig-cloudml-example} - ### Benefits Cloud ML offers several significant benefits that make it a powerful choice for machine learning projects: @@ -170,10 +170,7 @@ Cloud ML is deeply integrated into our online experiences, shaping the way we in **Security and Anomaly Detection** -Cloud ML plays a role in bolstering user security by powering anomaly detection systems. These systems continuously monitor user activities and system logs to identify unusual patterns or suspicious behavior. By analyzing vast amounts of data in real-time, Cloud ML algorithms can detect potential cyber threats, such as unauthorized access attempts, malware infections, or data breaches. The cloud's scalability and processing power enable these systems to handle the increasing complexity and volume of security data, providing a proactive approach to protecting users and systems from potential threats. @fig-cloud-ml provides an overview of this section. - -![Section summary for Cloud ML.](images/png/cloudml.png){#fig-cloud-ml} - +Cloud ML plays a role in bolstering user security by powering anomaly detection systems. These systems continuously monitor user activities and system logs to identify unusual patterns or suspicious behavior. By analyzing vast amounts of data in real-time, Cloud ML algorithms can detect potential cyber threats, such as unauthorized access attempts, malware infections, or data breaches. The cloud’s scalability and processing power enable these systems to handle the increasing complexity and volume of security data, providing a proactive approach to protecting users and systems from potential threats. ## Edge ML @@ -181,18 +178,20 @@ Cloud ML plays a role in bolstering user security by powering anomaly detection **Definition of Edge ML** -Edge Machine Learning (Edge ML) runs machine learning algorithms directly on endpoint devices or closer to where the data is generated rather than relying on centralized cloud servers. This approach brings computation closer to the data source, reducing the need to send large volumes of data over networks, often resulting in lower latency and improved data privacy. 
+Edge Machine Learning (Edge ML) runs machine learning algorithms directly on endpoint devices or closer to where the data is generated rather than relying on centralized cloud servers. This approach brings computation closer to the data source, reducing the need to send large volumes of data over networks, often resulting in lower latency and improved data privacy. @fig-edge-ml provides an overview of this section. + +![Section overview for Edge ML.](images/png/edgeml.png){#fig-edge-ml} **Decentralized Data Processing** -In Edge ML, data processing happens in a decentralized fashion. Instead of sending data to remote servers, the data is processed locally on devices like smartphones, tablets, or Internet of Things (IoT) devices (@fig-edgeml-example). This local processing allows devices to make quick decisions based on the data they collect without relying heavily on a central server's resources. This decentralization is particularly important in real-time applications where even a slight delay can have significant consequences. +In Edge ML, data processing happens in a decentralized fashion, as illustrated in @fig-edgeml-example. Instead of sending data to remote servers, the data is processed locally on devices like smartphones, tablets, or Internet of Things (IoT) devices. The figure showcases various examples of these edge devices, including wearables, industrial sensors, and smart home appliances. This local processing allows devices to make quick decisions based on the data they collect without relying heavily on a central server's resources. + +![Edge ML Examples. Source: Edge Impulse.](images/jpg/edge_ml_iot.jpg){#fig-edgeml-example} **Local Data Storage and Computation** Local data storage and computation are key features of Edge ML. This setup ensures that data can be stored and analyzed directly on the devices, thereby maintaining the privacy of the data and reducing the need for constant internet connectivity. Moreover, this often leads to more efficient computation, as data doesn't have to travel long distances, and computations are performed with a more nuanced understanding of the local context, which can sometimes result in more insightful analyses. -![Edge ML Examples. Source: Edge Impulse.](images/jpg/edge_ml_iot.jpg){#fig-edgeml-example} - ### Benefits **Reduced Latency** @@ -237,9 +236,7 @@ Edge ML plays a crucial role in efficiently managing various systems in smart ho The Industrial IoT leverages Edge ML to monitor and control complex industrial processes. Here, machine learning models can analyze data from numerous sensors in real-time, enabling predictive maintenance, optimizing operations, and enhancing safety measures. This revolution in industrial automation and efficiency is transforming manufacturing and production across various sectors. -The applicability of Edge ML is vast and not limited to these examples. Various other sectors, including healthcare, agriculture, and urban planning, are exploring and integrating Edge ML to develop innovative solutions responsive to real-world needs and challenges, heralding a new era of smart, interconnected systems. @fig-edge-ml provides an overview of this section. - -![Section summary for Edge ML.](images/png/edgeml.png){#fig-edge-ml} +The applicability of Edge ML is vast and not limited to these examples. 
Various other sectors, including healthcare, agriculture, and urban planning, are exploring and integrating Edge ML to develop innovative solutions responsive to real-world needs and challenges, heralding a new era of smart, interconnected systems. ## Tiny ML @@ -247,7 +244,9 @@ The applicability of Edge ML is vast and not limited to these examples. Various **Definition of TinyML** -TinyML sits at the crossroads of embedded systems and machine learning, representing a burgeoning field that brings smart algorithms directly to tiny microcontrollers and sensors. These microcontrollers operate under severe resource constraints, particularly regarding memory, storage, and computational power (see a TinyML kit example in @fig-tinyml-example). +TinyML sits at the crossroads of embedded systems and machine learning, representing a burgeoning field that brings smart algorithms directly to tiny microcontrollers and sensors. These microcontrollers operate under severe resource constraints, particularly regarding memory, storage, and computational power. @fig-tiny-ml encapsulates the key aspects of TinyML discussed in this section. + +![Section overview for Tiny ML.](images/png/tinyml.png){#fig-tiny-ml} **On-Device Machine Learning** @@ -255,7 +254,7 @@ In TinyML, the focus is on on-device machine learning. This means that machine l **Low Power and Resource-Constrained Environments** -TinyML excels in low-power and resource-constrained settings. These environments require highly optimized solutions that function within the available resources. TinyML meets this need through specialized algorithms and models designed to deliver decent performance while consuming minimal energy, thus ensuring extended operational periods, even in battery-powered devices. +TinyML excels in low-power and resource-constrained settings. These environments require highly optimized solutions that function within the available resources. @fig-tinyml-example showcases an example TinyML device kit, illustrating the compact nature of these systems. These devices can typically fit in the palm of your hand or, in some cases, are even as small as a fingernail. TinyML meets the need for efficiency through specialized algorithms and models designed to deliver decent performance while consuming minimal energy, thus ensuring extended operational periods, even in battery-powered devices like those shown. ![Examples of TinyML device kits. Source: [Widening Access to Applied Machine Learning with TinyML.](https://arxiv.org/pdf/2106.04008.pdf)](images/jpg/tiny_ml.jpg){#fig-tinyml-example} @@ -315,16 +314,16 @@ TinyML can be employed to create anomaly detection models that identify unusual In environmental monitoring, TinyML enables real-time data analysis from various field-deployed sensors. These could range from city air quality monitoring to wildlife tracking in protected areas. Through TinyML, data can be processed locally, allowing for quick responses to changing conditions and providing a nuanced understanding of environmental patterns, crucial for informed decision-making. -In summary, TinyML serves as a trailblazer in the evolution of machine learning, fostering innovation across various fields by bringing intelligence directly to the edge. Its potential to transform our interaction with technology and the world is immense, promising a future where devices are connected, intelligent, and capable of making real-time decisions and responses. @fig-tiny-ml provides an overview of this section. 
- -![Section summary for Tiny ML.](images/png/tinyml.png){#fig-tiny-ml} +In summary, TinyML serves as a trailblazer in the evolution of machine learning, fostering innovation across various fields by bringing intelligence directly to the edge. Its potential to transform our interaction with technology and the world is immense, promising a future where devices are connected, intelligent, and capable of making real-time decisions and responses. ## Comparison -Up to this point, we've explored each of the different ML variants individually. Now, let's bring them all together for a comprehensive view. @tbl-big_vs_tiny offers a comparative analysis of Cloud ML, Edge ML, and TinyML based on various features and aspects. Additionally, @fig-venn-diagram draws a contrast using a venn diagram. This comparison provides a clear perspective on the unique advantages and distinguishing factors, aiding in making informed decisions based on the specific needs and constraints of a given application or project. +Let's bring together the different ML variants we've explored individually for a comprehensive view. @fig-venn-diagram illustrates the relationships and overlaps between Cloud ML, Edge ML, and TinyML using a Venn diagram. This visual representation effectively highlights the unique characteristics of each approach while also showing areas of commonality. Each ML paradigm has its own distinct features, but there are also intersections where these approaches share certain attributes or capabilities. This diagram helps us understand how these variants relate to each other in the broader landscape of machine learning implementations. ![ML Venn diagram. Source: [arXiv](https://arxiv.org/html/2403.19076v1)](images/png/venndiagram.png){#fig-venn-diagram} +For a more detailed comparison of these ML variants, we can refer to @tbl-big_vs_tiny. This table offers a comprehensive analysis of Cloud ML, Edge ML, and TinyML based on various features and aspects. By examining these different characteristics side by side, we gain a clearer perspective on the unique advantages and distinguishing factors of each approach. This detailed comparison, combined with the visual overview provided by the Venn diagram, aids in making informed decisions based on the specific needs and constraints of a given application or project. + +--------------------------+---------------------------------------------------------+---------------------------------------------------------+----------------------------------------------------------+ | Aspect | Cloud ML | Edge ML | TinyML | +:=========================+:========================================================+:========================================================+:=========================================================+ diff --git a/contents/ondevice_learning/ondevice_learning.bib b/contents/ondevice_learning/ondevice_learning.bib index afc583f7..4b765ea3 100644 --- a/contents/ondevice_learning/ondevice_learning.bib +++ b/contents/ondevice_learning/ondevice_learning.bib @@ -1,360 +1,388 @@ %comment{This file was created with betterbib v5.0.11.} - @inproceedings{abadi2016deep, - author = {Abadi, Martin and Chu, Andy and Goodfellow, Ian and McMahan, H. 
Brendan and Mironov, Ilya and Talwar, Kunal and Zhang, Li}, - address = {New York, NY, USA}, - booktitle = {Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security}, - date-added = {2023-11-22 18:06:03 -0500}, - date-modified = {2023-11-22 18:08:42 -0500}, - doi = {10.1145/2976749.2978318}, - keywords = {deep learning, differential privacy}, - pages = {308--318}, - publisher = {ACM}, - series = {CCS '16}, - source = {Crossref}, - title = {Deep Learning with Differential Privacy}, - url = {https://doi.org/10.1145/2976749.2978318}, - year = {2016}, - month = oct, + doi = {10.1145/2976749.2978318}, + source = {Crossref}, + author = {Abadi, Martin and Chu, Andy and Goodfellow, Ian and McMahan, H. Brendan and Mironov, Ilya and Talwar, Kunal and Zhang, Li}, + year = {2016}, + month = oct, + url = {https://doi.org/10.1145/2976749.2978318}, + booktitle = {Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security}, + publisher = {ACM}, + title = {Deep Learning with Differential Privacy}, + address = {New York, NY, USA}, + date-added = {2023-11-22 18:06:03 -0500}, + date-modified = {2023-11-22 18:08:42 -0500}, + keywords = {deep learning, differential privacy}, + pages = {308--318}, + series = {CCS '16}, +} + +@inproceedings{mcmahan2017communication, + author = {McMahan, Brendan and Moore, Eider and Ramage, Daniel and Hampson, Seth and y Arcas, Blaise Ag\"uera}, + title = {Communication-Efficient Learning of Deep Networks from Decentralized Data.}, + journal = {AISTATS}, + pages = {1273--1282}, + year = {2017}, + url = {http://proceedings.mlr.press/v54/mcmahan17a.html}, + source = {DBLP}, + booktitle = {Artificial intelligence and statistics}, + organization = {PMLR}, } @inproceedings{cai2020tinytl, - author = {Cai, Han and Gan, Chuang and Zhu, Ligeng and Han, Song}, - editor = {Larochelle, Hugo and Ranzato, Marc'Aurelio and Hadsell, Raia and Balcan, Maria-Florina and Lin, Hsuan-Tien}, - bibsource = {dblp computer science bibliography, https://dblp.org}, - biburl = {https://dblp.org/rec/conf/nips/CaiGZ020.bib}, - booktitle = {Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual}, - timestamp = {Tue, 19 Jan 2021 00:00:00 +0100}, - title = {{TinyTL:} {Reduce} Memory, Not Parameters for Efficient On-Device Learning}, - url = {https://proceedings.neurips.cc/paper/2020/hash/81f7acabd411274fcf65ce2070ed568a-Abstract.html}, - year = {2020}, + author = {Cai, Han and Gan, Chuang and Zhu, Ligeng and 0003, Song Han}, + title = {TinyTL: Reduce Memory, Not Parameters for Efficient On-Device Learning.}, + journal = {NeurIPS}, + year = {2020}, + url = {https://proceedings.neurips.cc/paper/2020/hash/81f7acabd411274fcf65ce2070ed568a-Abstract.html}, + source = {DBLP}, + editor = {Larochelle, Hugo and Ranzato, Marc'Aurelio and Hadsell, Raia and Balcan, Maria-Florina and Lin, Hsuan-Tien}, + bibsource = {dblp computer science bibliography, https://dblp.org}, + biburl = {https://dblp.org/rec/conf/nips/CaiGZ020.bib}, + booktitle = {Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual}, + timestamp = {Tue, 19 Jan 2021 00:00:00 +0100}, } @article{chen2016training, - author = {Chen, Tianqi and Xu, Bing and Zhang, Chiyuan and Guestrin, Carlos}, - journal = {ArXiv preprint}, - title = {Training deep nets with sublinear memory cost}, - url = 
{https://arxiv.org/abs/1604.06174}, - volume = {abs/1604.06174}, - year = {2016}, + url = {http://arxiv.org/abs/1604.06174v2}, + year = {2016}, + month = apr, + title = {Training Deep Nets with Sublinear Memory Cost}, + author = {Chen, Tianqi and Xu, Bing and Zhang, Chiyuan and Guestrin, Carlos}, + primaryclass = {cs.LG}, + archiveprefix = {arXiv}, + journal = {ArXiv preprint}, + volume = {abs/1604.06174}, } @inproceedings{chen2018tvm, - author = {Chen, Tianqi and Moreau, Thierry and Jiang, Ziheng and Zheng, Lianmin and Yan, Eddie and Shen, Haichen and Cowan, Meghan and Wang, Leyuan and Hu, Yuwei and Ceze, Luis and others}, - booktitle = {13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)}, - pages = {578--594}, - title = {{TVM:} {An} automated End-to-End optimizing compiler for deep learning}, - year = {2018}, + author = {Chen, Tianqi and Moreau, Thierry and Jiang, Ziheng and Zheng, Lianmin and Yan, Eddie and Shen, Haichen and Cowan, Meghan and Wang, Leyuan and Hu, Yuwei and Ceze, Luis and others}, + booktitle = {13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)}, + pages = {578--594}, + title = {TVM: An automated End-to-End optimizing compiler for deep learning}, + year = {2018}, } @article{chen2023learning, - author = {Chen, Zhiyong and Xu, Shugong}, - doi = {10.1186/s13636-023-00299-2}, - issn = {1687-4722}, - journal = {EURASIP Journal on Audio, Speech, and Music Processing}, - number = {1}, - pages = {33}, - publisher = {Springer Science and Business Media LLC}, - source = {Crossref}, - title = {Learning domain-heterogeneous speaker recognition systems with personalized continual federated learning}, - url = {https://doi.org/10.1186/s13636-023-00299-2}, - volume = {2023}, - year = {2023}, - month = sep, + number = {1}, + doi = {10.1186/s13636-023-00299-2}, + source = {Crossref}, + volume = {2023}, + author = {Chen, Zhiyong and Xu, Shugong}, + year = {2023}, + month = sep, + url = {https://doi.org/10.1186/s13636-023-00299-2}, + issn = {1687-4722}, + journal = {EURASIP Journal on Audio, Speech, and Music Processing}, + publisher = {Springer Science and Business Media LLC}, + title = {Learning domain-heterogeneous speaker recognition systems with personalized continual federated learning}, + pages = {33}, } @article{david2021tensorflow, - author = {David, Robert and Duke, Jared and Jain, Advait and Janapa Reddi, Vijay and Jeffries, Nat and Li, Jian and Kreeger, Nick and Nappier, Ian and Natraj, Meghna and Wang, Tiezhen and others}, - journal = {Proceedings of Machine Learning and Systems}, - pages = {800--811}, - title = {Tensorflow lite micro: {Embedded} machine learning for tinyml systems}, - volume = {3}, - year = {2021}, + author = {David, Robert and Duke, Jared and Jain, Advait and Janapa Reddi, Vijay and Jeffries, Nat and Li, Jian and Kreeger, Nick and Nappier, Ian and Natraj, Meghna and Wang, Tiezhen and others}, + journal = {Proceedings of Machine Learning and Systems}, + pages = {800--811}, + title = {Tensorflow lite micro: Embedded machine learning for tinyml systems}, + volume = {3}, + year = {2021}, } @article{desai2016five, - author = {Desai, Tanvi and Ritchie, Felix and Welpton, Richard and others}, - journal = {Economics Working Paper Series}, - pages = {28}, - title = {Five Safes: {Designing} data access for research}, - volume = {1601}, - year = {2016}, + author = {Desai, Tanvi and Ritchie, Felix and Welpton, Richard and others}, + journal = {Economics Working Paper Series}, + pages = {28}, + title = {Five Safes: 
Designing data access for research}, + volume = {1601}, + year = {2016}, } @article{dhar2021survey, - author = {Dhar, Sauptik and Guo, Junyao and Liu, Jiayi (Jason) and Tripathi, Samarth and Kurup, Unmesh and Shah, Mohak}, - doi = {10.1145/3450494}, - issn = {2691-1914, 2577-6207}, - journal = {ACM Transactions on Internet of Things}, - number = {3}, - pages = {1--49}, - publisher = {Association for Computing Machinery (ACM)}, - source = {Crossref}, - subtitle = {An Algorithms and Learning Theory Perspective}, - title = {A Survey of On-Device Machine Learning}, - url = {https://doi.org/10.1145/3450494}, - volume = {2}, - year = {2021}, - month = jul, + number = {3}, + doi = {10.1145/3450494}, + pages = {1--49}, + source = {Crossref}, + volume = {2}, + author = {Dhar, Sauptik and Guo, Junyao and Liu, Jiayi (Jason) and Tripathi, Samarth and Kurup, Unmesh and Shah, Mohak}, + subtitle = {An Algorithms and Learning Theory Perspective}, + year = {2021}, + month = jul, + url = {https://doi.org/10.1145/3450494}, + issn = {2691-1914,2577-6207}, + journal = {ACM Transactions on Internet of Things}, + publisher = {Association for Computing Machinery (ACM)}, + title = {A Survey of On-Device Machine Learning}, } @article{dwork2014algorithmic, - author = {Dwork, Cynthia and Roth, Aaron}, - doi = {10.1561/0400000042}, - issn = {1551-305X, 1551-3068}, - journal = {Foundations and Trends{\textregistered} in Theoretical Computer Science}, - number = {3-4}, - pages = {211--407}, - publisher = {Now Publishers}, - source = {Crossref}, - title = {The Algorithmic Foundations of Differential Privacy}, - url = {https://doi.org/10.1561/0400000042}, - volume = {9}, - year = {2013}, + number = {3-4}, + doi = {10.1561/0400000042}, + pages = {211--407}, + source = {Crossref}, + volume = {9}, + author = {Dwork, Cynthia and Roth, Aaron}, + year = {2013}, + url = {https://doi.org/10.1561/0400000042}, + issn = {1551-305X,1551-3068}, + journal = {Foundations and Trends® in Theoretical Computer Science}, + publisher = {Now Publishers}, + title = {The Algorithmic Foundations of Differential Privacy}, } @article{esteva2017dermatologist, - author = {Esteva, Andre and Kuprel, Brett and Novoa, Roberto A. and Ko, Justin and Swetter, Susan M. and Blau, Helen M. and Thrun, Sebastian}, - doi = {10.1038/nature21056}, - issn = {0028-0836, 1476-4687}, - journal = {Nature}, - number = {7639}, - pages = {115--118}, - publisher = {Springer Science and Business Media LLC}, - source = {Crossref}, - title = {Dermatologist-level classification of skin cancer with deep neural networks}, - url = {https://doi.org/10.1038/nature21056}, - volume = {542}, - year = {2017}, - month = jan, + number = {7639}, + doi = {10.1038/nature21056}, + pages = {115--118}, + source = {Crossref}, + volume = {542}, + author = {Esteva, Andre and Kuprel, Brett and Novoa, Roberto A. and Ko, Justin and Swetter, Susan M. and Blau, Helen M. and Thrun, Sebastian}, + year = {2017}, + month = jan, + url = {https://doi.org/10.1038/nature21056}, + issn = {0028-0836,1476-4687}, + journal = {Nature}, + publisher = {Springer Science and Business Media LLC}, + title = {Dermatologist-level classification of skin cancer with deep neural networks}, } @inproceedings{gruslys2016memory, - author = {Gruslys, Audrunas and Munos, R\'emi and Danihelka, Ivo and Lanctot, Marc and Graves, Alex}, - editor = {Lee, Daniel D. 
and Sugiyama, Masashi and von Luxburg, Ulrike and Guyon, Isabelle and Garnett, Roman}, - bibsource = {dblp computer science bibliography, https://dblp.org}, - biburl = {https://dblp.org/rec/conf/nips/GruslysMDLG16.bib}, - booktitle = {Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain}, - pages = {4125--4133}, - timestamp = {Thu, 21 Jan 2021 00:00:00 +0100}, - title = {Memory-Efficient Backpropagation Through Time}, - url = {https://proceedings.neurips.cc/paper/2016/hash/a501bebf79d570651ff601788ea9d16d-Abstract.html}, - year = {2016}, + author = {Gruslys, Audrunas and Munos, R\'emi and Danihelka, Ivo and Lanctot, Marc and Graves, Alex}, + editor = {Lee, Daniel D. and Sugiyama, Masashi and von Luxburg, Ulrike and Guyon, Isabelle and Garnett, Roman}, + bibsource = {dblp computer science bibliography, https://dblp.org}, + biburl = {https://dblp.org/rec/conf/nips/GruslysMDLG16.bib}, + booktitle = {Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain}, + pages = {4125--4133}, + timestamp = {Thu, 21 Jan 2021 00:00:00 +0100}, + title = {Memory-Efficient Backpropagation Through Time}, + url = {https://proceedings.neurips.cc/paper/2016/hash/a501bebf79d570651ff601788ea9d16d-Abstract.html}, + year = {2016}, } @inproceedings{hong2023publishing, - author = {Hong, Sanghyun and Carlini, Nicholas and Kurakin, Alexey}, - booktitle = {2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML)}, - doi = {10.1109/satml54575.2023.00026}, - organization = {IEEE}, - pages = {271--290}, - publisher = {IEEE}, - source = {Crossref}, - title = {Publishing Efficient On-device Models Increases Adversarial Vulnerability}, - url = {https://doi.org/10.1109/satml54575.2023.00026}, - year = {2023}, - month = feb, + doi = {10.1109/satml54575.2023.00026}, + pages = {271--290}, + source = {Crossref}, + volume = {abs 1603 5279}, + author = {Hong, Sanghyun and Carlini, Nicholas and Kurakin, Alexey}, + year = {2023}, + month = feb, + url = {https://doi.org/10.1109/satml54575.2023.00026}, + booktitle = {2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML)}, + publisher = {IEEE}, + title = {Publishing Efficient On-device Models Increases Adversarial Vulnerability}, + organization = {IEEE}, } @inproceedings{kairouz2015secure, - author = {Kairouz, Peter and Oh, Sewoong and Viswanath, Pramod}, - editor = {Cortes, Corinna and Lawrence, Neil D. and Lee, Daniel D. 
and Sugiyama, Masashi and Garnett, Roman}, - bibsource = {dblp computer science bibliography, https://dblp.org}, - biburl = {https://dblp.org/rec/conf/nips/KairouzOV15.bib}, - booktitle = {Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada}, - pages = {2008--2016}, - timestamp = {Thu, 21 Jan 2021 00:00:00 +0100}, - title = {Secure Multi-party Differential Privacy}, - url = {https://proceedings.neurips.cc/paper/2015/hash/a01610228fe998f515a72dd730294d87-Abstract.html}, - year = {2015}, + author = {Kairouz, Peter and Oh, Sewoong and Viswanath, Pramod}, + title = {Secure Multi-party Differential Privacy.}, + journal = {NIPS}, + pages = {2008--2016}, + year = {2015}, + url = {https://proceedings.neurips.cc/paper/2015/hash/a01610228fe998f515a72dd730294d87-Abstract.html}, + source = {DBLP}, + editor = {Cortes, Corinna and Lawrence, Neil D. and Lee, Daniel D. and Sugiyama, Masashi and Garnett, Roman}, + bibsource = {dblp computer science bibliography, https://dblp.org}, + biburl = {https://dblp.org/rec/conf/nips/KairouzOV15.bib}, + booktitle = {Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada}, + timestamp = {Thu, 21 Jan 2021 00:00:00 +0100}, } @article{karargyris2023federated, - author = {Karargyris, Alexandros and Umeton, Renato and Sheller, Micah J and Aristizabal, Alejandro and George, Johnu and Wuest, Anna and Pati, Sarthak and Kassem, Hasan and Zenk, Maximilian and Baid, Ujjwal and others}, - doi = {10.1038/s42256-023-00652-2}, - issn = {2522-5839}, - journal = {Nature Machine Intelligence}, - number = {7}, - pages = {799--810}, - publisher = {Springer Science and Business Media LLC}, - source = {Crossref}, - title = {Federated benchmarking of medical artificial intelligence with {MedPerf}}, - url = {https://doi.org/10.1038/s42256-023-00652-2}, - volume = {5}, - year = {2023}, - month = jul, + number = {7}, + doi = {10.1038/s42256-023-00652-2}, + pages = {799--810}, + source = {Crossref}, + volume = {5}, + author = {Karargyris, Alexandros and Umeton, Renato and Sheller, Micah J. and Aristizabal, Alejandro and George, Johnu and Wuest, Anna and Pati, Sarthak and Kassem, Hasan and Zenk, Maximilian and Baid, Ujjwal and Narayana Moorthy, Prakash and Chowdhury, Alexander and Guo, Junyi and Nalawade, Sahil and Rosenthal, Jacob and Kanter, David and Xenochristou, Maria and Beutel, Daniel J. and Chung, Verena and Bergquist, Timothy and Eddy, James and Abid, Abubakar and Tunstall, Lewis and Sanseviero, Omar and Dimitriadis, Dimitrios and Qian, Yiming and Xu, Xinxing and Liu, Yong and Goh, Rick Siow Mong and Bala, Srini and Bittorf, Victor and Puchala, Sreekar Reddy and Ricciuti, Biagio and Samineni, Soujanya and Sengupta, Eshna and Chaudhari, Akshay and Coleman, Cody and Desinghu, Bala and Diamos, Gregory and Dutta, Debo and Feddema, Diane and Fursin, Grigori and Huang, Xinyuan and Kashyap, Satyananda and Lane, Nicholas and Mallick, Indranil and and and and Mascagni, Pietro and Mehta, Virendra and Moraes, Cassiano Ferro and Natarajan, Vivek and Nikolov, Nikola and Padoy, Nicolas and Pekhimenko, Gennady and Reddi, Vijay Janapa and Reina, G. Anthony and Ribalta, Pablo and Singh, Abhishek and Thiagarajan, Jayaraman J. and Albrecht, Jacob and Wolf, Thomas and Miller, Geralyn and Fu, Huazhu and Shah, Prashant and Xu, Daguang and Yadav, Poonam and Talby, David and Awad, Mark M. 
and Howard, Jeremy P. and Rosenthal, Michael and Marchionni, Luigi and Loda, Massimo and Johnson, Jason M. and Bakas, Spyridon and Mattson, Peter}, + year = {2023}, + month = jul, + url = {https://doi.org/10.1038/s42256-023-00652-2}, + issn = {2522-5839}, + journal = {Nature Machine Intelligence}, + publisher = {Springer Science and Business Media LLC}, + title = {Federated benchmarking of medical artificial intelligence with MedPerf}, } @article{kwon2023tinytrain, - author = {Kwon, Young D and Li, Rui and Venieris, Stylianos I and Chauhan, Jagmohan and Lane, Nicholas D and Mascolo, Cecilia}, - journal = {ArXiv preprint}, - title = {{TinyTrain:} {Deep} Neural Network Training at the Extreme Edge}, - url = {https://arxiv.org/abs/2307.09988}, - volume = {abs/2307.09988}, - year = {2023}, + url = {http://arxiv.org/abs/2307.09988v2}, + year = {2023}, + month = jul, + title = {TinyTrain: Resource-Aware Task-Adaptive Sparse Training of DNNs at the Data-Scarce Edge}, + author = {Kwon, Young D. and Li, Rui and Venieris, Stylianos I. and Chauhan, Jagmohan and Lane, Nicholas D. and Mascolo, Cecilia}, + primaryclass = {cs.LG}, + archiveprefix = {arXiv}, + journal = {ArXiv preprint}, + volume = {abs/2307.09988}, } @inproceedings{li2016lightrnn, - author = {Li, Xiang and Qin, Tao and Yang, Jian and Liu, Tie-Yan}, - editor = {Lee, Daniel D. and Sugiyama, Masashi and von Luxburg, Ulrike and Guyon, Isabelle and Garnett, Roman}, - bibsource = {dblp computer science bibliography, https://dblp.org}, - biburl = {https://dblp.org/rec/conf/nips/LiQYHL16.bib}, - booktitle = {Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain}, - pages = {4385--4393}, - timestamp = {Thu, 21 Jan 2021 00:00:00 +0100}, - title = {{LightRNN:} {Memory} and Computation-Efficient Recurrent Neural Networks}, - url = {https://proceedings.neurips.cc/paper/2016/hash/c3e4035af2a1cde9f21e1ae1951ac80b-Abstract.html}, - year = {2016}, + author = {Li, Xiang and Qin, Tao and Yang, Jian and Liu, Tie-Yan}, + editor = {Lee, Daniel D. 
and Sugiyama, Masashi and von Luxburg, Ulrike and Guyon, Isabelle and Garnett, Roman}, + bibsource = {dblp computer science bibliography, https://dblp.org}, + biburl = {https://dblp.org/rec/conf/nips/LiQYHL16.bib}, + booktitle = {Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain}, + pages = {4385--4393}, + timestamp = {Thu, 21 Jan 2021 00:00:00 +0100}, + title = {LightRNN: Memory and Computation-Efficient Recurrent Neural Networks}, + url = {https://proceedings.neurips.cc/paper/2016/hash/c3e4035af2a1cde9f21e1ae1951ac80b-Abstract.html}, + year = {2016}, } @inproceedings{lin2020mcunet, - author = {Lin, Ji and Chen, Wei-Ming and Lin, Yujun and Cohn, John and Gan, Chuang and Han, Song}, - editor = {Larochelle, Hugo and Ranzato, Marc'Aurelio and Hadsell, Raia and Balcan, Maria-Florina and Lin, Hsuan-Tien}, - bibsource = {dblp computer science bibliography, https://dblp.org}, - biburl = {https://dblp.org/rec/conf/nips/LinCLCG020.bib}, - booktitle = {Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual}, - timestamp = {Thu, 11 Feb 2021 00:00:00 +0100}, - title = {{MCUNet:} {Tiny} Deep Learning on {IoT} Devices}, - url = {https://proceedings.neurips.cc/paper/2020/hash/86c51678350f656dcc7f490a43946ee5-Abstract.html}, - year = {2020}, + author = {Lin, Ji and Chen, Wei-Ming and Lin, Yujun and Cohn, John and Gan, Chuang and Han, Song}, + editor = {Larochelle, Hugo and Ranzato, Marc'Aurelio and Hadsell, Raia and Balcan, Maria-Florina and Lin, Hsuan-Tien}, + bibsource = {dblp computer science bibliography, https://dblp.org}, + biburl = {https://dblp.org/rec/conf/nips/LinCLCG020.bib}, + booktitle = {Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual}, + timestamp = {Thu, 11 Feb 2021 00:00:00 +0100}, + title = {MCUNet: Tiny Deep Learning on IoT Devices}, + url = {https://proceedings.neurips.cc/paper/2020/hash/86c51678350f656dcc7f490a43946ee5-Abstract.html}, + year = {2020}, } @article{lin2022device, - author = {Lin, Ji and Zhu, Ligeng and Chen, Wei-Ming and Wang, Wei-Chen and Gan, Chuang and Han, Song}, - journal = {Adv. Neur. In.}, - pages = {22941--22954}, - title = {On-device training under 256kb memory}, - volume = {35}, - year = {2022}, + author = {Lin, Ji and Zhu, Ligeng and Chen, Wei-Ming and Wang, Wei-Chen and Gan, Chuang and Han, Song}, + journal = {Adv. Neur. 
In.}, + pages = {22941--22954}, + title = {On-device training under 256kb memory}, + volume = {35}, + year = {2022}, } @article{moshawrab2023reviewing, - author = {Moshawrab, Mohammad and Adda, Mehdi and Bouzouane, Abdenour and Ibrahim, Hussein and Raad, Ali}, - doi = {10.3390/electronics12102287}, - issn = {2079-9292}, - journal = {Electronics}, - number = {10}, - pages = {2287}, - publisher = {MDPI AG}, - source = {Crossref}, - title = {Reviewing Federated Learning Aggregation Algorithms; Strategies, Contributions, Limitations and Future Perspectives}, - url = {https://doi.org/10.3390/electronics12102287}, - volume = {12}, - year = {2023}, - month = may, + number = {10}, + doi = {10.3390/electronics12102287}, + pages = {2287}, + source = {Crossref}, + volume = {12}, + author = {Moshawrab, Mohammad and Adda, Mehdi and Bouzouane, Abdenour and Ibrahim, Hussein and Raad, Ali}, + year = {2023}, + month = may, + url = {https://doi.org/10.3390/electronics12102287}, + issn = {2079-9292}, + journal = {Electronics}, + publisher = {MDPI AG}, + title = {Reviewing Federated Learning Aggregation Algorithms; Strategies, Contributions, Limitations and Future Perspectives}, } @inproceedings{nguyen2023re, - author = {Nguyen, Ngoc-Bao and Chandrasegaran, Keshigeyan and Abdollahzadeh, Milad and Cheung, Ngai-Man}, - booktitle = {2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, - doi = {10.1109/cvpr52729.2023.01572}, - pages = {16384--16393}, - publisher = {IEEE}, - source = {Crossref}, - title = {Re-Thinking Model Inversion Attacks Against Deep Neural Networks}, - url = {https://doi.org/10.1109/cvpr52729.2023.01572}, - year = {2023}, - month = jun, + doi = {10.1109/cvpr52729.2023.01572}, + pages = {16384--16393}, + source = {Crossref}, + author = {Nguyen, Ngoc-Bao and Chandrasegaran, Keshigeyan and Abdollahzadeh, Milad and Cheung, Ngai-Man}, + year = {2023}, + month = jun, + url = {https://doi.org/10.1109/cvpr52729.2023.01572}, + booktitle = {2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, + publisher = {IEEE}, + title = {Re-Thinking Model Inversion Attacks Against Deep Neural Networks}, } @article{pan2009survey, - author = {Pan, Sinno Jialin and Yang, Qiang}, - doi = {10.1109/tkde.2009.191}, - issn = {1041-4347}, - journal = {IEEE Trans. Knowl. 
Data Eng.}, - number = {10}, - pages = {1345--1359}, - publisher = {Institute of Electrical and Electronics Engineers (IEEE)}, - source = {Crossref}, - title = {A Survey on Transfer Learning}, - url = {https://doi.org/10.1109/tkde.2009.191}, - volume = {22}, - year = {2010}, - month = oct, + number = {10}, + doi = {10.1109/tkde.2009.191}, + pages = {1345--1359}, + source = {Crossref}, + volume = {22}, + author = {Pan, Sinno Jialin and Yang, Qiang}, + year = {2010}, + month = oct, + url = {https://doi.org/10.1109/tkde.2009.191}, + issn = {1041-4347}, + journal = {IEEE Transactions on Knowledge and Data Engineering}, + publisher = {Institute of Electrical and Electronics Engineers (IEEE)}, + title = {A Survey on Transfer Learning}, } @inproceedings{rouhani2017tinydl, - author = {Darvish Rouhani, Bita and Mirhoseini, Azalia and Koushanfar, Farinaz}, - bdsk-url-1 = {https://doi.org/10.1109/ISCAS.2017.8050343}, - booktitle = {2017 IEEE International Symposium on Circuits and Systems (ISCAS)}, - doi = {10.1109/iscas.2017.8050343}, - pages = {1--4}, - publisher = {IEEE}, - source = {Crossref}, - title = {{TinyDL:} {Just-in-time} deep learning solution for constrained embedded systems}, - url = {https://doi.org/10.1109/iscas.2017.8050343}, - year = {2017}, - month = may, + doi = {10.1109/iscas.2017.8050343}, + pages = {1--4}, + source = {Crossref}, + author = {Darvish Rouhani, Bita and Mirhoseini, Azalia and Koushanfar, Farinaz}, + year = {2017}, + month = may, + url = {https://doi.org/10.1109/iscas.2017.8050343}, + booktitle = {2017 IEEE International Symposium on Circuits and Systems (ISCAS)}, + publisher = {IEEE}, + title = {TinyDL: Just-in-time deep learning solution for constrained embedded systems}, + bdsk-url-1 = {https://doi.org/10.1109/ISCAS.2017.8050343}, } @inproceedings{shi2022data, - author = {Shi, Hongrui and Radu, Valentin}, - booktitle = {Proceedings of the 2nd European Workshop on Machine Learning and Systems}, - doi = {10.1145/3517207.3526980}, - pages = {72--78}, - publisher = {ACM}, - source = {Crossref}, - title = {Data selection for efficient model update in federated learning}, - url = {https://doi.org/10.1145/3517207.3526980}, - year = {2022}, - month = apr, + doi = {10.1145/3517207.3526980}, + pages = {72--78}, + source = {Crossref}, + author = {Shi, Hongrui and Radu, Valentin}, + year = {2022}, + month = apr, + url = {https://doi.org/10.1145/3517207.3526980}, + booktitle = {Proceedings of the 2nd European Workshop on Machine Learning and Systems}, + publisher = {ACM}, + title = {Data selection for efficient model update in federated learning}, } @article{wu2022sustainable, - author = {Wu, Carole-Jean and Raghavendra, Ramya and Gupta, Udit and Acun, Bilge and Ardalani, Newsha and Maeng, Kiwan and Chang, Gloria and Aga, Fiona and Huang, Jinshi and Bai, Charles and others}, - journal = {Proceedings of Machine Learning and Systems}, - pages = {795--813}, - title = {Sustainable ai: {Environmental} implications, challenges and opportunities}, - volume = {4}, - year = {2022}, + author = {Wu, Carole-Jean and Raghavendra, Ramya and Gupta, Udit and Acun, Bilge and Ardalani, Newsha and Maeng, Kiwan and Chang, Gloria and Aga, Fiona and Huang, Jinshi and Bai, Charles and others}, + journal = {Proceedings of Machine Learning and Systems}, + pages = {795--813}, + title = {Sustainable ai: Environmental implications, challenges and opportunities}, + volume = {4}, + year = {2022}, } @article{xu2023federated, - author = {Xu, Zheng and Zhang, Yanxiang and Andrew, Galen and Choquette-Choo, 
Christopher A and Kairouz, Peter and McMahan, H Brendan and Rosenstock, Jesse and Zhang, Yuanbo}, - journal = {ArXiv preprint}, - title = {Federated Learning of Gboard Language Models with Differential Privacy}, - url = {https://arxiv.org/abs/2305.18465}, - volume = {abs/2305.18465}, - year = {2023}, + url = {http://arxiv.org/abs/2305.18465v2}, + year = {2023}, + month = may, + title = {Federated Learning of Gboard Language Models with Differential Privacy}, + author = {Xu, Zheng and Zhang, Yanxiang and Andrew, Galen and Choquette-Choo, Christopher A. and Kairouz, Peter and McMahan, H. Brendan and Rosenstock, Jesse and Zhang, Yuanbo}, + primaryclass = {cs.LG}, + archiveprefix = {arXiv}, + journal = {ArXiv preprint}, + volume = {abs/2305.18465}, } @inproceedings{yang2023online, - author = {Yang, Tien-Ju and Xiao, Yonghui and Motta, Giovanni and Beaufays, Fran\c{c}oise and Mathews, Rajiv and Chen, Mingqing}, - booktitle = {ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, - doi = {10.1109/icassp49357.2023.10097124}, - organization = {IEEE}, - pages = {1--5}, - publisher = {IEEE}, - source = {Crossref}, - title = {Online Model Compression for Federated Learning with Large Models}, - url = {https://doi.org/10.1109/icassp49357.2023.10097124}, - year = {2023}, - month = jun, + doi = {10.1109/icassp49357.2023.10097124}, + pages = {1--5}, + source = {Crossref}, + author = {Yang, Tien-Ju and Xiao, Yonghui and Motta, Giovanni and Beaufays, Fran\c{c}oise and Mathews, Rajiv and Chen, Mingqing}, + year = {2023}, + month = jun, + url = {https://doi.org/10.1109/icassp49357.2023.10097124}, + booktitle = {ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, + publisher = {IEEE}, + title = {Online Model Compression for Federated Learning with Large Models}, + organization = {IEEE}, } @article{zhao2018federated, - author = {Zhao, Yue and Li, Meng and Lai, Liangzhen and Suda, Naveen and Civin, Damon and Chandra, Vikas}, - journal = {ArXiv preprint}, - title = {Federated learning with non-iid data}, - url = {https://arxiv.org/abs/1806.00582}, - volume = {abs/1806.00582}, - year = {2018}, + url = {http://arxiv.org/abs/1806.00582v2}, + year = {2018}, + month = jun, + title = {Federated Learning with Non-IID Data}, + author = {Zhao, Yue and Li, Meng and Lai, Liangzhen and Suda, Naveen and Civin, Damon and Chandra, Vikas}, + primaryclass = {cs.LG}, + archiveprefix = {arXiv}, + journal = {ArXiv preprint}, + volume = {abs/1806.00582}, } @article{zhuang2021comprehensive, - author = {Zhuang, Fuzhen and Qi, Zhiyuan and Duan, Keyu and Xi, Dongbo and Zhu, Yongchun and Zhu, Hengshu and Xiong, Hui and He, Qing}, - journal = {Proc. 
IEEE}, - title = {A Comprehensive Survey on Transfer Learning}, - year = {2021}, - volume = {109}, - number = {1}, - pages = {43--76}, - keywords = {Transfer learning;Semisupervised learning;Data models;Covariance matrices;Machine learning;Adaptation models;Domain adaptation;interpretation;machine learning;transfer learning}, - doi = {10.1109/jproc.2020.3004555}, - source = {Crossref}, - url = {https://doi.org/10.1109/jproc.2020.3004555}, - publisher = {Institute of Electrical and Electronics Engineers (IEEE)}, - issn = {0018-9219, 1558-2256}, - month = jan, -} + number = {1}, + doi = {10.1109/jproc.2020.3004555}, + pages = {43--76}, + source = {Crossref}, + volume = {109}, + author = {Zhuang, Fuzhen and Qi, Zhiyuan and Duan, Keyu and Xi, Dongbo and Zhu, Yongchun and Zhu, Hengshu and Xiong, Hui and He, Qing}, + year = {2021}, + month = jan, + url = {https://doi.org/10.1109/jproc.2020.3004555}, + issn = {0018-9219,1558-2256}, + journal = {Proceedings of the IEEE}, + publisher = {Institute of Electrical and Electronics Engineers (IEEE)}, + title = {A Comprehensive Survey on Transfer Learning}, + keywords = {Transfer learning;Semisupervised learning;Data models;Covariance matrices;Machine learning;Adaptation models;Domain adaptation;interpretation;machine learning;transfer learning}, +} \ No newline at end of file diff --git a/contents/ondevice_learning/ondevice_learning.qmd b/contents/ondevice_learning/ondevice_learning.qmd index 71ebd194..80e85c4e 100644 --- a/contents/ondevice_learning/ondevice_learning.qmd +++ b/contents/ondevice_learning/ondevice_learning.qmd @@ -34,7 +34,9 @@ On-device Learning refers to training ML models directly on the device where the An example of On-Device Learning can be seen in a smart thermostat that adapts to user behavior over time. Initially, the thermostat may have a generic model that understands basic usage patterns. However, as it is exposed to more data, such as the times the user is home or away, preferred temperatures, and external weather conditions, the thermostat can refine its model directly on the device to provide a personalized experience. This is all done without sending data back to a central server for processing. -Another example is in predictive text on smartphones. As users type, the phone learns from the user's language patterns and suggests words or phrases that are likely to be used next. This learning happens directly on the device, and the model updates in real-time as more data is collected. A widely used real-world example of on-device learning is Gboard. On an Android phone, Gboard learns from typing and dictation patterns to enhance the experience for all users. On-device learning is also called federated learning. @fig-federated-cycle shows the cycle of federated learning on mobile devices: A. the device learns from user patterns; B. local model updates are communicated to the cloud; C. the cloud server updates the global model and sends the new model to all the devices. +Another example is in predictive text on smartphones. As users type, the phone learns from the user's language patterns and suggests words or phrases that are likely to be used next. This learning happens directly on the device, and the model updates in real-time as more data is collected. A widely used real-world example of on-device learning is [Gboard](https://play.google.com/store/apps/details?id=com.google.android.inputmethod.latin). 
On an Android phone, Gboard [learns from typing and dictation patterns](https://research.google/blog/federated-learning-collaborative-machine-learning-without-centralized-training-data/) to enhance the experience for all users. When this learning is coordinated across many devices, each training locally and sharing only model updates rather than raw data with a central server, the approach is known as federated learning.
+
+@fig-federated-cycle shows the cycle of federated learning on mobile devices: *A.* the device learns from user patterns; *B.* local model updates are communicated to the cloud; *C.* the cloud server updates the global model and sends the new model to all the devices.

![Federated learning cycle. Source: [Google Research.](https://ai.googleblog.com/2017/04/federated-learning-collaborative.html)](images/png/ondevice_intro.png){#fig-federated-cycle}

@@ -169,7 +171,7 @@ With some refinements, these classical ML algorithms can be adapted to specific

Pruning is a technique for reducing the size and complexity of an ML model to improve its efficiency and generalization performance. This is beneficial for training models on edge devices, where we want to minimize resource usage while maintaining competitive accuracy.

-The primary goal of pruning is to remove parts of the model that do not contribute significantly to its predictive power while retaining the most informative aspects. In the context of decision trees, pruning involves removing some branches (subtrees) from the tree, leading to a smaller and simpler tree. In the context of DNN, pruning is used to reduce the number of neurons (units) or connections in the network, as shown in @fig-ondevice-pruning.
+The primary goal of pruning is to remove parts of the model that do not contribute significantly to its predictive power while retaining the most informative aspects. In the context of decision trees, pruning involves removing some branches (subtrees) from the tree, leading to a smaller and simpler tree. Similarly, when applied to Deep Neural Networks (DNNs), pruning reduces the number of neurons (units) or connections in the network. This process is illustrated in @fig-ondevice-pruning, which demonstrates how pruning can simplify a neural network structure by eliminating less important connections or units, resulting in a more compact and efficient model.

![Network pruning.](images/jpg/pruning.jpeg){#fig-ondevice-pruning}

@@ -191,7 +193,7 @@ Quantization is a common method for reducing the memory footprint of DNN trainin

A specific algorithmic technique is Quantization-Aware Scaling (QAS), which improves the performance of neural networks on low-precision hardware, such as edge devices, mobile devices, or TinyML systems, by adjusting the scale factors during the quantization process.

-As we discussed in the Model Optimizations chapter, quantization is the process of mapping a continuous range of values to a discrete set of values. In the context of neural networks, quantization often involves reducing the precision of the weights and activations from 32-bit floating point to lower-precision formats such as 8-bit integers. This reduction in precision can significantly reduce the computational cost and memory footprint of the model, making it suitable for deployment on low-precision hardware. @fig-float-int-quantization is an example of float-to-integer quantization.
+As we discussed in the [Model Optimizations](../optimizations/optimizations.qmd) chapter, quantization is the process of mapping a continuous range of values to a discrete set of values. 
In the context of neural networks, this often involves reducing the precision of weights and activations from 32-bit floating point to lower-precision formats such as 8-bit integers. This reduction in precision can significantly decrease the model's computational cost and memory footprint, making it suitable for deployment on low-precision hardware. @fig-float-int-quantization illustrates this concept, showing an example of float-to-integer quantization where high-precision floating-point values are mapped to a more compact integer representation. This visual representation helps to clarify how quantization can maintain the essential structure of the data while reducing its complexity and storage requirements. ![Float to integer quantization. Source: [Nvidia.](https://developer-blogs.nvidia.com/wp-content/uploads/2021/07/qat-training-precision.png)](images/png/ondevice_quantization_matrix.png){#fig-float-int-quantization} @@ -231,9 +233,9 @@ Other more common methods of data compression focus on reducing the dimensionali ## Transfer Learning -Transfer learning is an ML technique in which a model developed for a particular task is reused as the starting point for a model on a second task. In the context of on-device AI, transfer learning allows us to leverage pre-trained models that have already learned useful representations from large datasets and finetune them for specific tasks using smaller datasets directly on the device. This can significantly reduce the computational resources and time required for training models from scratch. +Transfer learning is a technique in which a model developed for a particular task is reused as the starting point for a model on a second task. Transfer learning allows us to leverage pre-trained models that have already learned useful representations from large datasets and finetune them for specific tasks using smaller datasets directly on the device. This can significantly reduce the computational resources and time required for training models from scratch. -@fig-transfer-learning-apps includes some intuitive examples of transfer learning from the real world. For instance, if you can ride a bicycle, you know how to balance yourself on two-wheel vehicles. Then, it would be easier for you to learn how to ride a motorcycle than it would be for someone who cannot ride a bicycle. +It can be understood through intuitive real-world examples, as illustrated in @fig-transfer-learning-apps. The figure shows scenarios where skills from one domain can be applied to accelerate learning in a related field. A prime example is the relationship between riding a bicycle and a motorcycle. If you can ride a bicycle, you would have already mastered the skill of balancing on a two-wheeled vehicle. The foundational knowledge about this skill makes it significantly easier for you to learn how to ride a motorcycle compared to someone without any cycling experience. The figure depicts this and other similar scenarios, demonstrating how transfer learning leverages existing knowledge to expedite the acquisition of new, related skills. ![Transferring knowledge between tasks. Source: @zhuang2021comprehensive.](images/png/ondevice_transfer_learning_apps.png){#fig-transfer-learning-apps} @@ -414,13 +416,12 @@ Learn more about transfer learning in @vid-tl below. ## Federated Machine Learning {#sec-fl} -Federated Learning Overview +### Federated Learning Overview The modern internet is full of large networks of connected devices. 
Whether it's cell phones, thermostats, smart speakers, or other IOT products, countless edge devices are a goldmine for hyper-personalized, rich data. However, with that rich data comes an assortment of problems with information transfer and privacy. Constructing a training dataset in the cloud from these devices would involve high volumes of bandwidth, cost-efficient data transfer, and violation of users' privacy. -Federated learning offers a solution to these problems: train models partially on the edge devices and only communicate model updates to the cloud. In 2016, a team from Google designed architecture for federated learning that attempts to address these problems. +Federated learning offers a solution to these problems: train models partially on the edge devices and only communicate model updates to the cloud. In 2016, a team from Google designed architecture for federated learning that attempts to address these problems. In their initial paper, @mcmahan2017communication outline a principle federated learning algorithm called FederatedAveraging, shown in @fig-federated-avg-algo. Specifically, FederatedAveraging performs stochastic gradient descent (SGD) over several different edge devices. In this process, each device calculates a gradient $g_k = \nabla F_k(w_t)$ which is then applied to update the server-side weights as (with $\eta$ as learning rate across $k$ clients): -In their initial paper, Google outlines a principle federated learning algorithm called FederatedAveraging, which is shown in @fig-federated-avg-algo. Specifically, FederatedAveraging performs stochastic gradient descent (SGD) over several different edge devices. In this process, each device calculates a gradient $g_k = \nabla F_k(w_t)$ which is then applied to update the server-side weights as (with $\eta$ as learning rate across $k$ clients): $$ w_{t+1} \rightarrow w_t - \eta \sum_{k=1}^{K} \frac{n_k}{n}g_k $$ @@ -444,7 +445,6 @@ With this proposed structure, there are a few key vectors for further optimizing ![Federated learning is revolutionizing on-device learning.](images/png/federatedvsoil.png){#fig-federated-learning} - ### Communication Efficiency One of the key bottlenecks in federated learning is communication. Every time a client trains the model, they must communicate their updates back to the server. Similarly, once the server has averaged all the updates, it must send them back to the client. This incurs huge bandwidth and resource costs on large networks of millions of devices. As the field of federated learning advances, a few optimizations have been developed to minimize this communication. To address the footprint of the model, researchers have developed model compression techniques. In the client-server protocol, federated learning can also minimize communication through the selective sharing of updates on clients. Finally, efficient aggregation techniques can also streamline the communication process. @@ -459,7 +459,7 @@ In 2022, another team at Google proposed that each client communicates via a com There are many methods for selectively sharing updates. The general principle is that reducing the portion of the model that the clients are training on the edge reduces the memory necessary for training and the size of communication to the server. In basic federated learning, the client trains the entire model. This means that when a client sends an update to the server, it has gradients for every weight in the network. 
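+To make the bandwidth cost concrete, the short sketch below estimates per-round upload traffic for full-model updates versus partial ones; the parameter counts and client numbers are hypothetical, chosen purely for illustration.
+
+```python
+# Back-of-the-envelope estimate of federated learning upload costs.
+# All numbers below are illustrative assumptions, not measurements.
+
+def update_size_mb(num_params: int, bytes_per_param: int = 4) -> float:
+    """Size of one client's update (float32 gradients/weights), in megabytes."""
+    return num_params * bytes_per_param / 1e6
+
+full_model_params = 5_000_000       # the client ships gradients for every weight
+partial_model_params = 250_000      # the client trains and ships only a small slice of the model
+clients_per_round = 1_000_000
+
+full_mb = update_size_mb(full_model_params)
+partial_mb = update_size_mb(partial_model_params)
+
+print(f"Full update:    {full_mb:.1f} MB per client per round")
+print(f"Partial update: {partial_mb:.1f} MB per client per round")
+# Aggregated over a million clients, the difference is measured in terabytes per round.
+print(f"Upload saved per round: {(full_mb - partial_mb) * clients_per_round / 1e6:.1f} TB")
+```
+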
-However, we cannot just reduce communication by sending pieces of those gradients from each client to the server because the gradients are part of an entire update required to improve the model. Instead, you need to architecturally design the model such that each client trains only a small portion of the broader model, reducing the total communication while still gaining the benefit of training on client data. A paper [@shi2022data] from the University of Sheffield applies this concept to a CNN by splitting the global model into two parts: an upper and a lower part, as shown in @chen2023learning.
+However, we cannot just reduce communication by sending pieces of those gradients from each client to the server because the gradients are part of an entire update required to improve the model. Instead, you need to architecturally design the model such that each client trains only a small portion of the broader model, reducing the total communication while still gaining the benefit of training on client data. @shi2022data apply this concept to a CNN by splitting the global model into two parts: an upper and a lower part, as shown in @fig-split-model.

![Split model architecture for selective sharing. Source: Shi et al., ([2022](https://doi.org/10.1145/3517207.3526980)).](images/png/ondevice_split_model.png){#fig-split-model}

@@ -467,7 +467,9 @@

### Optimized Aggregation

-In addition to reducing the communication overhead, optimizing the aggregation function can improve model training speed and accuracy in certain federated learning use cases. While the standard for aggregation is just averaging, various other approaches can improve model efficiency, accuracy, and security. One alternative is clipped averaging, which clips the model updates within a specific range. Another strategy to preserve security is differential privacy average aggregation. This approach integrates differential privacy into the aggregation step to protect client identities. Each client adds a layer of random noise to their updates before communicating to the server. The server then updates itself with the noisy updates, meaning that the amount of noise needs to be tuned carefully to balance privacy and accuracy.
+In addition to reducing the communication overhead, optimizing the aggregation function can improve model training speed and accuracy in certain federated learning use cases. While the standard for aggregation is just averaging, various other approaches can improve model efficiency, accuracy, and security.
+
+One alternative is clipped averaging, which clips the model updates within a specific range. Another strategy to preserve security is differential privacy average aggregation. This approach integrates differential privacy into the aggregation step to protect client identities. Each client adds a layer of random noise to their updates before communicating to the server. The server then updates itself with the noisy updates, meaning that the amount of noise needs to be tuned carefully to balance privacy and accuracy.

In addition to security-enhancing aggregation methods, there are several modifications to the aggregation methods that can improve training speed and performance by adding client metadata along with the weight updates. Momentum aggregation is a technique that helps address the convergence problem. 
In federated learning, client data can be extremely heterogeneous depending on the different environments in which the devices are used. That means that many models with heterogeneous data may need help to converge. Each client stores a momentum term locally, which tracks the pace of change over several updates. With clients communicating this momentum, the server can factor in the rate of change of each update when changing the global model to accelerate convergence. Similarly, weighted aggregation can factor in the client performance or other parameters like device type or network connection strength to adjust the weight with which the server should incorporate the model updates. Further descriptions of specific aggregation algorithms are provided by @moshawrab2023reviewing. @@ -489,7 +491,7 @@ Considering all of the factors influencing the efficacy of federated learning, l When selecting clients, there are three main components to consider: data heterogeneity, resource allocation, and communication cost. We can select clients on the previously proposed metrics in the non-IID section to address data heterogeneity. In federated learning, all devices may have different amounts of computing, resulting in some being more inefficient at training than others. When selecting a subset of clients for training, one must consider a balance of data heterogeneity and available resources. In an ideal scenario, you can always select the subset of clients with the greatest resources. However, this may skew your dataset, so a balance must be struck. Communication differences add another layer; you want to avoid being bottlenecked by waiting for devices with poor connections to transmit all their updates. Therefore, you must also consider choosing a subset of diverse yet well-connected devices. -### An Example of Deployed Federated Learning: G board +### An Example of Deployed Federated Learning: Gboard A primary example of a deployed federated learning system is Google's Keyboard, Gboard, for Android devices. In implementing federated learning for the keyboard, Google focused on employing differential privacy techniques to protect the user's data and identity. Gboard leverages language models for several key features, such as Next Word Prediction (NWP), Smart Compose (SC), and On-The-Fly rescoring (OTF) [@xu2023federated], as shown in @fig-gboard-features. @@ -521,9 +523,13 @@ Want to train an image-savvy AI without sending your photos to the cloud? Federa ::: -### Benchmarking for Federated Learning: MedPerf +### Benchmarking Federated Learning: MedPerf + +Medical devices represent one of the richest examples of data on the edge. These devices store some of the most personal user data while simultaneously offering significant advances in personalized treatment and improved accuracy in medical AI. This combination of sensitive data and potential for innovation makes medical devices an ideal use case for federated learning. + +A key development in this field is MedPerf, an open-source platform designed for benchmarking models using federated evaluation [@karargyris2023federated]. MedPerf goes beyond traditional federated learning by bringing the model to edge devices for testing against personalized data while maintaining privacy. This approach allows a benchmark committee to evaluate various models in real-world scenarios on edge devices without compromising patient anonymity. -One of the richest examples of data on the edge is medical devices. 
These devices store some of the most personal data on users but offer huge advances in personalized treatment and better accuracy in medical AI. Given these two factors, medical devices are the perfect use case for federated learning. [MedPerf](https://doi.org/10.1038/s42256-023-00652-2) is an open-source platform used to benchmark models using federated evaluation [@karargyris2023federated]. Instead of just training models via federated learning, MedPerf takes the model to edge devices to test it against personalized data while preserving privacy. In this way, a benchmark committee can evaluate various models in the real world on edge devices while still preserving patient anonymity. +The MedPerf platform, detailed in a recent study (https://doi.org/10.1038/s42256-023-00652-2), demonstrates how federated techniques can be applied not just to model training, but also to model evaluation and benchmarking. This advancement is particularly crucial in the medical field, where the balance between leveraging large datasets for improved AI performance and protecting individual privacy is of utmost importance. ## Security Concerns @@ -699,10 +705,10 @@ By freezing most weights, TinyTL significantly reduces memory usage during on-de TinyTrain significantly reduces the time required for on-device training by selectively updating only certain parts of the model. It does this using a technique called task-adaptive sparse updating, as shown in @fig-tiny-train. -Based on the user data, memory, and computing available on the device, TinyTrain dynamically chooses which neural network layers to update during training. This layer selection is optimized to reduce computation and memory usage while maintaining high accuracy. - ![TinyTrain workflow. Source: @kwon2023tinytrain.](images/png/ondevice_pretraining.png){#fig-tiny-train} +Based on the user data, memory, and computing available on the device, TinyTrain dynamically chooses which neural network layers to update during training. This layer selection is optimized to reduce computation and memory usage while maintaining high accuracy. + More specifically, TinyTrain first does offline pretraining of the model. During pretraining, it not only trains the model on the task data but also meta-trains the model. Meta-training means training the model on metadata about the training process itself. This meta-learning improves the model's ability to adapt accurately even when limited data is available for the target task. Then, during the online adaptation stage, when the model is being customized on the device, TinyTrain performs task-adaptive sparse updates. Using the criteria around the device's capabilities, it selects only certain layers to update through backpropagation. The layers are chosen to balance accuracy, memory usage, and computation time. diff --git a/contents/ops/ops.qmd b/contents/ops/ops.qmd index 46da88e6..8fe55876 100644 --- a/contents/ops/ops.qmd +++ b/contents/ops/ops.qmd @@ -113,10 +113,12 @@ Learn more about ML Lifecycles through a case study featuring speech recognition In this chapter, we will provide an overview of the core components of MLOps, an emerging set of practices that enables robust delivery and lifecycle management of ML models in production. While some MLOps elements like automation and monitoring were covered in previous chapters, we will integrate them into a framework and expand on additional capabilities like governance. 
Additionally, we will describe and link to popular tools used within each component, such as [LabelStudio](https://labelstud.io/) for data labeling. By the end, we hope that you will understand the end-to-end MLOps methodology that takes models from ideation to sustainable value creation within organizations.

-@fig-ops-layers shows the MLOps system stack. The MLOps lifecycle starts from data management and CI/CD pipelines for model development. Developed models go through model training and evaluation. Once trained to convergence, model deployment brings models up to production and ready to serve. After deployment, model serving reacts to workload changes and meets service level agreements cost-effectively when serving millions of end users or AI applications. Infrastructure management ensures the necessary resources are available and optimized throughout the lifecycle. Continuous monitoring, governance, and communication and collaboration are the remaining pieces of MLOps to ensure seamless development and operations of ML models.
+@fig-ops-layers illustrates the comprehensive MLOps system stack. It shows the various layers involved in machine learning operations. At the top of the stack are ML Models/Applications, such as BERT, followed by ML Frameworks/Platforms like PyTorch. The core MLOps layer, labeled as Model Orchestration, encompasses several key components: Data Management, CI/CD, Model Training, Model Evaluation, Deployment, and Model Serving. Underpinning the MLOps layer is the Infrastructure layer, represented by technologies such as Kubernetes. This layer manages aspects such as Job Scheduling, Resource Management, Capacity Management, and Monitoring, among others. Holding it all together is the Hardware layer, which provides the necessary computational resources for ML operations.

![The MLOps stack, including ML Models, Frameworks, Model Orchestration, Infrastructure, and Hardware, illustrates the end-to-end workflow of MLOps.](images/png/mlops_overview_layers.png){#fig-ops-layers}

+This layered approach in @fig-ops-layers demonstrates how MLOps integrates various technologies and processes to facilitate the development, deployment, and management of machine learning models in a production environment. The figure effectively illustrates the interdependencies between different components and how they come together to form a comprehensive MLOps ecosystem.
+
### Data Management {#sec-ops-data-mgmt}

Robust data management and data engineering actively empower successful [MLOps](https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning) implementations. Teams properly ingest, store, and prepare raw data from sensors, databases, apps, and other systems for model training and deployment.

@@ -492,7 +494,7 @@ Skilled project managers enable MLOps teams to work synergistically to rapidly d

## Embedded System Challenges

-We will briefly review the challenges with embedded systems so that it sets the context for the specific challenges that emerge with embedded MLOps, which we will discuss in the following section.
+Building on our discussion of [On-device Learning](../ondevice_learning/ondevice_learning.qmd) in the previous chapter, we now turn our attention to the broader context of embedded systems in MLOps. The unique constraints and requirements of embedded environments significantly impact the implementation of machine learning models and operations. 
To set the stage for the specific challenges that emerge with embedded MLOps, it is important to first review the general challenges associated with embedded systems. This overview will provide a foundation for understanding how these constraints intersect with and shape the practices of MLOps in resource-limited, edge computing scenarios. ### Limited Compute Resources @@ -762,9 +764,9 @@ Despite the proliferation of new MLOps tools in response to the increase in dema * Seamless deployment onto edge devices through compilation, SDKs, and benchmarks * Collaboration features for teams and integration with other platforms -With Edge Impulse, developers with limited data science expertise can develop specialized ML models that run efficiently within small computing environments. It provides a comprehensive solution for creating embedded intelligence and advancing machine learning. @fig-edge-impulse further illustrates this concept. +Edge Impulse offers a comprehensive solution for creating embedded intelligence and advancing machine learning, particularly for developers with limited data science expertise. This platform enables the development of specialized ML models that run efficiently within small computing environments. As illustrated in @fig-edge-impulse, Edge Impulse facilitates the journey from data collection to model deployment, highlighting its user-friendly interface and tools that simplify the creation of embedded ML solutions, thus making it accessible to a broader range of developers and applications. -![The inner workings of edge impulse. Source: [Edge Impulse](https://www.edgeimpulse.com/blog/getting-started-with-edge-impulse/)](images/png/impulse.png){#fig-edge-impulse} +![Edge impulse overview. Source: [Edge Impulse](https://www.edgeimpulse.com/blog/getting-started-with-edge-impulse/)](images/png/impulse.png){#fig-edge-impulse} ##### User Interface diff --git a/contents/optimizations/optimizations.qmd b/contents/optimizations/optimizations.qmd index 285ab5c5..85c292b8 100644 --- a/contents/optimizations/optimizations.qmd +++ b/contents/optimizations/optimizations.qmd @@ -33,13 +33,17 @@ When machine learning models are deployed on systems, especially on resource-con ## Introduction -We have structured this chapter in three tiers. First, in @sec-model_ops_representation we examine the significance and methodologies of reducing the parameter complexity of models without compromising their inference capabilities. Techniques such as pruning and knowledge distillation are discussed, offering insights into how models can be compressed and simplified while maintaining, or even enhancing, their performance. +The optimization of machine learning models for practical deployment is a critical aspect of AI systems. This chapter focuses on exploring model optimization techniques as they relate to the development of ML systems, ranging from high-level model architecture considerations to low-level hardware adaptations. @fig-3-sections Illustrates the three layers of the optimization stack we cover. -Going one level lower, in @sec-model_ops_numerics, we study the role of numerical precision in model computations and how altering it impacts model size, speed, and accuracy. We will examine the various numerical formats and how reduced-precision arithmetic can be leveraged to optimize models for embedded deployment. 
+![Three layers to be covered.](images/png/modeloptimization_structure.png){#fig-3-sections width=50%} -Finally, as we go lower and closer to the hardware, in @sec-model_ops_hw, we will navigate through the landscape of hardware-software co-design, exploring how models can be optimized by tailoring them to the specific characteristics and capabilities of the target hardware. We will discuss how models can be adapted to exploit the available hardware resources effectively. +At the highest level, we examine methodologies for reducing the complexity of model parameters without compromising inferential capabilities. Techniques such as pruning and knowledge distillation offer powerful approaches to compress and refine models while maintaining or even improving their performance, not only in terms of model quality but also in actual system runtime performance. These methods are crucial for creating efficient models that can be deployed in resource-constrained environments. -![Three layers to be covered.](images/png/modeloptimization_structure.png){#fig-3-sections width=50%} +Furthermore, we explore the role of numerical precision in model computations. Understanding how different levels of numerical precision impact model size, speed, and accuracy is essential for optimizing performance. We investigate various numerical formats and the application of reduced-precision arithmetic, particularly relevant for embedded system deployments where computational resources are often limited. + +At the lowest level, we navigate the intricate landscape of hardware-software co-design. This exploration reveals how models can be tailored to leverage the specific characteristics and capabilities of target hardware platforms. By aligning model design with hardware architecture, we can significantly enhance performance and efficiency. + +This collective approach focuses on helping us develop and deploy efficient, powerful, and hardware-aware machine learning models. From simplifying model architectures to fine-tuning numerical precision and adapting to specific hardware, this chapter covers the full spectrum of optimization strategies. By the conclusion of this chapter, readers will have gained a thorough understanding of various optimization techniques and their practical applications in real-world scenarios. This knowledge is important for creating machine learning models that not only perform well but are also optimized for the constraints and opportunities presented by modern computing environments. ## Efficient Model Representation {#sec-model_ops_representation} @@ -98,7 +102,7 @@ There are several techniques for assigning these importance scores: The idea is to measure, either directly or indirectly, the contribution of each component to the model's output. Structures with minimal influence according to the defined criteria are pruned first. This enables selective, optimized pruning that maximally compresses models while preserving predictive capacity. In general, it is important to evaluate the impact of removing particular structures on the model's output, with recent works such as [@rachwan2022winning] and [@lubana2020gradient] investigating combinations of techniques like magnitude-based pruning and gradient-based pruning. -##### 3. Selecting a pruning strategy +##### 3. Selecting a Pruning Strategy Now that you understand some techniques for determining the importance of structures within a neural network, the next step is to decide how to apply these insights. 
This involves selecting an appropriate pruning strategy, which dictates how and when the identified structures are removed and how the model is fine-tuned to maintain its performance. Two main structured pruning strategies exist: iterative pruning and one-shot pruning. @@ -112,8 +116,7 @@ Consider a situation where we wish to prune the 6 least effective channels (base The choice between these strategies involves weighing factors like model size, target sparsity level, available compute and acceptable accuracy losses. One-shot pruning can rapidly compress models, but iterative pruning may enable better accuracy retention for a target level of pruning. In practice, the strategy is tailored based on use case constraints. The overarching aim is to generate an optimal strategy that removes redundancy, achieves efficiency gains through pruning, and finely tunes the model to stabilize accuracy at an acceptable level for deployment. -Now consider the same network we had in the iterative pruning example. Whereas in the iterative process we pruned 2 channels at a time, in the one-shot pruning we would prune the 6 channels at once (@fig-oneshot-pruning). Removing 27% of the network's channel simultaneously alters the structure significantly, causing the accuracy to drop from 0.995 to 0.914. Given the major changes, the network is not able to properly adapt during fine-tuning, and the accuracy went up to 0.943, a 5% degradation from the accuracy of the unpruned network. While the final structures in both iterative pruning and oneshot pruning processes are identical, the former is able to maintain high performance while the latter suffers significant degradations. - +Now consider the same network we had in the iterative pruning example. Whereas in the iterative process we pruned 2 channels at a time, in the one-shot pruning we would prune the 6 channels at once, as shown in @fig-oneshot-pruning. Removing 27% of the network's channel simultaneously alters the structure significantly, causing the accuracy to drop from 0.995 to 0.914. Given the major changes, the network is not able to properly adapt during fine-tuning, and the accuracy went up to 0.943, a 5% degradation from the accuracy of the unpruned network. While the final structures in both iterative pruning and oneshot pruning processes are identical, the former is able to maintain high performance while the latter suffers significant degradations. ![One-shot pruning.](images/jpg/modeloptimization_oneshot_pruning.jpeg){#fig-oneshot-pruning} @@ -271,12 +274,12 @@ Furthermore, in scenarios where data evolves or grows over time, developing LRMF #### Tensor Decomposition -You have learned in @sec-tensor-data-structures that tensors are flexible structures, commonly used by ML Frameworks, that can represent data in higher dimensions. Similar to low-rank matrix factorization, more complex models may store weights in higher dimensions, such as tensors. Tensor decomposition is the higher-dimensional analogue of matrix factorization, where a model tensor is decomposed into lower rank components (see @fig-tensor-decomposition). These lower-rank components are easier to compute on and store but may suffer from the same issues mentioned above, such as information loss and the need for nuanced hyperparameter tuning. 
Mathematically, given a tensor $\mathcal{A}$, tensor decomposition seeks to represent $\mathcal{A}$ as a combination of simpler tensors, facilitating a compressed representation that approximates the original data while minimizing the loss of information. - -The work of Tamara G. Kolda and Brett W. Bader, ["Tensor Decompositions and Applications"](https://epubs.siam.org/doi/abs/10.1137/07070111X) (2009), stands out as a seminal paper in the field of tensor decompositions. The authors provide a comprehensive overview of various tensor decomposition methods, exploring their mathematical underpinnings, algorithms, and a wide array of applications, ranging from signal processing to data mining. Of course, the reason we are discussing it is because it has huge potential for system performance improvements, particularly in the space of TinyML, where throughput and memory footprint savings are crucial to feasibility of deployments. +You have learned in @sec-tensor-data-structures that tensors are flexible structures, commonly used by ML Frameworks, that can represent data in higher dimensions. Similar to low-rank matrix factorization, more complex models may store weights in higher dimensions, such as tensors. Tensor decomposition is the higher-dimensional analogue of matrix factorization, where a model tensor is decomposed into lower-rank components (see @fig-tensor-decomposition). These lower-rank components are easier to compute on and store but may suffer from the same issues mentioned above, such as information loss and the need for nuanced hyperparameter tuning. Mathematically, given a tensor $\mathcal{A}$, tensor decomposition seeks to represent $\mathcal{A}$ as a combination of simpler tensors, facilitating a compressed representation that approximates the original data while minimizing the loss of information. ![Tensor decomposition. Source: @xinyu.](images/png/modeloptimization_tensor_decomposition.png){#fig-tensor-decomposition} +The work of Tamara G. Kolda and Brett W. Bader, ["Tensor Decompositions and Applications"](https://epubs.siam.org/doi/abs/10.1137/07070111X) (2009), stands out as a seminal paper in the field of tensor decompositions. The authors provide a comprehensive overview of various tensor decomposition methods, exploring their mathematical underpinnings, algorithms, and a wide array of applications, ranging from signal processing to data mining. Of course, the reason we are discussing it is because it has huge potential for system performance improvements, particularly in the space of TinyML, where throughput and memory footprint savings are crucial to feasibility of deployments. + :::{#exr-mc .callout-caution collapse="true"} ### Scalable Model Compression with TensorFlow @@ -325,7 +328,7 @@ TinyNAS and MorphNet represent a few of the many significant advancements in the ### Edge-Aware Model Design -Imagine you're building a tiny robot that can identify different flowers. It needs to be smart, but also small and energy-efficient! In the "Edge-Aware Model Design" world, we learned about techniques like depthwise separable convolutions and architectures like SqueezeNet, MobileNet, and EfficientNet – all designed to pack intelligence into compact models. Now, let's see these ideas in action with some xColabs: +Imagine you're building a tiny robot that can identify different flowers. It needs to be smart, but also small and energy-efficient! 
In the "Edge-Aware Model Design" world, we learned about techniques like depthwise separable convolutions and architectures like SqueezeNet, MobileNet, and EfficientNet---all designed to pack intelligence into compact models. Now, let's see these ideas in action with some xColabs: **SqueezeNet in Action:** Maybe you'd like a Colab showing how to train a SqueezeNet model on a flower image dataset. This would demonstrate its small size and how it learns to recognize patterns despite its efficiency. @@ -566,7 +569,6 @@ Symmetric clipping ranges are the most widely adopted in practice as they have t Asymmetric quantization maps real values to an asymmetrical clipping range that isn't necessarily centered around 0, as shown in @fig-quantization-symmetry on the right. It involves choosing a range [$\alpha$, $\beta$] where $\alpha \neq -\beta$. For example, selecting a range based on the minimum and maximum real values, or where $\alpha = r_{min}$ and $\beta = r_{max}$, creates an asymmetric range. Typically, asymmetric quantization produces tighter clipping ranges compared to symmetric quantization, which is important when target weights and activations are imbalanced, e.g., the activation after the ReLU always has non-negative values. Despite producing tighter clipping ranges, asymmetric quantization is less preferred to symmetric quantization as it doesn't always zero out the real value zero. - ![Quantization (a)symmetry. Source: @gholami2021survey.](images/png/efficientnumerics_symmetry.png){#fig-quantization-symmetry} #### Granularity @@ -594,17 +596,49 @@ Between the two, calculating the range dynamically usually is very costly, so mo ### Techniques -The two prevailing techniques for quantizing models are Post Training Quantization and Quantization-Aware Training. +When optimizing machine learning models for deployment, various quantization techniques are used to balance model efficiency, accuracy, and adaptability. Each method---post-training quantization, quantization-aware training, and dynamic quantization--offers unique advantages and trade-offs, impacting factors such as implementation complexity, computational overhead, and performance optimization. + +@tbl-quantization_methods provides an overview of these quantization methods, highlighting their respective strengths, limitations, and trade-offs. We will delve deeper into each of these methods because they are widely deployed and used across all ML systems of wildly different scales. 
+ ++------------------------------+------------------------------+------------------------------+------------------------------+ +| Aspect | Post Training Quantization | Quantization-Aware Training | Dynamic Quantization | ++:=============================+:=============================+:=============================+:=============================+ +| **Pros** | | | | ++------------------------------+------------------------------+------------------------------+------------------------------+ +| Simplicity | ✓ | ✗ | ✗ | ++------------------------------+------------------------------+------------------------------+------------------------------+ +| Accuracy Preservation | ✗ | ✓ | ✓ | ++------------------------------+------------------------------+------------------------------+------------------------------+ +| Adaptability | ✗ | ✗ | ✓ | ++------------------------------+------------------------------+------------------------------+------------------------------+ +| Optimized Performance | ✗ | ✓ | Potentially | ++------------------------------+------------------------------+------------------------------+------------------------------+ +| **Cons** | | | | ++------------------------------+------------------------------+------------------------------+------------------------------+ +| Accuracy Degradation | ✓ | ✗ | Potentially | ++------------------------------+------------------------------+------------------------------+------------------------------+ +| Computational Overhead | ✗ | ✓ | ✓ | ++------------------------------+------------------------------+------------------------------+------------------------------+ +| Implementation Complexity | ✗ | ✓ | ✓ | ++------------------------------+------------------------------+------------------------------+------------------------------+ +| **Tradeoffs** | | | | ++------------------------------+------------------------------+------------------------------+------------------------------+ +| Speed vs. Accuracy | ✓ | ✗ | ✗ | ++------------------------------+------------------------------+------------------------------+------------------------------+ +| Accuracy vs. Cost | ✗ | ✓ | ✗ | ++------------------------------+------------------------------+------------------------------+------------------------------+ +| Adaptability vs. Overhead | ✗ | ✗ | ✓ | ++------------------------------+------------------------------+------------------------------+------------------------------+ + +: Comparison of post-training quantization, quantization-aware training, and dynamic quantization. {#tbl-quantization_methods .striped .hover} **Post Training Quantization:** Post-training quantization (PTQ) is a quantization technique where the model is quantized after it has been trained. The model is trained in floating point and then weights and activations are quantized as a post-processing step. This is the simplest approach and does not require access to the training data. Unlike Quantization-Aware Training (QAT), PTQ sets weight and activation quantization parameters directly, making it low-overhead and suitable for limited or unlabeled data situations. However, not readjusting the weights after quantizing, especially in low-precision quantization can lead to very different behavior and thus lower accuracy. To tackle this, techniques like bias correction, equalizing weight ranges, and adaptive rounding methods have been developed. PTQ can also be applied in zero-shot scenarios, where no training or testing data are available. 
This method has been made even more efficient to benefit compute- and memory- intensive large language models. Recently, SmoothQuant, a training-free, accuracy-preserving, and general-purpose PTQ solution which enables 8-bit weight, 8-bit activation quantization for LLMs, has been developed, demonstrating up to 1.56x speedup and 2x memory reduction for LLMs with negligible loss in accuracy [[@xiao2022smoothquant]](https://arxiv.org/abs/2211.10438). - In PTQ, a pretrained model undergoes a calibration process, as shown in @fig-PTQ-diagram. Calibration involves using a separate dataset known as calibration data, a specific subset of the training data reserved for quantization to help find the appropriate clipping ranges and scaling factors. ![Post-Training Quantization and calibration. Source: @gholami2021survey.](images/png/efficientnumerics_PTQ.png){#fig-PTQ-diagram} -**Quantization-Aware Training:** Quantization-aware training (QAT) is a fine-tuning of the PTQ model. The model is trained aware of quantization, allowing it to adjust for quantization effects. This produces better accuracy with quantized inference. Quantizing a trained neural network model with methods such as PTQ introduces perturbations that can deviate the model from its original convergence point. For instance, Krishnamoorthi showed that even with per-channel quantization, networks like MobileNet do not reach baseline accuracy with int8 Post Training Quantization (PTQ) and require Quantization-Aware Training (QAT) [[@krishnamoorthi2018quantizing]](https://arxiv.org/abs/1806.08342).To address this, QAT retrains the model with quantized parameters, employing forward and backward passes in floating point but quantizing parameters after each gradient update. Handling the non-differentiable quantization operator is crucial; a widely used method is the Straight Through Estimator (STE), approximating the rounding operation as an identity function. While other methods and variations exist, STE remains the most commonly used due to its practical effectiveness. -In QAT, a pretrained model is quantized and then finetuned using training data to adjust parameters and recover accuracy degradation, as shown in @fig-QAT-diagram. The calibration process is often conducted in parallel with the finetuning process for QAT. +**Quantization-Aware Training:** Quantization-aware training (QAT) is a fine-tuning of the PTQ model. The model is trained aware of quantization, allowing it to adjust for quantization effects. This produces better accuracy with quantized inference. Quantizing a trained neural network model with methods such as PTQ introduces perturbations that can deviate the model from its original convergence point. For instance, Krishnamoorthi showed that even with per-channel quantization, networks like MobileNet do not reach baseline accuracy with int8 PTQ and require QAT [[@krishnamoorthi2018quantizing]](https://arxiv.org/abs/1806.08342).To address this, QAT retrains the model with quantized parameters, employing forward and backward passes in floating point but quantizing parameters after each gradient update. Handling the non-differentiable quantization operator is crucial; a widely used method is the Straight Through Estimator (STE), approximating the rounding operation as an identity function. While other methods and variations exist, STE remains the most commonly used due to its practical effectiveness. 
In QAT, a pretrained model is quantized and then finetuned using training data to adjust parameters and recover accuracy degradation, as shown in @fig-QAT-diagram. The calibration process is often conducted in parallel with the finetuning process for QAT. ![Quantization-Aware Training. Source: @gholami2021survey.](images/png/efficientnumerics_QAT.png){#fig-QAT-diagram} @@ -616,21 +650,6 @@ Quantization-Aware Training serves as a natural extension of Post-Training Quant ![Relative accuracies of PTQ and QAT. Source: @wu2020integer.](images/png/efficientnumerics_PTQQATsummary.png){#fig-quantization-methods-summary} -| **Aspect** | **Post Training Quantization** | **Quantization-Aware Training** | **Dynamic Quantization** | -|:------------------------------|:------------------------------|:------------------------------|:------------------------------| -| **Pros** | | | | -| Simplicity | ✓ | ✗ | ✗ | -| Accuracy Preservation | ✗ | ✓ | ✓ | -| Adaptability | ✗ | ✗ | ✓ | -| Optimized Performance | ✗ | ✓ | Potentially | -| **Cons** | | | | -| Accuracy Degradation| ✓ | ✗ | Potentially | -| Computational Overhead | ✗ | ✓ | ✓ | -| Implementation Complexity | ✗ | ✓ | ✓ | -| **Tradeoffs** | | | | -| Speed vs. Accuracy |✓ | ✗ | ✗ | -| Accuracy vs. Cost | ✗ | ✓ | ✗ | -| Adaptability vs. Overhead | ✗ | ✗ | ✓ | ### Weights vs. Activations @@ -688,7 +707,6 @@ Efficient hardware implementation transcends the selection of suitable component Focusing only on the accuracy when performing Neural Architecture Search leads to models that are exponentially complex and require increasing memory and compute. This has lead to hardware constraints limiting the exploitation of the deep learning models at their full potential. Manually designing the architecture of the model is even harder when considering the hardware variety and limitations. This has lead to the creation of Hardware-aware Neural Architecture Search that incorporate the hardware contractions into their search and optimize the search space for a specific hardware and accuracy. HW-NAS can be categorized based how it optimizes for hardware. We will briefly explore these categories and leave links to related papers for the interested reader. - #### Single Target, Fixed Platform Configuration The goal here is to find the best architecture in terms of accuracy and hardware efficiency for one fixed target hardware. For a specific hardware, the Arduino Nicla Vision for example, this category of HW-NAS will look for the architecture that optimizes accuracy, latency, energy consumption, etc. @@ -783,10 +801,11 @@ In a contrasting approach, hardware can be custom-designed around software requi ![Delegating data processing to an FPGA. Source: @kwon2021hardwaresoftware.](images/png/modeloptimization_preprocessor.png){#fig-fpga-preprocessing} - #### SplitNets -SplitNets were introduced in the context of Head-Mounted systems. They distribute the Deep Neural Networks (DNNs) workload among camera sensors and an aggregator. This is particularly compelling the in context of TinyML. The SplitNet framework is a split-aware NAS to find the optimal neural network architecture to achieve good accuracy, split the model among the sensors and the aggregator, and minimize the communication between the sensors and the aggregator. @fig-splitnet-performance demonstrates how SplitNets (in red) achieves higher accuracy for lower latency (running on ImageNet) than different approaches, such as running the DNN on-sensor (All-on-sensor; in green) or on mobile (All-on-aggregator; in blue). 
Minimal communication is important in TinyML where memory is highly constrained, this way the sensors conduct some of the processing on their chips and then they send only the necessary information to the aggregator. When testing on ImageNet, SplitNets were able to reduce the latency by one order of magnitude on head-mounted devices. This can be helpful when the sensor has its own chip. [@dong2022splitnets]
+SplitNets were introduced in the context of head-mounted systems. They distribute the Deep Neural Networks (DNNs) workload among camera sensors and an aggregator. This is particularly compelling in the context of TinyML. The SplitNet framework is a split-aware NAS to find the optimal neural network architecture to achieve good accuracy, split the model among the sensors and the aggregator, and minimize the communication between the sensors and the aggregator.
+
+@fig-splitnet-performance demonstrates how SplitNets (in red) achieve higher accuracy at lower latency (running on ImageNet) than different approaches, such as running the DNN on-sensor (All-on-sensor; in green) or on mobile (All-on-aggregator; in blue). Minimal communication is important in TinyML, where memory is highly constrained; this way, the sensors conduct some of the processing on their chips and then send only the necessary information to the aggregator. When tested on ImageNet, SplitNets reduced latency by an order of magnitude on head-mounted devices. This can be helpful when the sensor has its own chip. [@dong2022splitnets]

![SplitNets vs other approaches. Source: @dong2022splitnets.](images/png/modeloptimization_SplitNets.png){#fig-splitnet-performance}

@@ -804,9 +823,9 @@ Without the extensive software innovation across frameworks, optimization tools

Major machine learning frameworks like TensorFlow, PyTorch, and MXNet provide libraries and APIs to allow common model optimization techniques to be applied without requiring custom implementations. For example, TensorFlow offers the TensorFlow Model Optimization Toolkit which contains modules like:

-* [quantization](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/quantization/keras/quantize_model) - Applies quantization-aware training to convert floating point models to lower precision like int8 with minimal accuracy loss. Handles weight and activation quantization.
-* [sparsity](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/sparsity/keras) - Provides pruning APIs to induce sparsity and remove unnecessary connections in models like neural networks. Can prune weights, layers, etc.
-* [clustering](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/clustering) - Supports model compression by clustering weights into groups for higher compression rates.
+* **[Quantization](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/quantization/keras/quantize_model)**: Applies quantization-aware training to convert floating point models to lower precision like int8 with minimal accuracy loss. Handles weight and activation quantization.
+* **[Sparsity](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/sparsity/keras)**: Provides pruning APIs to induce sparsity and remove unnecessary connections in models like neural networks. Can prune weights, layers, etc.
+* **[Clustering](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/clustering)**: Supports model compression by clustering weights into groups for higher compression rates (see the sketch following this list).
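As a sketch of how these modules are typically invoked, the example below applies magnitude pruning with the sparsity API. The 50% sparsity target, the schedule steps, and the `model` and `train_ds` objects are illustrative assumptions, not a tuned recipe.

```python
import tensorflow_model_optimization as tfmot

# Wrap an existing Keras model with pruning, ramping sparsity from 0% to 50%
# over the course of finetuning (schedule values here are illustrative).
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.5,
    begin_step=0, end_step=1000)

pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)   # model is assumed to exist

pruned_model.compile(optimizer="adam",
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])

# The UpdatePruningStep callback advances the pruning schedule each training step.
pruned_model.fit(train_ds, epochs=2,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers before export so only the sparse weights remain.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
```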
These APIs allow users to enable optimization techniques like quantization and pruning without directly modifying model code. Parameters like target sparsity rates, quantization bit-widths etc. can be configured. Similarly, PyTorch provides torch.quantization for converting models to lower precision representations. TorchTensor and TorchModule form the base classes for quantization support. It also offers torch.nn.utils.prune for built-in pruning of models. MXNet offers gluon.contrib layers that add quantization capabilities like fixed point rounding and stochastic rounding of weights/activations during training. This allows quantization to be readily included in gluon models. @@ -816,9 +835,9 @@ The core benefit of built-in optimizations is that users can apply them without Automated optimization tools provided by frameworks can analyze models and automatically apply optimizations like quantization, pruning, and operator fusion to make the process easier and accessible without excessive manual tuning. In effect, this builds on top of the previous section. For example, TensorFlow provides the TensorFlow Model Optimization Toolkit which contains modules like: -* [QuantizationAwareTraining](https://www.tensorflow.org/model_optimization/guide/quantization/training) - Automatically quantizes weights and activations in a model to lower precision like UINT8 or INT8 with minimal accuracy loss. It inserts fake quantization nodes during training so that the model can learn to be quantization-friendly. -* [Pruning](https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras) - Automatically removes unnecessary connections in a model based on analysis of weight importance. Can prune entire filters in convolutional layers or attention heads in transformers. Handles iterative re-training to recover any accuracy loss. -* [GraphOptimizer](https://www.tensorflow.org/guide/graph_optimization) - Applies graph optimizations like operator fusion to consolidate operations and reduce execution latency, especially for inference. In @fig-graph-optimizer, you can see the original (Source Graph) on the left, and how its operations are transformed (consolidated) on the right. Notice how Block1 in Source Graph has 3 separate steps (Convolution, BiasAdd, and Activation), which are then consolidated together in Block1 on Optimized Graph. +* **[QuantizationAwareTraining](https://www.tensorflow.org/model_optimization/guide/quantization/training)**: Automatically quantizes weights and activations in a model to lower precision like UINT8 or INT8 with minimal accuracy loss. It inserts fake quantization nodes during training so that the model can learn to be quantization-friendly. +* **[Pruning](https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras)**: Automatically removes unnecessary connections in a model based on analysis of weight importance. Can prune entire filters in convolutional layers or attention heads in transformers. Handles iterative re-training to recover any accuracy loss. +* **[GraphOptimizer](https://www.tensorflow.org/guide/graph_optimization)**: Applies graph optimizations like operator fusion to consolidate operations and reduce execution latency, especially for inference. In @fig-graph-optimizer, you can see the original (Source Graph) on the left, and how its operations are transformed (consolidated) on the right. 
Notice how Block1 in Source Graph has 3 separate steps (Convolution, BiasAdd, and Activation), which are then consolidated together in Block1 on Optimized Graph. ![GraphOptimizer. Source: @annette2020.](./images/png/source_opt.png){#fig-graph-optimizer} diff --git a/contents/privacy_security/privacy_security.qmd b/contents/privacy_security/privacy_security.qmd index 5e080a9f..7284990f 100644 --- a/contents/privacy_security/privacy_security.qmd +++ b/contents/privacy_security/privacy_security.qmd @@ -338,7 +338,7 @@ Various physical tampering techniques can be used for fault injection. Low volta For ML systems, consequences include impaired model accuracy, denial of service, extraction of private training data or model parameters, and reverse engineering of model architectures. Attackers could use fault injection to force misclassifications, disrupt autonomous systems, or steal intellectual property. -For example, in [@breier2018deeplaser], the authors successfully injected a fault attack into a deep neural network deployed on a microcontroller. They used a laser to heat specific transistors, forcing them to switch states. In one instance, they used this method to attack a ReLU activation function, resulting in the function always outputting a value of 0, regardless of the input. In the assembly code in @fig-injection, the attack caused the executing program always to skip the `jmp end` instruction on line 6. This means that `HiddenLayerOutput[i]` is always set to 0, overwriting any values written to it on lines 4 and 5. As a result, the targeted neurons are rendered inactive, resulting in misclassifications. +For example, @breier2018deeplaser successfully injected a fault attack into a deep neural network deployed on a microcontroller. They used a laser to heat specific transistors, forcing them to switch states. In one instance, they used this method to attack a ReLU activation function, resulting in the function always outputting a value of 0, regardless of the input. In the assembly code shown in @fig-injection, the attack caused the executing program always to skip the `jmp end` instruction on line 6. This means that `HiddenLayerOutput[i]` is always set to 0, overwriting any values written to it on lines 4 and 5. As a result, the targeted neurons are rendered inactive, resulting in misclassifications. ![Fault-injection demonstrated with assembly code. Source: @breier2018deeplaser.](images/png/Fault-injection_demonstrated_with_assembly_code.png){#fig-injection} diff --git a/contents/responsible_ai/responsible_ai.qmd b/contents/responsible_ai/responsible_ai.qmd index 333e83c1..e6ebf4cd 100644 --- a/contents/responsible_ai/responsible_ai.qmd +++ b/contents/responsible_ai/responsible_ai.qmd @@ -338,7 +338,7 @@ To ensure that models keep up to date with such changes in the real world, devel ### Organizational and Cultural Structures -While innovation and regulation are often seen as having competing interests, many countries have found it necessary to provide oversight as AI systems expand into more sectors. As illustrated in @fig-human-centered-ai, this oversight has become crucial as these systems continue permeating various industries and impacting people's lives (see [Human-Centered AI, Chapter 8 "Government Interventions and Regulations"](https://academic-oup-com.ezp-prod1.hul.harvard.edu/book/41126/chapter/350465542). 

+While innovation and regulation are often seen as having competing interests, many countries have found it necessary to provide oversight as AI systems expand into more sectors. As shown in @fig-human-centered-ai, this oversight has become crucial as these systems continue permeating various industries and impacting people's lives (see [Human-Centered AI, Chapter 8 "Government Interventions and Regulations"](https://academic-oup-com.ezp-prod1.hul.harvard.edu/book/41126/chapter/350465542)).

![How various groups impact human-centered AI. Source: @schneiderman2020.](images/png/human_centered_ai.png){#fig-human-centered-ai}

diff --git a/contents/robust_ai/robust_ai.qmd b/contents/robust_ai/robust_ai.qmd
index a414342e..29f8b34a 100644
--- a/contents/robust_ai/robust_ai.qmd
+++ b/contents/robust_ai/robust_ai.qmd
@@ -230,10 +230,10 @@ Intermittent faults can arise from several causes, both internal and external, t

Manufacturing defects or process variations can also introduce intermittent faults, where marginal or borderline components may exhibit sporadic failures under specific conditions, as shown in [@fig-intermittent-fault-dram](#kix.7lswkjecl7ra).

-Environmental factors, such as temperature fluctuations, humidity, or vibrations, can trigger intermittent faults by altering the electrical characteristics of the components. Loose or degraded connections, such as those in connectors or printed circuit boards, can cause intermittent faults.
-
![Residue induced intermittent fault in a DRAM chip. Source: [Hynix Semiconductor](https://ieeexplore.ieee.org/document/4925824)](./images/png/intermittent_fault_dram.png){#fig-intermittent-fault-dram}

+Environmental factors, such as temperature fluctuations, humidity, or vibrations, can trigger intermittent faults by altering the electrical characteristics of the components. Loose or degraded connections, such as those in connectors or printed circuit boards, can cause intermittent faults.
+
#### Mechanisms of Intermittent Faults

Intermittent faults can manifest through various mechanisms, depending on the underlying cause and the affected component. One mechanism is the intermittent open or short circuit, where a signal path or connection becomes temporarily disrupted or shorted, causing erratic behavior. Another mechanism is the intermittent delay fault [@zhang2018thundervolt], where the timing of signals or propagation delays becomes inconsistent, leading to synchronization issues or incorrect computations. Intermittent faults can manifest as transient bit flips or soft errors in memory cells or registers, causing data corruption or incorrect program execution.

@@ -398,12 +398,12 @@ The landscape of machine learning models is complex and broad, especially given

#### Mechanisms of Adversarial Attacks

-![Gradient-Based Attacks. Source: [Ivezic](https://defence.ai/ai-security/gradient-based-attacks/)](./images/png/gradient_attack.png){#fig-gradient-attack}
-
**Gradient-based Attacks**

One prominent category of adversarial attacks is gradient-based attacks. These attacks leverage the gradients of the ML model's loss function to craft adversarial examples. The [Fast Gradient Sign Method](https://www.tensorflow.org/tutorials/generative/adversarial_fgsm) (FGSM) is a well-known technique in this category. FGSM perturbs the input data by adding small noise in the gradient direction, aiming to maximize the model's prediction error. 
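In code, the core of FGSM is a single gradient step. The sketch below is a minimal PyTorch version, assuming a classifier `model`, an input batch `x` with labels `y`, and an illustrative `epsilon`; it is meant to convey the mechanism rather than serve as a hardened attack implementation.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """One-step FGSM: perturb x in the direction of the loss gradient sign."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Single step in the gradient (sign) direction to increase the loss.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()   # keep pixels in a valid range
```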
FGSM can quickly generate adversarial examples, as shown in [@fig-gradient-attack], by taking a single step in the gradient direction. +![Gradient-Based Attacks. Source: [Ivezic](https://defence.ai/ai-security/gradient-based-attacks/)](./images/png/gradient_attack.png){#fig-gradient-attack} + Another variant, the Projected Gradient Descent (PGD) attack, extends FGSM by iteratively applying the gradient update step, allowing for more refined and powerful adversarial examples. The Jacobian-based Saliency Map Attack (JSMA) is another gradient-based approach that identifies the most influential input features and perturbs them to create adversarial examples. **Optimization-based Attacks** @@ -453,12 +453,12 @@ As adversarial machine learning evolves, researchers explore new attack mechanis Adversarial attacks on machine learning systems have emerged as a significant concern in recent years, highlighting the potential vulnerabilities and risks associated with the widespread adoption of ML technologies. These attacks involve carefully crafted perturbations to input data that can deceive or mislead ML models, leading to incorrect predictions or misclassifications, as shown in [@fig-adversarial-googlenet]. The impact of adversarial attacks on ML systems is far-reaching and can have serious consequences in various domains. +![Adversarial example generation applied to GoogLeNet (Szegedy et al., 2014a) on ImageNet. Source: [Goodfellow](https://arxiv.org/abs/1412.6572)](./images/png/adversarial_googlenet.png){#fig-adversarial-googlenet} + One striking example of the impact of adversarial attacks was demonstrated by researchers in 2017. They experimented with small black and white stickers on stop signs [@eykholt2018robust]. To the human eye, these stickers did not obscure the sign or prevent its interpretability. However, when images of the sticker-modified stop signs were fed into standard traffic sign classification ML models, a shocking result emerged. The models misclassified the stop signs as speed limit signs over 85% of the time. This demonstration shed light on the alarming potential of simple adversarial stickers to trick ML systems into misreading critical road signs. The implications of such attacks in the real world are significant, particularly in the context of autonomous vehicles. If deployed on actual roads, these adversarial stickers could cause self-driving cars to misinterpret stop signs as speed limits, leading to dangerous situations, as shown in [@fig-graffiti]. Researchers warned that this could result in rolling stops or unintended acceleration into intersections, endangering public safety. -![Adversarial example generation applied to GoogLeNet (Szegedy et al., 2014a) on ImageNet. Source: [Goodfellow](https://arxiv.org/abs/1412.6572)](./images/png/adversarial_googlenet.png){#fig-adversarial-googlenet} - ![Graffiti on a stop sign tricked a self-driving car into thinking it was a 45 mph speed limit sign. Source: [Eykholt](https://arxiv.org/abs/1707.08945)](./images/png/graffiti.png){#fig-graffiti} The case study of the adversarial stickers on stop signs provides a concrete illustration of how adversarial examples exploit how ML models recognize patterns. By subtly manipulating the input data in ways that are invisible to humans, attackers can induce incorrect predictions and create serious risks, especially in safety-critical applications like autonomous vehicles. 
The attack's simplicity highlights the vulnerability of ML models to even minor changes in the input, emphasizing the need for robust defenses against such threats. @@ -537,10 +537,10 @@ Data poisoning attacks can be carried out through various mechanisms, exploiting Each of these mechanisms presents unique challenges and requires different mitigation strategies. For example, detecting label manipulation may involve analyzing the distribution of labels and identifying anomalies [@zhou2018learning], while preventing feature manipulation may require secure data preprocessing and anomaly detection techniques [@carta2020local]. Defending against insider threats may involve strict access control policies and monitoring of data access patterns. Moreover, the effectiveness of data poisoning attacks often depends on the attacker's knowledge of the ML system, including the model architecture, training algorithms, and data distribution. Attackers may use adversarial machine learning or data synthesis techniques to craft samples that are more likely to bypass detection and achieve their malicious objectives. -![Garbage In -- Garbage Out. Source: [Information Matters](https://informationmatters.net/data-poisoning-ai/)](./images/png/distribution_shift_example.png){#fig-distribution-shift-example} - **Modifying training data labels:** One of the most straightforward mechanisms of data poisoning is modifying the training data labels. In this approach, the attacker selectively changes the labels of a subset of the training samples to mislead the model's learning process as shown in [@fig-distribution-shift-example]. For example, in a binary classification task, the attacker might flip the labels of some positive samples to negative, or vice versa. By introducing such label noise, the attacker degrades the model's performance or cause it to make incorrect predictions for specific target instances. +![Garbage In -- Garbage Out. Source: [Information Matters](https://informationmatters.net/data-poisoning-ai/)](./images/png/distribution_shift_example.png){#fig-distribution-shift-example} + **Altering feature values in training data:** Another mechanism of data poisoning involves altering the feature values of the training samples without modifying the labels. The attacker carefully crafts the feature values to introduce specific biases or vulnerabilities into the model. For instance, in an image classification task, the attacker might add imperceptible perturbations to a subset of images, causing the model to learn a particular pattern or association. This type of poisoning can create backdoors or trojans in the trained model, which specific input patterns can trigger. **Injecting carefully crafted malicious samples:** In this mechanism, the attacker creates malicious samples designed to poison the model. These samples are crafted to have a specific impact on the model's behavior while blending in with the legitimate training data. The attacker might use techniques such as adversarial perturbations or data synthesis to generate poisoned samples that are difficult to detect. The attacker manipulates the model's decision boundaries by injecting these malicious samples into the training data or introducing targeted misclassifications. @@ -549,10 +549,10 @@ Each of these mechanisms presents unique challenges and requires different mitig **Manipulating data at the source (e.g., sensor data):** In some cases, attackers can manipulate the data at its source, such as sensor data or input devices. 
By tampering with the sensors or manipulating the environment in which data is collected, attackers can introduce poisoned samples or bias the data distribution. For instance, in a self-driving car scenario, an attacker might manipulate the sensors or the environment to feed misleading information into the training data, compromising the model's ability to make safe and reliable decisions. -![Data Poisoning Attack. Source: [Sikandar](https://www.researchgate.net/publication/366883200_A_Detailed_Survey_on_Federated_Learning_Attacks_and_Defenses)](./images/png/poisoning_attack_example.png){#fig-poisoning-attack-example} - **Poisoning data in online learning scenarios:** Data poisoning attacks can also target ML systems that employ online learning, where the model is continuously updated with new data in real time. In such scenarios, an attacker can gradually inject poisoned samples over time, slowly manipulating the model's behavior. Online learning systems are particularly vulnerable to data poisoning because they adapt to new data without extensive validation, making it easier for attackers to introduce malicious samples, as shown in [@fig-poisoning-attack-example]. +![Data Poisoning Attack. Source: [Sikandar](https://www.researchgate.net/publication/366883200_A_Detailed_Survey_on_Federated_Learning_Attacks_and_Defenses)](./images/png/poisoning_attack_example.png){#fig-poisoning-attack-example} + **Collaborating with insiders to manipulate data:** Sometimes, data poisoning attacks can involve collaboration with insiders with access to the training data. Malicious insiders, such as employees or data providers, can manipulate the data before it is used to train the model. Insider threats are particularly challenging to detect and prevent, as the attackers have legitimate access to the data and can carefully craft the poisoning strategy to evade detection. These are the key mechanisms of data poisoning in ML systems. Attackers often employ these mechanisms to make their attacks more effective and harder to detect. The risk of data poisoning attacks grows as ML systems become increasingly complex and rely on larger datasets from diverse sources. Defending against data poisoning requires a multifaceted approach. ML practitioners and system designers must be aware of the various mechanisms of data poisoning and adopt a comprehensive approach to data security and model resilience. This includes secure data collection, robust data validation, and continuous model performance monitoring. Implementing secure data collection and preprocessing practices is crucial to prevent data poisoning at the source. Data validation and anomaly detection techniques can also help identify and mitigate potential poisoning attempts. Monitoring model performance for signs of data poisoning is also essential to detect and respond to attacks promptly. @@ -577,10 +577,10 @@ Addressing the impact of data poisoning requires a proactive approach to data se ##### Case Study -![Samples of dirty-label poison data regarding mismatched text/image pairs. Source: [Shan](https://arxiv.org/pdf/2310.13828)](./images/png/dirty_label_example.png){#fig-dirty-label-example} - Interestingly enough, data poisoning attacks are not always malicious [@shan2023prompt]. Nightshade, a tool developed by a team led by Professor Ben Zhao at the University of Chicago, utilizes data poisoning to help artists protect their art against scraping and copyright violations by generative AI models. 
Artists can use the tool to make subtle modifications to their images before uploading them online, as shown in [@fig-dirty-label-example]. +![Samples of dirty-label poison data regarding mismatched text/image pairs. Source: [Shan](https://arxiv.org/pdf/2310.13828)](./images/png/dirty_label_example.png){#fig-dirty-label-example} + While these changes are indiscernible to the human eye, they can significantly disrupt the performance of generative AI models when incorporated into the training data. Generative models can be manipulated to generate hallucinations and weird images. For example, with only 300 poisoned images, the University of Chicago researchers could trick the latest Stable Diffusion model into generating images of dogs that look like cats or images of cows when prompted for cars. As the number of poisoned images on the internet increases, the performance of the models that use scraped data will deteriorate exponentially. First, the poisoned data is hard to detect and requires manual elimination. Second, the "poison" spreads quickly to other labels because generative models rely on connections between words and concepts as they generate images. So a poisoned image of a "car" could spread into generated images associated with words like "truck\," "train\," " bus\," etc. @@ -618,14 +618,14 @@ The key characteristics of distribution shift include: **Unrepresentative training data:** The training data may only partially capture the variability and diversity of the real-world data encountered during deployment. Unrepresentative training data can lead to biased or skewed models that perform poorly on real-world data. Suppose the training data needs to capture the variability and diversity of the real-world data adequately. In that case, the model may learn patterns specific to the training set but needs to generalize better to new, unseen data. This can result in poor performance, biased predictions, and limited model applicability. For instance, if a facial recognition model is trained primarily on images of individuals from a specific demographic group, it may struggle to accurately recognize faces from other demographic groups when deployed in a real-world setting. Ensuring that the training data is representative and diverse is crucial for building models that can generalize well to real-world scenarios. -![Concept drift refers to a change in data patterns and relationships over time. Source: [Evidently AI](https://www.evidentlyai.com/ml-in-production/concept-drift)](./images/png/drift_over_time.png){#fig-drift-over-time} - Distribution shift can manifest in various forms, such as: **Covariate shift:** The distribution of the input features (covariates) changes while the conditional distribution of the target variable given the input remains the same. Covariate shift matters because it can impact the model's ability to make accurate predictions when the input features (covariates) differ between the training and test data. Even if the relationship between the input features and the target variable remains the same, a change in the distribution of the input features can affect the model's performance. For example, consider a model trained to predict housing prices based on features like square footage, number of bedrooms, and location. Suppose the distribution of these features in the test data significantly differs from the training data (e.g., the test data contains houses with much larger square footage). In that case, the model's predictions may become less accurate. 
Addressing covariate shifts is important to ensure the model's robustness and reliability when applied to new data. **Concept drift:** The relationship between the input features and the target variable changes over time, altering the underlying concept the model is trying to learn, as shown in [@fig-drift-over-time]. Concept drift is important because it indicates changes in the fundamental relationship between the input features and the target variable over time. When the underlying concept that the model is trying to learn shifts, its performance can deteriorate if not adapted to the new concept. For instance, in a customer churn prediction model, the factors influencing customer churn may evolve due to market conditions, competitor offerings, or customer preferences. If the model is not updated to capture these changes, its predictions may become less accurate and irrelevant. Detecting and adapting to concept drift is crucial to maintaining the model's effectiveness and alignment with evolving real-world concepts. +![Concept drift refers to a change in data patterns and relationships over time. Source: [Evidently AI](https://www.evidentlyai.com/ml-in-production/concept-drift)](./images/png/drift_over_time.png){#fig-drift-over-time} + **Domain generalization:** The model must generalize to unseen domains or distributions not present during training. Domain generalization is important because it enables ML models to be applied to new, unseen domains without requiring extensive retraining or adaptation. In real-world scenarios, training data that covers all possible domains or distributions that the model may encounter is often infeasible. Domain generalization techniques aim to learn domain-invariant features or models that can generalize well to new domains. For example, consider a model trained to classify images of animals. If the model can learn features invariant to different backgrounds, lighting conditions, or poses, it can generalize well to classify animals in new, unseen environments. Domain generalization is crucial for building models that can be deployed in diverse and evolving real-world settings. The presence of a distribution shift can significantly impact the performance and reliability of ML models, as the models may need help generalizing well to the new data distribution. Detecting and adapting to distribution shifts is crucial to ensure ML systems' robustness and practical utility in real-world scenarios. @@ -714,18 +714,18 @@ Practitioners can develop more robust and resilient ML systems by leveraging the Recall that data poisoning is an attack that targets the integrity of the training data used to build ML models. By manipulating or corrupting the training data, attackers can influence the model's behavior and cause it to make incorrect predictions or perform unintended actions. Detecting and mitigating data poisoning attacks is crucial to ensure the trustworthiness and reliability of ML systems, as shown in [@fig-adversarial-attack-injection]. -##### Anomaly Detection Techniques for Identifying Poisoned Data - ![Malicious data injection. Source: [Li](https://www.mdpi.com/2227-7390/12/2/247)](./images/png/adversarial_attack_injection.png){#fig-adversarial-attack-injection} +##### Anomaly Detection Techniques for Identifying Poisoned Data + Statistical outlier detection methods identify data points that deviate significantly from most data. These methods assume that poisoned data instances are likely to be statistical outliers. 
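A minimal sketch of this idea, assuming the training features are available as a NumPy array, is shown below; the three-standard-deviation threshold is a common but illustrative choice.

```python
import numpy as np

def zscore_outliers(features, threshold=3.0):
    """Flag rows whose maximum per-feature z-score exceeds the threshold."""
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-12            # avoid division by zero
    z = np.abs((features - mean) / std)
    return np.where(z.max(axis=1) > threshold)[0]  # indices of suspect samples

# Example: 500 clean points plus a handful of extreme (potentially poisoned) ones.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, size=(500, 8)),
                  rng.normal(12, 1, size=(5, 8))])
print(zscore_outliers(data))   # expected to flag the last five rows
```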
Techniques such as the [Z-score method](https://ubalt.pressbooks.pub/mathstatsguides/chapter/z-score-basics/), [Tukey's method](https://www.itl.nist.gov/div898/handbook/prc/section4/prc471.htm), or the [Mahalanobis distance](https://www.statisticshowto.com/mahalanobis-distance/) can be used to measure the deviation of each data point from the central tendency of the dataset. Data points that exceed a predefined threshold are flagged as potential outliers and considered suspicious for data poisoning. Clustering-based methods group similar data points together based on their features or attributes. The assumption is that poisoned data instances may form distinct clusters or lie far away from the normal data clusters. By applying clustering algorithms like [K-means](https://www.oreilly.com/library/view/data-algorithms/9781491906170/ch12.html), [DBSCAN](https://www.oreilly.com/library/view/machine-learning-algorithms/9781789347999/50efb27d-abbe-4855-ad81-a5357050161f.xhtml), or [hierarchical clustering](https://www.oreilly.com/library/view/cluster-analysis-5th/9780470978443/chapter04.html), anomalous clusters or data points that do not belong to any cluster can be identified. These anomalous instances are then treated as potentially poisoned data. -![Autoencoder. Source: [Dertat](https://towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d798)](./images/png/autoencoder.png){#fig-autoencoder} - Autoencoders are neural networks trained to reconstruct the input data from a compressed representation, as shown in [@fig-autoencoder]. They can be used for anomaly detection by learning the normal patterns in the data and identifying instances that deviate from them. During training, the autoencoder is trained on clean, unpoisoned data. At inference time, the reconstruction error for each data point is computed. Data points with high reconstruction errors are considered abnormal and potentially poisoned, as they do not conform to the learned normal patterns. +![Autoencoder. Source: [Dertat](https://towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d798)](./images/png/autoencoder.png){#fig-autoencoder} + ##### Data Sanitization and Preprocessing Techniques Data poisoning can be avoided by cleaning data, which involves identifying and removing or correcting noisy, incomplete, or inconsistent data points. Techniques such as data deduplication, missing value imputation, and outlier removal can be applied to improve the quality of the training data. By eliminating or filtering out suspicious or anomalous data points, the impact of poisoned instances can be reduced. @@ -770,10 +770,10 @@ In addition, domain classifiers are trained to distinguish between different dom ##### Mitigation Techniques for Distribution Shifts -![Transfer learning. Source: [Bhavsar](https://medium.com/modern-nlp/transfer-learning-in-nlp-f5035cc3f62f)](./images/png/transfer_learning.png){#fig-transfer-learning} - Transfer learning leverages knowledge gained from one domain to improve performance in another, as shown in [@fig-transfer-learning]. By using pre-trained models or transferring learned features from a source domain to a target domain, transfer learning can help mitigate the impact of distribution shifts. The pre-trained model can be fine-tuned on a small amount of labeled data from the target domain, allowing it to adapt to the new distribution. 
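A rough sketch of this recipe in Keras is shown below; the base architecture, input size, two-class head, learning rates, and the `target_train_ds` dataset are illustrative assumptions rather than a prescribed setup.

```python
import tensorflow as tf

# Pre-trained feature extractor from the source domain (ImageNet weights).
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet")
base.trainable = False   # freeze source-domain features initially

# Small task-specific head trained on limited target-domain data.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(target_train_ds, epochs=5)          # target_train_ds is assumed

# Optionally unfreeze the base network and finetune at a lower learning rate
# so the pre-trained features adapt to the shifted distribution.
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(target_train_ds, epochs=2)
```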
Transfer learning is particularly effective when the source and target domains share similar characteristics or when labeled data in the target domain is scarce. +![Transfer learning. Source: [Bhavsar](https://medium.com/modern-nlp/transfer-learning-in-nlp-f5035cc3f62f)](./images/png/transfer_learning.png){#fig-transfer-learning} + Continual learning, also known as lifelong learning, enables ML models to learn continuously from new data distributions while retaining knowledge from previous distributions. Techniques such as elastic weight consolidation (EWC) [@kirkpatrick2017overcoming] or gradient episodic memory (GEM) [@lopez2017gradient] allow models to adapt to evolving data distributions over time. These techniques aim to balance the plasticity of the model (ability to learn from new data) with the stability of the model (retaining previously learned knowledge). By incrementally updating the model with new data and mitigating catastrophic forgetting, continual learning helps models stay robust to distribution shifts. Data augmentation techniques, such as those we have seen previously, involve applying transformations or perturbations to the existing training data to increase its diversity and improve the model's robustness to distribution shifts. By introducing variations in the data, such as rotations, translations, scaling, or adding noise, data augmentation helps the model learn invariant features and generalize better to unseen distributions. Data augmentation can be performed during training and inference to improve the model's ability to handle distribution shifts. @@ -868,7 +868,7 @@ Adopting a proactive and systematic approach to fault detection and mitigation c ### Fault Tolerance -Get ready to become an AI fault-fighting superhero! Software glitches can derail machine learning systems, but in this Colab, you'll learn how to make them resilient. We'll simulate software faults to see how AI can break, then explore techniques to save your ML model's progress, like checkpoints in a game. You'll see how to train your AI to bounce back after a crash, ensuring it stays on track. This is crucial for building reliable, trustworthy AI, especially in critical applications. So gear up because this Colab directly connects with the Robust AI chapter – you'll move from theory to hands-on troubleshooting and build AI systems that can handle the unexpected! +Get ready to become an AI fault-fighting superhero! Software glitches can derail machine learning systems, but in this Colab, you'll learn how to make them resilient. We'll simulate software faults to see how AI can break, then explore techniques to save your ML model's progress, like checkpoints in a game. You'll see how to train your AI to bounce back after a crash, ensuring it stays on track. This is crucial for building reliable, trustworthy AI, especially in critical applications. So gear up because this Colab directly connects with the Robust AI chapter---you'll move from theory to hands-on troubleshooting and build AI systems that can handle the unexpected! 
[![](https://colab.research.google.com/assets/colab-badge.png)](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/guide/migrate/fault_tolerance.ipynb#scrollTo=77z2OchJTk0l) ::: @@ -919,8 +919,6 @@ Two of the most common hardware-based fault injection methods are FPGA-based fau **Radiation or Beam Testing:** Radiation or beam testing [@velazco2010combining] involves exposing the hardware running an ML model to high-energy particles, such as protons or neutrons as illustrated in [@fig-beam-testing](#5a77jp776dxi). These particles can cause bitflips or other types of faults in the hardware, mimicking the effects of real-world radiation-induced faults. Beam testing is widely regarded as a highly accurate method for measuring the error rate induced by particle strikes on a running application. It provides a realistic representation of the faults in real-world environments, particularly in applications exposed to high radiation levels, such as space systems or particle physics experiments. However, unlike FPGA-based fault injection, beam testing could be more precise in targeting specific bits or components within the hardware, as it might be difficult to aim the beam of particles to a particular bit in the hardware. Despite being quite expensive from a research standpoint, beam testing is a well-regarded industry practice for reliability. -![](./images/png/image15.png) - ![Radiation test setup for semiconductor components [@lee2022design] Source: [JD Instrument](https://jdinstruments.net/tester-capabilities-radiation-test/)](./images/png/image14.png){#fig-beam-testing} #### Limitations @@ -957,17 +955,17 @@ Software-based fault injection tools also have some limitations compared to hard **Fidelity:** Software-based tools may provide a different level of Fidelity than hardware-based methods in terms of representing real-world fault conditions. The accuracy of the results obtained from software-based fault injection experiments may depend on how closely the software model approximates the actual hardware behavior. -![Comparison of techniques at layers of abstraction. Source: [MAVFI](https://ieeexplore.ieee.org/abstract/document/10315202)](./images/jpg/mavfi.jpg){#fig-mavfi} - ##### Types of Fault Injection Tools Software-based fault injection tools can be categorized based on their target frameworks or use cases. Here, we will discuss some of the most popular tools in each category: Ares [@reagen2018ares], a fault injection tool initially developed for the Keras framework in 2018, emerged as one of the first tools to study the impact of hardware faults on deep neural networks (DNNs) in the context of the rising popularity of ML frameworks in the mid-to-late 2010s. The tool was validated against a DNN accelerator implemented in silicon, demonstrating its effectiveness in modeling hardware faults. Ares provides a comprehensive study on the impact of hardware faults in both weights and activation values, characterizing the effects of single-bit flips and bit-error rates (BER) on hardware structures. Later, the Ares framework was extended to support the PyTorch ecosystem, enabling researchers to investigate hardware faults in a more modern setting and further extending its utility in the field. +PyTorchFI [@mahmoud2020pytorchfi], a fault injection tool specifically designed for the PyTorch framework, was developed in 2020 in collaboration with Nvidia Research. 
It enables the injection of faults into the weights, activations, and gradients of PyTorch models, supporting a wide range of fault models. By leveraging the GPU acceleration capabilities of PyTorch, PyTorchFI provides a fast and efficient implementation for conducting fault injection experiments on large-scale ML systems, as shown in [@fig-phantom-objects](#txkz61sj1mj4). + ![Hardware bitflips in ML workloads can cause phantom objects and misclassifications, which can erroneously be used downstream by larger systems, such as in autonomous driving. Shown above is a correct and faulty version of the same image using the PyTorchFI injection framework.](./images/png/phantom_objects.png){#fig-phantom-objects} -PyTorchFI [@mahmoud2020pytorchfi], a fault injection tool specifically designed for the PyTorch framework, was developed in 2020 in collaboration with Nvidia Research. It enables the injection of faults into the weights, activations, and gradients of PyTorch models, supporting a wide range of fault models. By leveraging the GPU acceleration capabilities of PyTorch, PyTorchFI provides a fast and efficient implementation for conducting fault injection experiments on large-scale ML systems, as shown in [@fig-phantom-objects](#txkz61sj1mj4). The tool's speed and ease of use have led to widespread adoption in the community, resulting in multiple developer-led projects, such as PyTorchALFI by Intel xColabs, which focuses on safety in automotive environments. Follow-up PyTorch-centric tools for fault injection include Dr. DNA by Meta [@ma2024dr] (which further facilitates the Pythonic programming model for ease of use), and the GoldenEye framework [@mahmoud2022dsn], which incorporates novel numerical datatypes (such as AdaptivFloat [@tambe2020algorithm] and [BlockFloat](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) in the context of hardware bit flips. +The tool's speed and ease of use have led to widespread adoption in the community, resulting in multiple developer-led projects, such as PyTorchALFI by Intel xColabs, which focuses on safety in automotive environments. Follow-up PyTorch-centric tools for fault injection include Dr. DNA by Meta [@ma2024dr] (which further facilitates the Pythonic programming model for ease of use), and the GoldenEye framework [@mahmoud2022dsn], which incorporates novel numerical datatypes (such as AdaptivFloat [@tambe2020algorithm] and [BlockFloat](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) in the context of hardware bit flips. TensorFI [@chen2020tensorfi], or the TensorFlow Fault Injector, is a fault injection tool developed specifically for the TensorFlow framework. Analogous to Ares and PyTorchFI, TensorFI is considered the state-of-the-art tool for ML robustness studies in the TensorFlow ecosystem. It allows researchers to inject faults into the computational graph of TensorFlow models and study their impact on the model's performance, supporting a wide range of fault models. One of the key benefits of TensorFI is its ability to evaluate the resilience of various ML models, not just DNNs. Further advancements, such as BinFi [@chen2019sc], provide a mechanism to speed up error injection experiments by focusing on the "important" bits in the system, accelerating the process of ML robustness analysis and prioritizing the critical components of a model. 
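For intuition about what these frameworks automate, the snippet below hand-corrupts a single weight bit in a small PyTorch model and compares outputs before and after the fault. It is a conceptual sketch rather than the actual PyTorchFI, TensorFI, or Ares API, and the layer, index, and bit position are arbitrary choices.

```python
import struct
import torch

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a float32 value (bit 31 is the sign, bits 30-23 the exponent)."""
    as_int = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))[0]

@torch.no_grad()
def inject_weight_fault(model, param_name, index, bit):
    """Corrupt a single weight in place to emulate a hardware bit flip."""
    param = dict(model.named_parameters())[param_name]
    param[index] = flip_bit(float(param[index]), bit)

# Example: flip a high exponent bit of one weight in a tiny conv model.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU())
x = torch.randn(1, 3, 32, 32)
clean_out = model(x)
inject_weight_fault(model, "0.weight", (0, 0, 0, 0), bit=30)
faulty_out = model(x)
print((clean_out - faulty_out).abs().max())   # one flipped bit can dominate the output
```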

diff --git a/contents/sustainable_ai/sustainable_ai.qmd b/contents/sustainable_ai/sustainable_ai.qmd
index 5a721446..cfca4fb5 100644
--- a/contents/sustainable_ai/sustainable_ai.qmd
+++ b/contents/sustainable_ai/sustainable_ai.qmd
@@ -99,7 +99,9 @@ What drives such immense requirements? During training, models like GPT-3 learn

Developing and training AI models requires immense data, computing power, and energy. However, the deployment and operation of those models also incur significant recurrent resource costs over time. AI systems are now integrated across various industries and applications and are entering the daily lives of an increasing demographic. Their cumulative operational energy and infrastructure impacts could eclipse the upfront model training.

-This concept is reflected in the demand for training and inference hardware in data centers and on the edge. Inference refers to using a trained model to make predictions or decisions on real-world data. According to a [recent McKinsey analysis](https://www.mckinsey.com/~/media/McKinsey/Industries/Semiconductors/Our%20Insights/Artificial%20intelligence%20hardware%20New%20opportunities%20for%20semiconductor%20companies/Artificial-intelligence-hardware.ashx), the need for advanced systems to train ever-larger models is rapidly growing. However, inference computations already make up a dominant and increasing portion of total AI workloads, as shown in @fig-mckinsey. Running real-time inference with trained models--whether for image classification, speech recognition, or predictive analytics--invariably demands computing hardware like servers and chips. However, even a model handling thousands of facial recognition requests or natural language queries daily is dwarfed by massive platforms like Meta. Where inference on millions of photos and videos shared on social media, the infrastructure energy requirements continue to scale!
+This concept is reflected in the demand for training and inference hardware in data centers and on the edge. Inference refers to using a trained model to make predictions or decisions on real-world data. According to a [recent McKinsey analysis](https://www.mckinsey.com/~/media/McKinsey/Industries/Semiconductors/Our%20Insights/Artificial%20intelligence%20hardware%20New%20opportunities%20for%20semiconductor%20companies/Artificial-intelligence-hardware.ashx), the need for advanced systems to train ever-larger models is rapidly growing.
+
+However, inference computations already make up a dominant and increasing portion of total AI workloads, as shown in @fig-mckinsey. Running real-time inference with trained models--whether for image classification, speech recognition, or predictive analytics--invariably demands computing hardware like servers and chips. Yet even a model handling thousands of facial recognition requests or natural language queries daily is dwarfed by massive platforms like Meta, where inference on the millions of photos and videos shared on social media drives infrastructure energy requirements that continue to scale.

![Market size for inference and training hardware. 
Source: [McKinsey.](https://www.mckinsey.com/~/media/McKinsey/Industries/Semiconductors/Our%20Insights/Artificial%20intelligence%20hardware%20New%20opportunities%20for%20semiconductor%20companies/Artificial-intelligence-hardware.ashx)](images/png/mckinsey_analysis.png){#fig-mckinsey} @@ -429,7 +431,7 @@ Access to the right frameworks and tools is essential to effectively implementin Several software libraries and development environments are specifically tailored for Green AI. These tools often include features for optimizing AI models to reduce their computational load and, consequently, their energy consumption. For example, libraries in PyTorch and TensorFlow that support model pruning, quantization, and efficient neural network architectures enable developers to build AI systems that require less processing power and energy. Additionally, open-source communities like the [Green Software Foundation](https://github.com/Green-Software-Foundation) are creating a centralized carbon intensity metric and building software for carbon-aware computing. -Energy monitoring tools are crucial for Green AI, as they allow developers to measure and analyze the energy consumption of their AI systems. By providing detailed insights into where and how energy is being used, these tools enable developers to make informed decisions about optimizing their models for better energy efficiency. This can involve adjustments in algorithm design, hardware selection, cloud computing software selection, or operational parameters. @fig-azuredashboard is a screenshot of an energy consumption dashboard provided by Microsoft's cloud services platform. +Energy monitoring tools are crucial for Green AI, as they allow developers to measure and analyze the energy consumption of their AI systems. @fig-azuredashboard is a screenshot of an energy consumption dashboard provided by Microsoft's cloud services platform. By providing detailed insights into where and how energy is being used, these tools enable developers to make informed decisions about optimizing their models for better energy efficiency. This can involve adjustments in algorithm design, hardware selection, cloud computing software selection, or operational parameters. ![Microsoft Azure energy consumption dashboard. Source: [Will Buchanan.](https://techcommunity.microsoft.com/t5/green-tech-blog/charting-the-path-towards-sustainable-ai-with-azure-machine/ba-p/2866923)](images/png/azure_dashboard.png){#fig-azuredashboard} diff --git a/contents/training/training.qmd b/contents/training/training.qmd index 95c93fe7..fb47192f 100644 --- a/contents/training/training.qmd +++ b/contents/training/training.qmd @@ -72,10 +72,12 @@ How is this process defined mathematically? Formally, neural networks are mathem ### Neural Network Notation -Diving into the details, the core of a neural network can be viewed as a sequence of alternating linear and nonlinear operations, as show in @fig-neural-net-diagram: +The core of a neural network can be viewed as a sequence of alternating linear and nonlinear operations, as shown in @fig-neural-net-diagram. ![Neural network diagram. Source: astroML.](images/png/aitrainingnn.png){#fig-neural-net-diagram} +Neural networks are structured with layers of neurons connected by weights (representing linear operations) and activation functions (representing nonlinear operations). 
By examining the figure, we see how information flows through the network, starting from the input layer, passing through one or more hidden layers, and finally reaching the output layer. Each connection between neurons represents a weight, while each neuron typically applies a nonlinear activation function to its inputs.
+
The neural network operates by taking an input vector $x_i$ and passing it through a series of layers, each of which performs linear and non-linear operations. The output of the network at each layer $A_j$ can be represented as:

$$
@@ -1040,9 +1042,9 @@ The batch size used during neural network training and inference significantly i

Specifically, let's look at the arithmetic intensity of matrix multiplication during neural network training. This measures the ratio between computational operations and memory transfers. The matrix multiply of two matrices of size $N \times M$ and $M \times B$ requires $N \times M \times B$ multiply-accumulate operations, but only transfers of $N \times M + M \times B$ matrix elements.

-As we increase the batch size $B$, the number of arithmetic operations grows faster than the memory transfers. For example, with a batch size of 1, we need $N \times M$ operations and $N + M$ transfers, giving an arithmetic intensity ratio of around $\frac{N \times M}{N+M}$. But with a large batch size of 128, the intensity ratio becomes $\frac{128 \times N \times M}{N \times M + M \times 128} \approx 128$. Using a larger batch size shifts the overall computation from memory-bounded to more compute-bounded. AI training uses large batch sizes and is generally limited by peak arithmetic computational performance, i.e., Application 3 in @fig-roofline.
+As we increase the batch size $B$, the number of arithmetic operations grows faster than the memory transfers. For example, with a batch size of 1, we need $N \times M$ operations and $N \times M + M$ transfers, giving an arithmetic intensity ratio of around $\frac{N \times M}{N \times M + M} \approx 1$. But with a large batch size of 128, the intensity ratio becomes $\frac{128 \times N \times M}{N \times M + M \times 128} \approx 128$ when $N \gg 128$.

-Therefore, batched matrix multiplication is far more computationally intensive than memory access bound. This has implications for hardware design and software optimizations, which we will cover next. The key insight is that we can significantly alter the computational profile and bottlenecks posed by neural network training and inference by tuning the batch size.
+Using a larger batch size shifts the overall computation from memory-bounded to more compute-bounded. AI training uses large batch sizes and is generally limited by peak arithmetic computational performance, i.e., Application 3 in @fig-roofline. Therefore, batched matrix multiplication is far more computationally intensive than memory access bound. This has implications for hardware design and software optimizations, which we will cover next. The key insight is that we can significantly alter the computational profile and bottlenecks posed by neural network training and inference by tuning the batch size.
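These ratios are easy to check numerically. The short helper below applies the formula above to an illustrative 1024 x 4096 layer; the dimensions are arbitrary examples, not taken from a specific model.

```python
def arithmetic_intensity(n: int, m: int, b: int) -> float:
    """MAC operations per matrix element transferred for an (n x m) @ (m x b) matmul."""
    macs = n * m * b            # multiply-accumulate operations
    transfers = n * m + m * b   # elements of the two input matrices moved
    return macs / transfers

# Example layer: a 1024 x 4096 weight matrix at different batch sizes.
for batch in (1, 8, 32, 128):
    print(batch, round(arithmetic_intensity(1024, 4096, batch), 1))
# Prints roughly 1.0, 7.9, 31.0, 113.8: intensity grows with the batch size
# until it saturates near n, the non-batch matrix dimension.
```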
![AI training roofline model.](images/png/aitrainingroof.png){#fig-roofline} diff --git a/contents/workflow/workflow.qmd b/contents/workflow/workflow.qmd index fab73039..639f4095 100644 --- a/contents/workflow/workflow.qmd +++ b/contents/workflow/workflow.qmd @@ -10,10 +10,10 @@ Resources: [Slides](#sec-ai-workflow-resource), [Videos](#sec-ai-workflow-resour ![_DALL·E 3 Prompt: Create a rectangular illustration of a stylized flowchart representing the AI workflow/pipeline. From left to right, depict the stages as follows: 'Data Collection' with a database icon, 'Data Preprocessing' with a filter icon, 'Model Design' with a brain icon, 'Training' with a weight icon, 'Evaluation' with a checkmark, and 'Deployment' with a rocket. Connect each stage with arrows to guide the viewer horizontally through the AI processes, emphasizing these steps' sequential and interconnected nature._](images/png/cover_ai_workflow.png) -In this chapter, we'll explore the machine learning (ML) workflow, setting the stage for subsequent chapters that go deeper into the specifics. To ensure we see the bigger picture, this chapter offers a high-level overview of the steps involved in the ML workflow. - The ML workflow is a structured approach that guides professionals and researchers through developing, deploying, and maintaining ML models. This workflow is generally divided into several crucial stages, each contributing to the effective development of intelligent systems. +In this chapter, we will explore the machine learning workflow, setting the stage for subsequent chapters that go deeper into the specifics. This chapter focuses on presenting only a high-level overview of the steps involved in the ML workflow. + ::: {.callout-tip} ## Learning Objectives @@ -36,7 +36,7 @@ The ML workflow is a structured approach that guides professionals and researche ![Multi-step design methodology for the development of a machine learning model. Commonly referred to as the machine learning lifecycle.](images/png/ML_life_cycle.png){#fig-ml-life-cycle} -Developing a successful machine learning model requires a systematic workflow. This end-to-end process enables you to build, deploy, and maintain models effectively. As shown in @fig-ml-life-cycle, It typically involves the following key steps: +@fig-ml-life-cycle illustrates the systematic workflow required for developing a successful machine learning model. This end-to-end process, commonly referred to as the machine learning lifecycle, enables you to build, deploy, and maintain models effectively. It typically involves the following key steps: 1. **Problem Definition:** Start by clearly articulating the specific problem you want to solve. This focuses your efforts during data collection and model building. 2. **Data Collection and Preparation:** Gather relevant, high-quality training data that captures all aspects of the problem. Clean and preprocess the data to prepare it for modeling. @@ -94,7 +94,7 @@ The deployment phase often requires specialized hardware and infrastructure, as As models make decisions that can impact individuals and society, ethical and legal aspects of machine learning are becoming increasingly important. Ethicists and legal advisors are needed to ensure compliance with ethical standards and legal regulations. -@tbl-mlops_roles shows a rundown of the typical roles involved. While the lines between these roles can sometimes blur, the table below provides a general overview.
+Understanding the various roles involved in an ML project is crucial for its successful completion. @tbl-mlops_roles provides a general overview of these typical roles, although it's important to note that the lines between them can sometimes blur. Let's examine this breakdown: +----------------------------------------+----------------------------------------------------------------------------------------------------+ | Role | Responsibilities | @@ -128,13 +128,15 @@ As models make decisions that can impact individuals and society, ethical and le : Roles and responsibilities of people involved in MLOps. {#tbl-mlops_roles .striped .hover} -Understanding these roles is crucial for completing an ML project. As we proceed through the upcoming chapters, we'll explore each role's essence and expertise, fostering a comprehensive understanding of the complexities involved in embedded AI projects. This holistic view facilitates seamless collaboration and nurtures an environment ripe for innovation and breakthroughs. +As we proceed through the upcoming chapters, we will explore each role's essence and expertise and foster a deeper understanding of the complexities involved in AI projects. This holistic view facilitates seamless collaboration and nurtures an environment ripe for innovation and breakthroughs. ## Conclusion -This chapter has laid the foundation for understanding the machine learning workflow, a structured approach crucial for the development, deployment, and maintenance of ML models. By exploring the distinct stages of the ML lifecycle, we have gained insights into the unique challenges faced by traditional ML and embedded AI workflows, particularly in terms of resource optimization, real-time processing, data management, and hardware-software integration. These distinctions underscore the importance of tailoring workflows to meet the specific demands of the application environment. +This chapter has laid the foundation for understanding the machine learning workflow, a structured approach crucial for the development, deployment, and maintenance of ML models. We explored the unique challenges faced in ML workflows, where resource optimization, real-time processing, data management, and hardware-software integration are paramount. These distinctions underscore the importance of tailoring workflows to meet the specific demands of the application environment. + +Moreover, we emphasized the significance of multidisciplinary collaboration in ML projects. By examining the diverse roles involved, from data scientists to software engineers, we gained an overview of the teamwork necessary to navigate the experimental and resource-intensive nature of ML development. This understanding is crucial for fostering effective communication and collaboration across different domains of expertise. -The chapter emphasized the significance of multidisciplinary collaboration in ML projects. Understanding the diverse roles provides a comprehensive view of the teamwork necessary to navigate the experimental and resource-intensive nature of ML development. As we move forward to more detailed discussions in the subsequent chapters, this high-level overview equips us with a holistic perspective on the ML workflow and the various roles involved. +As we move forward to more detailed discussions in subsequent chapters, this high-level overview equips us with a holistic perspective on the ML workflow and the various roles involved. 
This foundation will prove important as we dive into specific aspects of machine learning, allowing us to contextualize advanced concepts within the broader framework of ML development and deployment. ## Resources {#sec-ai-workflow-resource}