From e0543379cada07db1806964d97f08354349a78b1 Mon Sep 17 00:00:00 2001 From: elizakimball Date: Wed, 13 Nov 2024 22:36:05 -0500 Subject: [PATCH] add sidenotes to frameworks and cleaned up bug fixes for sidenotes in previous chapters --- .../data_engineering/data_engineering.qmd | 3 + contents/core/frameworks/frameworks.qmd | 62 +++++++++++++------ contents/core/ml_systems/ml_systems.qmd | 8 ++- 3 files changed, 52 insertions(+), 21 deletions(-) diff --git a/contents/core/data_engineering/data_engineering.qmd b/contents/core/data_engineering/data_engineering.qmd index e72db4ae..a9f70aa0 100644 --- a/contents/core/data_engineering/data_engineering.qmd +++ b/contents/core/data_engineering/data_engineering.qmd @@ -115,6 +115,7 @@ In this context, using KWS as an example, we can break each of the steps out as * Environmental Challenges: Devices might be deployed in various environments, from quiet bedrooms to noisy industrial settings. The KWS system must be robust enough to function effectively across these scenarios. [^island]: The always-on island of the SoC refers to a subsystem that is specialized to handle low-power, always-on tasks within the embedded device such as wake-up commands. It continuously monitors specific sensors and controls the power management functions to wake up various components of the device when necessary. By allowing different power states for various components, the always-on island ensures efficient energy usage and quick response time. + 6. **Data Collection and Analysis:** For a KWS system, the quality and diversity of data are paramount. Considerations might include: * Variety of Accents: Collect data from speakers with various accents to ensure wide-ranging recognition. @@ -125,6 +126,7 @@ In this context, using KWS as an example, we can break each of the steps out as Once a prototype KWS system is developed, it's crucial to test it in real-world scenarios[^user-input], gather feedback, and iteratively refine the model. This ensures that the system remains aligned with the defined problem and objectives. This is important because the deployment scenarios change over time as things evolve. [^user-input]: When refining a model based on user input, it is essential to ensure privacy laws and regulations are followed. Additionally, the real-world environment may not be representative of the broader population, which can introduce biases into the system. + :::{#exr-kws .callout-caution collapse="true"} ### Keyword Spotting with TensorFlow Lite Micro @@ -234,6 +236,7 @@ Many embedded use cases deal with unique situations, such as manufacturing plant While synthetic data offers numerous advantages, it is essential to use it judiciously[^synethic-balance]. Care must be taken to ensure that the generated data accurately represents the underlying real-world distributions and does not introduce unintended biases. [^synethic-balance]: Synthetic data should be balanced with real-world data to ensure models remain reliable. If ML models are overly trained on synthetic data, the outputs may become nonsensical and the model may collapse.
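To make this balance concrete, the sketch below shows one simple way to cap the synthetic share of a training set so that the real-world distribution still dominates. It is only an illustration: the feature arrays and the 30% cap are hypothetical stand-ins, not values from this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical placeholders: in practice these would be feature vectors
# extracted from real recordings and from synthetically generated samples.
real_samples = rng.normal(size=(1000, 40))
synthetic_samples = rng.normal(size=(5000, 40))

# Cap synthetic data at roughly 30% of the final training set so the
# model stays anchored to the real-world distribution.
max_synthetic_fraction = 0.3
n_synthetic = int(len(real_samples) * max_synthetic_fraction / (1 - max_synthetic_fraction))
picked = rng.choice(len(synthetic_samples), size=n_synthetic, replace=False)

train_set = np.concatenate([real_samples, synthetic_samples[picked]])
rng.shuffle(train_set)  # shuffle along the first axis, in place
print(train_set.shape)  # (1428, 40) -> roughly 30% synthetic, 70% real
```

The exact ratio is application-dependent; the point is simply that synthetic data should augment, rather than replace, real-world data.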
+ :::{#exr-sd .callout-caution collapse="true"} ### Synthetic Data diff --git a/contents/core/frameworks/frameworks.qmd b/contents/core/frameworks/frameworks.qmd index 908d1b3f..5f223019 100644 --- a/contents/core/frameworks/frameworks.qmd +++ b/contents/core/frameworks/frameworks.qmd @@ -38,7 +38,9 @@ Furthermore, we investigate the specialization of frameworks tailored to specifi Machine learning frameworks provide the tools and infrastructure to efficiently build, train, and deploy machine learning models. In this chapter, we will explore the evolution and key capabilities of major frameworks like [TensorFlow (TF)](https://www.tensorflow.org/), [PyTorch](https://pytorch.org/), and specialized frameworks for embedded devices. We will dive into the components like computational graphs, optimization algorithms, hardware acceleration, and more that enable developers to construct performant models quickly. Understanding these frameworks is essential to leverage the power of deep learning across the spectrum from cloud to edge devices. -ML frameworks handle much of the complexity of model development through high-level APIs and domain-specific languages that allow practitioners to quickly construct models by combining pre-made components and abstractions. For example, frameworks like TensorFlow and PyTorch provide Python APIs to define neural network architectures using layers, optimizers, datasets, and more. This enables rapid iteration compared to coding every model detail from scratch. +ML frameworks handle much of the complexity of model development through high-level APIs and domain-specific languages that allow practitioners to quickly construct models by combining pre-made components and abstractions. For example, frameworks like **TensorFlow**[^tensor_frame] and PyTorch provide Python APIs to define neural network architectures using layers, optimizers, datasets, and more. This enables rapid iteration compared to coding every model detail from scratch. + +[^tensor_frame]: Google's open-source ML framework TensorFlow is popular for production deployment. Google offers mobile support through TensorFlow Lite, a lightweight version suitable for embedded devices. Meta AI's PyTorch is commonly used in research and considered easier for beginner Python developers due to its dynamic computational graphs and intuitive debugging. A key capability offered by these frameworks is distributed training engines that can scale model training across clusters of GPUs and TPUs. This makes it feasible to train state-of-the-art models with billions or trillions of parameters on vast datasets. Frameworks also integrate with specialized hardware like NVIDIA GPUs to further accelerate training via optimizations like parallelization and efficient matrix operations. @@ -58,7 +60,7 @@ The first ML frameworks, [Theano](https://pypi.org/project/Theano/#:~:text=Thean Many of these ML frameworks can be divided into high-level vs. low-level frameworks and static vs. dynamic computational graph frameworks. High-level frameworks provide a higher level of abstraction than low-level frameworks. High-level frameworks have pre-built functions and modules for common ML tasks, such as creating, training, and evaluating common ML models, preprocessing data, engineering features, and visualizing data, which low-level frameworks do not have.
Thus, high-level frameworks may be easier to use but are less customizable than low-level frameworks (i.e., users of low-level frameworks can define custom layers, loss functions, optimization algorithms, etc.). Examples of high-level frameworks include TensorFlow/Keras and PyTorch. Examples of low-level ML frameworks include TensorFlow with low-level APIs, Theano, Caffe, Chainer, and CNTK. -Frameworks like Theano and Caffe used static computational graphs, which required defining the full model architecture upfront, thus limiting flexibility. In contract, dynamic graphs are constructed on the fly for more iterative development. Around 2016, frameworks like PyTorch and TensorFlow 2.0 began adopting dynamic graphs, providing greater flexibility for model development. We will discuss these concepts and details later in the AI Training section. +Frameworks like Theano and Caffe used static computational graphs, which required defining the full model architecture upfront, thus limiting flexibility. In contrast, dynamic graphs are constructed on the fly for more iterative development. Around 2016, frameworks like PyTorch and TensorFlow 2.0 began adopting dynamic graphs, providing greater flexibility for model development. We will discuss these concepts and details later in the AI Training section. The development of these frameworks facilitated an explosion in model size and complexity over time---from early multilayer perceptrons and convolutional networks to modern transformers with billions or trillions of parameters. In 2016, ResNet models by @he2016deep achieved record ImageNet accuracy with over 150 layers and 25 million parameters. Then, in 2020, the GPT-3 language model from OpenAI [@brown2020language] pushed parameters to an astonishing 175 billion using model parallelism in frameworks to train across thousands of GPUs and TPUs. @@ -86,7 +88,10 @@ Today's advanced frameworks enable practitioners to develop and deploy increasin ## Deep Dive into TensorFlow {#sec-deep_dive_into_tensorflow} -TensorFlow was developed by the Google Brain team and was released as an open-source software library on November 9, 2015. It was designed for numerical computation using data flow graphs and has since become popular for a wide range of machine learning and deep learning applications. +TensorFlow was developed by the Google Brain team and was released as an open-source software library on November 9, 2015. It was designed for numerical computation using **data flow graphs**[^data-flow] and has since become popular for a wide range of **machine learning and deep learning**[^ml-dl] applications. + +[^data-flow]: In data flow graphs, nodes represent operations, and edges represent the data flowing between these operations. +[^ml-dl]: Machine learning is a broader concept within artificial intelligence that uses techniques to enable computers to make decisions or predictions based on what they have learned from data. Deep learning is a subset of ML that works with deep neural networks. Deep learning is a hierarchical framework, meaning it learns lower-level features (like edges in images) before higher-level features (like shapes or objects). It can understand complex patterns from vast amounts of data with less manual engineering compared to traditional ML approaches. Other subsets of ML are supervised, unsupervised, and reinforcement learning.
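As a rough illustration of this data flow idea (the snippet below is not from the chapter; the function, tensor shapes, and names are made up for demonstration), modern TensorFlow can trace a small Python function into a graph whose nodes are operations and whose edges are the tensors flowing between them:

```python
import tensorflow as tf

@tf.function
def affine(x, w, b):
    # Two operations: a matrix multiplication followed by an addition.
    return tf.matmul(x, w) + b

x = tf.random.normal([1, 4])
w = tf.random.normal([4, 2])
b = tf.zeros([2])

# Tracing the function yields a concrete data flow graph we can inspect.
graph = affine.get_concrete_function(x, w, b).graph
print([op.type for op in graph.get_operations()])
# Expect operation types such as 'MatMul' and 'AddV2' among the graph nodes.
```

Listing the operation types makes the footnote's point tangible: a familiar Python expression becomes nodes (operations) connected by edges (tensors).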
TensorFlow is a training and inference framework that provides built-in functionality to handle everything from model creation and training to deployment, as shown in @fig-tensorflow-architecture. Since its initial development, the TensorFlow ecosystem has grown to include many different "varieties" of TensorFlow, each intended to allow users to support ML on different platforms. In this section, we will mainly discuss only the core package. @@ -102,7 +107,9 @@ TensorFlow is a training and inference framework that provides built-in function 5. [TensorFlow on Edge Devices (Coral)](https://developers.googleblog.com/2019/03/introducing-coral-our-platform-for.html): platform of hardware components and software tools from Google that allows the execution of TensorFlow models on edge devices, leveraging Edge TPUs for acceleration. -6. [TensorFlow Federated (TFF)](https://www.tensorflow.org/federated): framework for machine learning and other computations on decentralized data. TFF facilitates federated learning, allowing model training across many devices without centralizing the data. +6. [TensorFlow Federated (TFF)](https://www.tensorflow.org/federated): framework for machine learning and other computations on decentralized data. TFF facilitates **federated learning**[^fed-learn], allowing model training across many devices without centralizing the data. + +[^fed-learn]: In federated learning, multiple entities (referred to as clients) train a model on their local datasets which ensures their data remains decentralized. This technique in ML is motivated by issues such as data privacy and data minimization. The assumption that the data is independently and identically distributed is no longer valid in federated learning which may cause biased local models. 7. [TensorFlow Graphics](https://www.tensorflow.org/graphics): library for using TensorFlow to carry out graphics-related tasks, including 3D shapes and point clouds processing, using deep learning. @@ -145,7 +152,9 @@ DistBelief and its architecture defined above were crucial in enabling distribut ### Static Computation Graph -Model parameters are distributed across various parameter servers in the parameter server architecture. Since DistBelief was primarily designed for the neural network paradigm, parameters corresponded to a fixed neural network structure. If the computation graph were dynamic, the distribution and coordination of parameters would become significantly more complicated. For example, a change in the graph might require the initialization of new parameters or the removal of existing ones, complicating the management and synchronization tasks of the parameter servers. This made it harder to implement models outside the neural framework or models that required dynamic computation graphs. +Model parameters are distributed across various parameter servers in the parameter server architecture. Since DistBelief was primarily designed for the neural network paradigm, parameters corresponded to a fixed neural network structure. If the **computation graphs**[^dynamic-comp-graph] were dynamic, the distribution and coordination of parameters would become significantly more complicated. For example, a change in the graph might require the initialization of new parameters or the removal of existing ones, complicating the management and synchronization tasks of the parameter servers. This made it harder to implement models outside the neural framework or models that required dynamic computation graphs. 
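To make the contrast concrete, here is a small, hypothetical define-by-run sketch (written in PyTorch purely for illustration; it is not from the original text) whose computation graph changes with the input: the number of times the hidden layer is applied depends on the data, so the full graph cannot be declared upfront the way a fixed, parameter-server-friendly architecture can.

```python
import torch
import torch.nn as nn

class DynamicDepthNet(nn.Module):
    """Toy model whose graph is rebuilt on every forward pass:
    the number of hidden-layer applications depends on the input."""
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(16, 16)
        self.out = nn.Linear(16, 2)

    def forward(self, x):
        # Data-dependent control flow: deeper computation for "larger" inputs.
        n_steps = 1 if x.abs().mean() < 0.5 else 3
        for _ in range(n_steps):
            x = torch.relu(self.hidden(x))
        return self.out(x)

model = DynamicDepthNet()
print(model(torch.randn(4, 16)).shape)  # torch.Size([4, 2])
```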
+ +[^dynamic-comp-graph]: Computation graphs are used to visualize the sequence of operations in a given model and to facilitate automatic differentiation which trains models through backpropagation. TensorFlow was designed as a more general computation framework that expresses computation as a data flow graph. This allows for a wider variety of machine learning models and algorithms outside of neural networks and provides flexibility in refining models. @@ -159,7 +168,7 @@ TensorFlow was built to run on multiple platforms, from mobile devices and edge Rather than using the parameter server architecture, TensorFlow deploys tasks across a cluster. These tasks are named processes that can communicate over a network, and each can execute TensorFlow's core construct, the dataflow graph, and interface with various computing devices (like CPUs or GPUs). This graph is a directed representation where nodes symbolize computational operations, and edges depict the tensors (data) flowing between these operations. -Despite the absence of traditional parameter servers, some "PS tasks" still store and manage parameters reminiscent of parameter servers in other systems. The remaining tasks, which usually handle computation, data processing, and gradient calculations, are referred to as "worker tasks." TensorFlow's PS tasks can execute any computation representable by the dataflow graph, meaning they aren't just limited to parameter storage, and the computation can be distributed. This capability makes them significantly more versatile and gives users the power to program the PS tasks using the standard TensorFlow interface, the same one they'd use to define their models. As mentioned above, dataflow graphs' structure also makes them inherently good for parallelism, allowing for the processing of large datasets. +Despite the absence of traditional parameter servers, some "PS tasks" still store and manage parameters reminiscent of parameter servers in other systems. The remaining tasks, which usually handle computation, data processing, and gradient calculations, are referred to as "worker tasks". TensorFlow's PS tasks can execute any computation representable by the dataflow graph, meaning they aren't just limited to parameter storage, and the computation can be distributed. This capability makes them significantly more versatile and gives users the power to program the PS tasks using the standard TensorFlow interface, the same one they'd use to define their models. As mentioned above, dataflow graphs' structure also makes them inherently good for parallelism, allowing for the processing of large datasets. ### Built-in Functionality & Keras @@ -337,9 +346,11 @@ We implicitly construct a computational graph when defining a neural network arc * Automatic differentiation for training -* Language agnosticism - graph can be translated to run on GPUs, TPUs, etc. +* **Language agnosticism**[^lang-agnos] - a graph can be translated to run on GPUs, TPUs, etc. + +[^lang-agnos]: By having computational graphs be language agnostic, models can be developed in one language and deployed in another. This enhances the usability of these models across different ecosystems. -* Portability - graph can be serialized, saved, and restored later +* Portability - a graph can be serialized, saved, and restored later Computational graphs are the fundamental building blocks of ML frameworks. 
Model definition via high-level abstractions creates a computational graph—the layers, activations, and architectures we use become graph nodes and edges. The framework compilers and optimizers operate on this graph to generate executable code. The abstractions provide a developer-friendly API for building computational graphs. Under the hood, it's still graphs all the way down! So, while you may not directly manipulate graphs as a framework user, they enable your high-level model specifications to be efficiently executed. The abstractions simplify model-building, while computational graphs make it possible. @@ -358,7 +369,7 @@ y = tf.matmul(x, weights) + biases In this example, x is a placeholder for input data, and y is the result of a matrix multiplication operation followed by an addition. The model is defined in this declaration phase, where all operations and variables must be specified upfront. -Once the entire graph is defined, the framework compiles and optimizes it. This means that the computational steps are set in stone, and the framework can apply various optimizations to improve efficiency and performance. When you later execute the graph, you provide the actual input tensors, and the pre-defined operations are carried out in the optimized sequence. +Once the entire graph is defined, the framework compiles and optimizes it. This means that the computational steps are set in stone, and the framework can apply various optimizations to improve efficiency and performance. When you later execute the graph, you provide the actual input tensors, and the predefined operations are carried out in the optimized sequence. This approach is similar to building a blueprint where every detail is planned before construction begins. While this allows for powerful optimizations, it also means that any changes to the model require redefining the entire graph from scratch. @@ -371,7 +382,9 @@ x = torch.randn(4,784) y = torch.matmul(x, weights) + biases ``` -The above example does not have separate compile/build/run phases. Ops define and execute immediately. With dynamic graphs, the definition is intertwined with execution, providing a more intuitive, interactive workflow. However, the downside is that there is less potential for optimization since the framework only sees the graph as it is built. @fig-static-vs-dynamic demonstrates the differences between a static and dynamic computation graph. +The above example does not have separate compile/build/run phases. Operations are **defined and executed**[^def-exec] immediately. With dynamic graphs, the definition is intertwined with execution, providing a more intuitive, interactive workflow. However, the downside is that there is less potential for optimization since the framework only sees the graph as it is built. @fig-static-vs-dynamic demonstrates the differences between a static and dynamic computation graph. + +[^def-exec]: Operations are defined to specify the computational graph's structure and subsequently the model's architecture. This can include how data will move through the model, data transformations, and what types of outputs are expected. ![Comparing static and dynamic graphs. Source: [Dev](https://www.google.com/url?sa=i&url=https%3A%2F%2Fdev-jm.tistory.com%2F4&psig=AOvVaw0r1cZbZa6iImYP-fesrN7H&ust=1722533107591000&source=images&cd=vfe&opi=89978449&ved=0CBQQjhxqFwoTCLC8nYHm0YcDFQAAAAAdAAAAABAY)](images/png/staticvsdynamic.png){#fig-static-vs-dynamic} @@ -382,7 +395,7 @@ Recently, the distinction has blurred as frameworks adopt both modes.
TensorFlow +:===================================+:====================================================+:===========================================================+ | Static (Declare-then-execute) | - Enable graph optimizations by seeing full model | - Less flexible for research and iteration | | | ahead of time | - Changes require rebuilding graph | -| | - Can export and deploy frozen graphs | - Execution has separate compile and run phases | +| | - Can export and deploy **frozen graphs**[^frozen] | - Execution has separate compile and run phases | | | - Graph is packaged independently of code | | +------------------------------------+-----------------------------------------------------+------------------------------------------------------------+ | Dynamic (Define-by-run) | - Intuitive imperative style like Python code | - Harder to optimize without full graph | @@ -393,6 +406,8 @@ Recently, the distinction has blurred as frameworks adopt both modes. TensorFlow : Comparison between Static (Declare-then-execute) and Dynamic (Define-by-run) Execution Graphs, highlighting their respective pros and cons. {#tbl-exec-graph .striped .hover} +[^frozen]: In frozen graphs, all variables are treated as constants, so no further training is possible. The file sizes are smaller, making them suitable for edge devices. These graphs are optimized for inference which results in faster execution times, lower memory usage, and more predictable performance. + ### Data Pipeline Tools Computational graphs can only be as good as the data they learn from and work on. Therefore, feeding training data efficiently is crucial for optimizing deep neural network performance, though it is often overlooked as one of the core functionalities. Many modern AI frameworks provide specialized pipelines to ingest, process, and augment datasets for model training. @@ -400,7 +415,7 @@ Computational graphs can only be as good as the data they learn from and work on #### Data Loaders {#sec-frameworks-data-loaders} -At the core of these pipelines are data loaders, which handle reading training examples from sources like files, databases, and object storage. Data loaders facilitate efficient data loading and preprocessing, crucial for deep learning models. For instance, TensorFlow's [tf.data](https://www.tensorflow.org/guide/data) dataloading pipeline is designed to manage this process. Depending on the application, deep learning models require diverse data formats such as CSV files or image folders. Some popular formats include: +At the core of these pipelines are data loaders, which handle reading training examples from sources like files, databases, and object storage. Data loaders facilitate efficient data loading and preprocessing, crucial for deep learning models. For instance, TensorFlow's [tf.data](https://www.tensorflow.org/guide/data) data loading pipeline is designed to manage this process. Depending on the application, deep learning models require diverse data formats such as CSV files or image folders. Some popular formats include: * **CSV**: A versatile, simple format often used for tabular data. @@ -432,7 +447,9 @@ These hands-off data pipelines represent a significant improvement in usability Training a neural network is fundamentally an iterative process that seeks to minimize a loss function. The goal is to fine-tune the model weights and parameters to produce predictions close to the true target labels. 
Machine learning frameworks have greatly streamlined this process by offering loss functions and optimization algorithms. -Machine learning frameworks provide implemented loss functions that are needed for quantifying the difference between the model's predictions and the true values. Different datasets require a different loss function to perform properly, as the loss function tells the computer the "objective" for it to aim. Commonly used loss functions include Mean Squared Error (MSE) for regression tasks, Cross-Entropy Loss for classification tasks, and Kullback-Leibler (KL) Divergence for probabilistic models. For instance, TensorFlow's [tf.keras.losses](https://www.tensorflow.org/api_docs/python/tf/keras/losses) holds a suite of these commonly used loss functions. +Machine learning frameworks provide implemented loss functions that are needed for quantifying the difference between the model's predictions and the true values. Different tasks and datasets require different loss functions to perform properly, as the loss function defines the "objective" the model aims to optimize. Commonly used loss functions include Mean Squared Error (MSE) for regression tasks, Cross-Entropy Loss for classification tasks, and Kullback-Leibler (KL) Divergence for probabilistic models.[^specific-loss] For instance, TensorFlow's [tf.keras.losses](https://www.tensorflow.org/api_docs/python/tf/keras/losses) holds a suite of these commonly used loss functions. + +[^specific-loss]: Loss functions are chosen based on the type of task and the characteristics of the dataset. If an unsuitable loss function is used, poor model performance or misaligned optimization objectives can occur. This can cause slow convergence, failure to learn relevant features, or poor generalization to new data. Optimization algorithms are used to efficiently find the set of model parameters that minimize the loss function, ensuring the model performs well on training data and generalizes to new data. Modern frameworks come equipped with efficient implementations of several optimization algorithms, many of which are variants of gradient descent with stochastic methods and adaptive learning rates. Some examples of these variants are Stochastic Gradient Descent, Adagrad, Adadelta, and Adam. The implementations of such variants are provided in [tf.keras.optimizers](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers). More information with clear examples can be found in the AI Training section. @@ -446,9 +463,11 @@ The next critical step is memory allocation. Essential memory is reserved for th The training process employs various tools to improve efficiency. Batch processing is commonly used to maximize computational throughput. Techniques like vectorization enable operations on entire data arrays rather than proceeding element-wise, which bolsters speed. Optimizations such as kernel fusion (refer to the Optimizations chapter) amalgamate multiple operations into a single action, minimizing computational overhead. Operations can also be segmented into phases, facilitating the concurrent processing of different mini-batches at various stages. -Frameworks consistently checkpoint the state, preserving intermediate model versions during training. This ensures that progress is recovered if an interruption occurs, and training can be recommenced from the last checkpoint. Additionally, the system vigilantly monitors the model's performance against a validation data set.
Should the model begin to overfit (if its performance on the validation set declines), training is automatically halted, conserving computational resources and time. +Frameworks consistently checkpoint the state, preserving intermediate model versions during training. This ensures that progress is saved if an interruption occurs, and training can be recommenced from the last checkpoint. Additionally, the system vigilantly monitors the model's performance against a validation dataset. Should the model begin to overfit (if its performance on the validation set declines), training is automatically halted, conserving computational resources and time. + +ML frameworks incorporate a blend of model compilation, enhanced batch processing methods, and utilities such as checkpointing and **early stopping**.[^early-stop] These resources manage the complex aspects of performance, enabling practitioners to zero in on model development and training. As a result, developers experience both speed and ease when utilizing neural networks' capabilities. -ML frameworks incorporate a blend of model compilation, enhanced batch processing methods, and utilities such as checkpointing and early stopping. These resources manage the complex aspects of performance, enabling practitioners to zero in on model development and training. As a result, developers experience both speed and ease when utilizing neural networks' capabilities. +[^early-stop]: Early stopping is an ML technique used to prevent overfitting by stopping a model's training when the performance on the validation dataset starts to decline. ### Validation and Analysis @@ -498,7 +517,10 @@ We can use four primary methods to make computers take derivatives. First, we ca ### Hardware Acceleration -The trend to continuously train and deploy larger machine-learning models has made hardware acceleration support necessary for machine-learning platforms. @fig-hardware-accelerator shows the large number of companies that are offering hardware accelerators in different domains, such as "Very Low Power" and "Embedded" machine learning. Deep layers of neural networks require many matrix multiplications, which attract hardware that can compute matrix operations quickly and in parallel. In this landscape, two hardware architectures, the [GPU and TPU](https://cloud.google.com/tpu/docs/intro-to-tpu), have emerged as leading choices for training machine learning models. +The trend to continuously train and deploy larger machine-learning models has made hardware acceleration support necessary for machine-learning platforms. @fig-hardware-accelerator shows the large number of companies that are offering **hardware accelerators**[^hard-accel] in different domains, such as "Very Low Power" and "Embedded" machine learning. Deep layers of neural networks require many matrix multiplications, which attract hardware that can compute matrix operations quickly and in parallel. In this landscape, two hardware architectures, the [**GPU and TPU**](https://cloud.google.com/tpu/docs/intro-to-tpu)[^gpu-cpu], have emerged as leading choices for training machine learning models. + +[^hard-accel]: Hardware accelerators are specialized systems that perform computing tasks more efficiently than central processing units (CPUs). These accelerators speed up the computation by allowing greater concurrency, optimized matrix operations, simpler control logic, and dedicated memory architecture.
Each processing unit is more specialized than a CPU core, so more units can be fit on a chip and run in unison. +[^gpu-cpu]: GPUs are designed for rendering graphics and are heavily used for parallel processing. TPUs were developed by Google for fast matrix multiplication and deep learning tasks. The use of hardware accelerators began with [AlexNet](https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf), which paved the way for future works to use GPUs as hardware accelerators for training computer vision models. GPUs, or Graphics Processing Units, excel in handling many computations at once, making them ideal for the matrix operations central to neural network training. Their architecture, designed for rendering graphics, is perfect for the mathematical operations required in machine learning. While they are very useful for machine learning tasks and have been implemented in many hardware platforms, GPUs are still general purpose in that they can be used for other applications. @@ -526,7 +548,9 @@ ML frameworks that support distributed learning include TensorFlow (through its Machine learning models have various methods to be represented and used within different frameworks and for different device types. For example, a model can be converted to be compatible with inference frameworks within the mobile device. The default format for TensorFlow models is checkpoint files containing weights and architectures, which are needed to retrain the models. However, models are typically converted to TensorFlow Lite format for mobile deployment. TensorFlow Lite uses a compact flat buffer representation and optimizations for fast inference on mobile hardware, discarding all the unnecessary baggage associated with training metadata, such as checkpoint file structures. -Model optimizations like quantization (see [Optimizations](../optimizations/optimizations.qmd) chapter) can further optimize models for target architectures like mobile. This reduces the precision of weights and activations to `uint8` or `int8` for a smaller footprint and faster execution with supported hardware accelerators. For post-training quantization, TensorFlow's converter handles analysis and conversion automatically. +Model optimizations like **quantization**[^quanti] (see [Optimizations](../optimizations/optimizations.qmd) chapter) can further optimize models for target architectures like mobile. This reduces the precision of weights and activations to `uint8` or `int8` for a smaller footprint and faster execution with supported hardware accelerators. For post-training quantization, TensorFlow's converter handles analysis and conversion automatically. + +[^quanti]: Quantization reduces the precision of weights and activations to make models smaller and faster. Frameworks like TensorFlow simplify deploying trained models to mobile and embedded IoT devices through easy conversion APIs for TFLite format and quantization. Ready-to-use conversion enables high-performance inference on mobile without a manual optimization burden. Besides TFLite, other common targets include TensorFlow.js for web deployment, TensorFlow Serving for cloud services, and TensorFlow Hub for transfer learning. TensorFlow's conversion utilities handle these scenarios to streamline end-to-end workflows. @@ -614,7 +638,9 @@ Embedded systems face severe resource constraints that pose unique challenges wh These tight constraints often make training machine learning models directly on microcontrollers infeasible.
The limited RAM precludes handling large datasets for training. Energy usage for training would also quickly deplete battery-powered devices. Instead, models are trained on resource-rich systems and deployed on microcontrollers for optimized inference. But even inference poses challenges: -1. **Model Size:** AI models are too large to fit on embedded and IoT devices. This necessitates model compression techniques, such as quantization, pruning, and knowledge distillation. Additionally, as we will see, many of the frameworks used by developers for AI development have large amounts of overhead and built-in libraries that embedded systems can't support. +1. **Model Size:** AI models are too large to fit on embedded and IoT devices. This necessitates model compression techniques, such as quantization, pruning, and **knowledge distillation**[^know-distill]. Additionally, as we will see, many of the frameworks used by developers for AI development have large amounts of overhead and built-in libraries that embedded systems can't support. + +[^know-distill]: Knowledge distillation is an ML technique that transfers knowledge from a large, pre-trained model to a smaller model. While transfer learning also involves sharing knowledge between models, its goal is to reduce training time for new tasks by leveraging knowledge from similar, previously learned tasks. The new model in the transfer learning context will be similar in architecture to the pre-trained network, but the weights will be changed to accommodate the specific task. In contrast, knowledge distillation transfers the generalizations of a complex model to a smaller model rather than transferring the weights. 2. **Complexity of Tasks:** With only tens of KBs to a few MBs of RAM, IoT devices and embedded systems are constrained in the complexity of tasks they can handle. Tasks that require large datasets or sophisticated algorithms—for example, LLMs—that would run smoothly on traditional computing platforms might be infeasible on embedded systems without compression or other optimization techniques due to memory limitations. diff --git a/contents/core/ml_systems/ml_systems.qmd b/contents/core/ml_systems/ml_systems.qmd index 36c4f990..634ab155 100644 --- a/contents/core/ml_systems/ml_systems.qmd +++ b/contents/core/ml_systems/ml_systems.qmd @@ -70,13 +70,13 @@ Cloud ML leverages powerful servers in the cloud for training and running large, **Definition of Cloud ML** -Cloud Machine Learning (Cloud ML) is a subfield of machine learning that leverages the power and scalability of **cloud** [^defn-cloud] computing infrastructure to develop, train, and deploy machine learning models. By utilizing the vast computational resources available in the cloud, Cloud ML enables the efficient handling of large-scale datasets and complex machine learning algorithms. +Cloud Machine Learning (Cloud ML) is a subfield of machine learning that leverages the power and scalability of **cloud**[^defn-cloud] computing infrastructure to develop, train, and deploy machine learning models. By utilizing the vast computational resources available in the cloud, Cloud ML enables the efficient handling of large-scale datasets and complex machine learning algorithms. [^defn-cloud]: The cloud refers to remote computing servers that offer the storage, compute, and services used by these ML models. **Centralized Infrastructure** -One of the key characteristics of Cloud ML is its centralized infrastructure.
@fig-cloudml-example illustrates this concept with an example from Google's Cloud TPU[^defn-tpu] data center. Cloud service providers offer a **virtual platform** [^defn-virplat] that consists of high-capacity servers, expansive storage solutions, and robust networking architectures, all housed in data centers distributed across the globe. As shown in the figure, these centralized facilities can be massive in scale, housing rows upon rows of specialized hardware. This centralized setup allows for the pooling and efficient management of computational resources, making it easier to scale machine learning projects as needed.[^ci-lim] +One of the key characteristics of Cloud ML is its centralized infrastructure. @fig-cloudml-example illustrates this concept with an example from Google's Cloud TPU[^defn-tpu] data center. Cloud service providers offer a **virtual platform**[^defn-virplat] that consists of high-capacity servers, expansive storage solutions, and robust networking architectures, all housed in data centers distributed across the globe. As shown in the figure, these centralized facilities can be massive in scale, housing rows upon rows of specialized hardware. This centralized setup allows for the pooling and efficient management of computational resources, making it easier to scale machine learning projects as needed.[^ci-lim] [^defn-tpu]: Tensor Processing Units (TPUs) are Google's custom-designed AI accelerators, which are optimized for high-volume computations in deep learning tasks. [^defn-virplat]: Virtual platforms allow the complex details of the physical hardware to be hidden from the user while the user interacts with software interfaces that simulate hardware. This is often done through virtual machines or containers. By abstracting hardware into a virtual platform, the cloud can manage resources more efficiently. The cloud system can adjust the allocation between multiple users automatically without getting the users involved. @@ -161,9 +161,10 @@ Cloud ML has found widespread adoption across various domains, revolutionizing t **Virtual Assistants** -Cloud ML plays a crucial role in powering virtual assistants like Siri and Alexa. These systems leverage the immense computational capabilities of the cloud to process and analyze voice inputs in real-time. By harnessing the power of natural language processing and machine learning algorithms, virtual assistants can understand user queries, extract relevant information, and generate intelligent and personalized responses. The cloud's scalability and processing power enable these assistants to handle a vast number of user interactions simultaneously, providing a seamless and responsive user experience. [^virt-ass] +Cloud ML plays a crucial role in powering virtual assistants like Siri and Alexa. These systems leverage the immense computational capabilities of the cloud to process and analyze voice inputs in real-time. By harnessing the power of natural language processing and machine learning algorithms, virtual assistants can understand user queries, extract relevant information, and generate intelligent and personalized responses. The cloud's scalability and processing power enable these assistants to handle a vast number of user interactions simultaneously, providing a seamless and responsive user experience.[^virt-ass] [^virt-ass]: Virtual assistants demonstrate a hybrid approach of using both Tiny ML and Cloud ML.
They use Tiny ML for the local processing of wake words such as "Hey Siri," which can be done offline and preserves battery life. However, the complex queries, natural language comprehension, and accurate responses of these assistants require Cloud ML to step in. + **Recommendation Systems** Cloud ML forms the backbone of advanced recommendation systems used by platforms like Netflix and Amazon. These systems use the cloud's ability to process and analyze massive datasets to uncover patterns, preferences, and user behavior. By leveraging collaborative filtering and other machine learning techniques, recommendation systems can offer personalized content or product suggestions tailored to each user's interests. The cloud's scalability allows these systems to continuously update and refine their recommendations based on the ever-growing amount of user data, enhancing user engagement and satisfaction. @@ -225,6 +226,7 @@ However, Edge ML has its challenges. One of the main concerns is the limited com Managing a network of **edge nodes**[^defn-edge-node] can introduce complexity, especially regarding coordination, updates, and maintenance. Ensuring all nodes operate seamlessly and are up-to-date with the latest algorithms and security protocols can be a logistical challenge. [^defn-edge-node]: An edge node is a computing device that acts as a bridge between a local device and the cloud. + **Security Concerns at the Edge Nodes** While Edge ML offers enhanced data privacy, edge nodes can sometimes be more vulnerable to physical and cyber-attacks. Developing robust security protocols that protect data at each node without compromising the system's efficiency remains a significant challenge in deploying Edge ML solutions.