Merge pull request #97 from zishenwan/fig_add
Adding figures for embedded_ai, ai_workflow, data_engineering chapters
profvjreddi authored Dec 7, 2023
2 parents 247b51c + 95b335c commit 0fc7844
Showing 12 changed files with 44 additions and 29 deletions.
12 changes: 8 additions & 4 deletions data_engineering.qmd
@@ -171,7 +171,7 @@ Thus, while crowdsourcing can work well in many cases, the specialized needs of

### Synthetic Data

Synthetic data generation can be useful for addressing some of the limitations of data collection. It involves creating data that wasn’t originally captured or observed, but is generated using algorithms, simulations, or other techniques to resemble real-world data. It has become a valuable tool in various fields, particularly in scenarios where real-world data is scarce, expensive, or ethically challenging to obtain (e.g., TinyML). Various techniques, such as Generative Adversarial Networks (GANs), can produce high-quality synthetic data that is almost indistinguishable from real data. These techniques have advanced significantly, making synthetic data generation increasingly realistic and reliable.
Synthetic data generation can be useful for addressing some of the limitations of data collection. It involves creating data that wasn’t originally captured or observed, but is generated using algorithms, simulations, or other techniques to resemble real-world data (@fig-synthetic-data). It has become a valuable tool in various fields, particularly in scenarios where real-world data is scarce, expensive, or ethically challenging to obtain (e.g., TinyML). Various techniques, such as Generative Adversarial Networks (GANs), can produce high-quality synthetic data that is almost indistinguishable from real data. These techniques have advanced significantly, making synthetic data generation increasingly realistic and reliable.

In many domains, especially emerging ones, there may not be enough real-world data available for analysis or training machine learning models. Synthetic data can fill this gap by producing large volumes of data that mimic real-world scenarios. For instance, detecting the sound of breaking glass might be challenging in security applications where a TinyML device is trying to identify break-ins. Collecting real-world data would require breaking numerous windows, which is impractical and costly.
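Where collecting real events is impractical, a simulation can stand in. The sketch below is an illustrative toy, not a production pipeline; all parameter values (burst decay, noise levels, clip length) are assumptions. It fabricates "glass break"-like audio clips by overlaying a sharply decaying noise burst on quiet background noise:

```python
import numpy as np

def synthesize_glass_break(n_samples=100, length=16000, seed=0):
    """Generate toy 'glass break' audio clips: a sharp noise burst with
    exponential decay mixed over low-level background noise.
    Illustrative simulation sketch only; parameters are assumed values."""
    rng = np.random.default_rng(seed)
    t = np.arange(length)
    clips = []
    for _ in range(n_samples):
        background = 0.01 * rng.standard_normal(length)   # quiet room noise
        onset = rng.integers(0, length // 2)              # random event position
        burst = rng.standard_normal(length) * np.exp(-(t - onset) / 500.0)
        burst[t < onset] = 0.0                            # silence before the event
        clips.append(background + burst)
    return np.stack(clips)

X = synthesize_glass_break()
print(X.shape)  # (100, 16000)
```

Each synthetic clip can then be labeled automatically (the event position is known by construction), which is exactly the kind of cheap, abundant supervision that real-world collection cannot provide here.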

@@ -185,6 +185,8 @@ Many embedded use-cases deal with unique situations, such as manufacturing plant

While synthetic data offers numerous advantages, it is essential to use it judiciously. Care must be taken to ensure that the generated data accurately represents the underlying real-world distributions and does not introduce unintended biases.

![Enhancing real-world data with additional synthetic data for training data-hungry ML models (Source: [AnyLogic](https://www.anylogic.com/features/artificial-intelligence/synthetic-data/))](images/data_engineering/synthetic_data.jpg){#fig-synthetic-data}

## Data Storage

Data sourcing and data storage go hand-in-hand and it is necessary to store data in a format that facilitates easy access and processing. Depending on the use case, there are various kinds of data storage systems that can be used to store your datasets. Some examples are shown in @tbl-databases.
@@ -212,17 +214,19 @@ Data sourcing and data storage go hand-in-hand and it is necessary to store data

The stored data is often accompanied by metadata, which is defined as 'data about data'. It provides detailed contextual information about the data, such as means of data creation, time of creation, attached data use license, etc. For example, [[Hugging Face]{.underline}](https://huggingface.co/) has [[Dataset Cards]{.underline}](https://huggingface.co/docs/hub/datasets-cards). To promote responsible data use, dataset creators should disclose potential biases through the dataset cards. These cards can educate users about a dataset's contents and limitations. The cards also give vital context on appropriate dataset usage by highlighting biases and other important details. Having this type of metadata can also allow fast retrieval if structured properly. Once the model is developed and deployed to edge devices, the storage systems can continue to store incoming data, model updates, or analytical results.
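As a rough illustration of the idea, such metadata can be kept as a structured record that tooling can validate before a dataset is published. The field names below are illustrative assumptions, not the official Hugging Face Dataset Card schema:

```python
# An illustrative metadata record in the spirit of a dataset card.
# Field names and values here are assumptions for demonstration only.
dataset_card = {
    "name": "glass-break-audio",
    "created": "2023-12-07",
    "creator": "example-lab",
    "license": "CC-BY-4.0",
    "collection_method": "synthetic simulation plus field recordings",
    "known_biases": ["indoor recordings only", "single microphone model"],
    "intended_use": "training TinyML acoustic event detectors",
}

def missing_fields(card, required=("name", "license", "known_biases")):
    """Flag required disclosure fields that are absent or empty."""
    return [f for f in required if not card.get(f)]

print(missing_fields(dataset_card))  # []
```

A check like `missing_fields` could run in CI so that datasets lacking a license or a bias disclosure are caught before release.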

**Data Governance:** With a large amount of data storage, it is also imperative to have policies and practices (i.e., data governance) that helps manage data during its life cycle, from acquisition to disposal. Data governance frames the way data is managed and includes making pivotal decisions about data access and control. It involves exercising authority and making decisions concerning data, with the aim to uphold its quality, ensure compliance, maintain security, and derive value. Data governance is operationalized through the development of policies, incentives, and penalties, cultivating a culture that perceives data as a valuable asset. Specific procedures and assigned authorities are implemented to safeguard data quality and monitor its utilization and the related risks.
**Data Governance:** With a large amount of stored data, it is also imperative to have policies and practices (i.e., data governance) that help manage data during its life cycle, from acquisition to disposal. Data governance frames the way data is managed and includes making pivotal decisions about data access and control (@fig-governance). It involves exercising authority and making decisions concerning data, with the aim to uphold its quality, ensure compliance, maintain security, and derive value. Data governance is operationalized through the development of policies, incentives, and penalties, cultivating a culture that perceives data as a valuable asset. Specific procedures and assigned authorities are implemented to safeguard data quality and monitor its utilization and the related risks.

Data governance (see @fig-governance) utilizes three integrative approaches: planning and control, organizational, and risk-based.
Data governance utilizes three integrative approaches: planning and control, organizational, and risk-based.

* **The planning and control approach**, common in IT, aligns business and technology through annual cycles and continuous adjustments, focusing on policy-driven, auditable governance.

* **The organizational approach** emphasizes structure, establishing authoritative roles like Chief Data Officers, ensuring responsibility and accountability in governance.

* **The risk-based approach**, intensified by AI advancements, focuses on identifying and managing the risks inherent in data and algorithms. It addresses AI-specific issues through regular assessments and proactive risk management, allowing both preventive and incident-driven actions to mitigate undesired algorithmic impacts.

![Comprehensive overview of the data governance framework.](https://www.databricks.com/en-website-assets/static/b9963e8f428f6bb9e0d3fc6f7b8b9453/c742b/key-elements-of-data-governance.webp){#fig-governance}
<!-- ![Comprehensive overview of the data governance framework.](https://www.databricks.com/en-website-assets/static/b9963e8f428f6bb9e0d3fc6f7b8b9453/c742b/key-elements-of-data-governance.webp){#fig-governance} -->

![An overview of the data governance framework (Source: [StarCIO](https://www.groundwatergovernance.org/the-importance-of-governance-for-all-stakeholders/))](images/data_engineering/data_governance.jpg){#fig-governance}

Some examples of data governance across different sectors include:

16 changes: 10 additions & 6 deletions embedded_ml.qmd
@@ -22,9 +22,9 @@ Before delving into the intricacies of TinyML, it's crucial to grasp the distinc

## Introduction

ML is rapidly evolving, with new paradigms emerging that are reshaping how these algorithms are developed, trained, and deployed. In particular, the area of embedded machine learning is experiencing significant innovation, driven by the proliferation of smart sensors, edge devices, and microcontrollers. This chapter explores the landscape of embedded machine learning, covering the key approaches of Cloud ML, Edge ML, and TinyML.
ML is rapidly evolving, with new paradigms emerging that are reshaping how these algorithms are developed, trained, and deployed. In particular, the area of embedded machine learning is experiencing significant innovation, driven by the proliferation of smart sensors, edge devices, and microcontrollers. This chapter explores the landscape of embedded machine learning, covering the key approaches of Cloud ML, Edge ML, and TinyML (@fig-cloud-edge-tinyml-comparison).

![Cloud vs. Edge vs. TinyML: The Spectrum of Distributed Intelligence](images/cloud-edge-tiny.png)
![Cloud vs. Edge vs. TinyML: The Spectrum of Distributed Intelligence](images/cloud-edge-tiny.png){#fig-cloud-edge-tinyml-comparison}

We begin by outlining the features or characteristics, benefits, challenges, and use cases for each embedded ML variant. This provides context on where these technologies do well and where they face limitations. We then bring all three approaches together into a comparative analysis, evaluating them across critical parameters like latency, privacy, computational demands, and more. This side-by-side perspective highlights the unique strengths and tradeoffs involved in selecting among these strategies.

@@ -38,11 +38,11 @@ By the end of this multipronged exploration of embedded ML, you will possess the

Cloud ML is a specialized branch of the broader machine learning field that operates within cloud computing environments. It offers a virtual platform for the development, training, and deployment of machine learning models, providing both flexibility and scalability.

At its foundation, Cloud ML utilizes a powerful blend of high-capacity servers, expansive storage solutions, and robust networking architectures, all located in data centers around the world. This setup centralizes computational resources, simplifying the management and scaling of machine learning projects.
At its foundation, Cloud ML utilizes a powerful blend of high-capacity servers, expansive storage solutions, and robust networking architectures, all located in data centers around the world (@fig-cloudml-example). This setup centralizes computational resources, simplifying the management and scaling of machine learning projects.

The cloud environment excels in data processing and model training, designed to manage large data volumes and complex computations. Models crafted in Cloud ML can leverage vast amounts of data, processed and analyzed centrally, thereby enhancing the model's learning and predictive performance.

![Cloud ML Example: Google Tensor Pods (Source: [InfoWorld](https://www.infoworld.com/article/3197331/googles-new-tpus-are-here-to-accelerate-ai-training.html))](images/imgs_embedded_ml/cloud_ml_tpu.jpg)
![Cloud ML Example: Cloud TPU accelerator supercomputers in a Google data center (Source: [Google](https://blog.google/technology/ai/google-gemini-ai/#scalable-efficient))](images/imgs_embedded_ml/cloud_ml_tpu.jpg){#fig-cloudml-example}

### Benefits

@@ -82,12 +82,14 @@ Edge Machine Learning (Edge ML) is the practice of running machine learning algo

**Decentralized Data Processing**

In Edge ML, data processing happens in a decentralized fashion. Instead of sending data to remote servers, the data is processed locally on devices like smartphones, tablets, or IoT devices. This local processing allows devices to make quick decisions based on the data they collect, without having to rely heavily on a central server's resources. This decentralization is particularly important in real-time applications where even a slight delay can have significant consequences.
In Edge ML, data processing happens in a decentralized fashion. Instead of sending data to remote servers, the data is processed locally on devices like smartphones, tablets, or IoT devices (@fig-edgeml-example). This local processing allows devices to make quick decisions based on the data they collect, without having to rely heavily on a central server's resources. This decentralization is particularly important in real-time applications where even a slight delay can have significant consequences.

**Local Data Storage and Computation**

Local data storage and computation are key features of Edge ML. This setup ensures that data can be stored and analyzed directly on the devices, thereby maintaining the privacy of the data and reducing the need for constant internet connectivity. Moreover, this often leads to more efficient computation, as data doesn't have to travel long distances, and computations are performed with a more nuanced understanding of the local context, which can sometimes result in more insightful analyses.

![Edge ML Example: Data is processed locally on Internet of Things (IoT) devices (Source: [Edge Impulse](https://docs.edgeimpulse.com/docs/concepts/what-is-edge-machine-learning))](images/imgs_embedded_ml/edge_ml_iot.jpg){#fig-edgeml-example}
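A minimal sketch of this decentralized pattern is a detector that runs entirely on-device and only flags the rare events worth reporting upstream. The rolling z-score test, window size, and threshold below are illustrative assumptions, not a prescribed Edge ML algorithm:

```python
from collections import deque

class EdgeAnomalyDetector:
    """Toy sketch of decentralized processing: a rolling z-score test runs
    entirely on-device; only flagged readings would be uploaded.
    Window size and threshold are illustrative assumptions."""
    def __init__(self, window=50, threshold=3.0):
        self.buffer = deque(maxlen=window)
        self.threshold = threshold

    def process(self, reading):
        flagged = False
        if len(self.buffer) >= 10:  # wait for a minimal local history
            mean = sum(self.buffer) / len(self.buffer)
            var = sum((x - mean) ** 2 for x in self.buffer) / len(self.buffer)
            std = var ** 0.5 or 1.0  # guard against zero variance
            flagged = abs(reading - mean) / std > self.threshold
        self.buffer.append(reading)
        return flagged  # True -> worth sending a summary upstream

det = EdgeAnomalyDetector()
events = [det.process(x) for x in [1.0] * 30 + [50.0]]
print(events[-1])  # True
```

Because the raw readings never leave the device, only the occasional flagged event incurs network cost, which is the latency and privacy advantage described above.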

### Benefits

**Reduced Latency**
@@ -140,7 +142,7 @@ The applicability of Edge ML is vast and not limited to these examples. Various

**Definition of TinyML**

TinyML sits at the crossroads of embedded systems and machine learning, representing a burgeoning field that brings smart algorithms directly to tiny microcontrollers and sensors. These microcontrollers operate under severe resource constraints, particularly in terms of memory, storage, and computational power.
TinyML sits at the crossroads of embedded systems and machine learning, representing a burgeoning field that brings smart algorithms directly to tiny microcontrollers and sensors. These microcontrollers operate under severe resource constraints, particularly in terms of memory, storage, and computational power (@fig-tinyml-example shows an example TinyML kit).

**On-Device Machine Learning**

@@ -150,6 +152,8 @@ In TinyML, the focus is on on-device machine learning. This means that machine l

TinyML excels in low-power and resource-constrained settings. These environments require solutions that are highly optimized to function within the available resources. TinyML meets this need through specialized algorithms and models designed to deliver decent performance while consuming minimal energy, thus ensuring extended operational periods, even in battery-powered devices.

![TinyML Example: (Left) A TinyML kit that includes an Arduino Nano 33 BLE Sense, an OV7675 camera module, and a TinyML shield. (Right) The Nano 33 BLE includes a host of onboard integrated sensors, a Bluetooth Low Energy module, and an Arm Cortex-M microcontroller that can run neural-network models using TensorFlow Lite for Microcontrollers. (Source: [Widening Access to Applied Machine Learning with TinyML](https://arxiv.org/pdf/2106.04008.pdf))](images/imgs_embedded_ml/tiny_ml.jpg){#fig-tinyml-example}
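One core trick behind such optimized models is quantization: storing weights as 8-bit integers instead of 32-bit floats. The sketch below is a simplified per-tensor symmetric variant, an illustration of the idea rather than the exact scheme used by TensorFlow Lite for Microcontrollers:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization sketch: map float32 weights
    to int8 plus one scale factor, cutting memory roughly 4x.
    Simplified for illustration; real toolchains add per-channel scales,
    zero points, and calibration."""
    scale = np.max(np.abs(weights)) / 127.0 or 1.0  # guard all-zero tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(1000).astype(np.float32)
q, scale = quantize_int8(w)
print(w.nbytes, q.nbytes)  # 4000 1000 -> 4x smaller
```

The reconstruction error per weight is bounded by the scale factor, a trade of a small amount of precision for the large memory and energy savings that make battery-powered inference feasible.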

### Benefits

**Extremely Low Latency**
