diff --git a/contents/about.qmd b/contents/about.qmd index 5021926d..8d06c17a 100644 --- a/contents/about.qmd +++ b/contents/about.qmd @@ -6,30 +6,28 @@ comments: false ## Overview -Welcome to this collaborative project initiated by the CS249r Machine Learning Systems class at Harvard University. Our goal is to make this book a community resource that assists educators and learners in understanding ML systems. The book will be regularly updated to reflect new insights into ML systems and effective teaching methods. +Welcome to this collaborative textbook, developed as part of the CS249r Machine Learning Systems class at Harvard University. Our goal is to provide a comprehensive resource for educators and students seeking to understand machine learning systems. This book is continually updated to incorporate the latest insights and effective teaching strategies. -## Topics Explored - -This book offers a comprehensive look at various aspects of machine learning systems. We cover the entire end-to-end ML systems workflow, starting with fundamental concepts and progressing through data engineering, AI frameworks, and model training. +## What's Inside the Book -You'll learn about optimizing models for efficiency, deploying AI on various hardware platforms, and benchmarking performance. The book also explores more advanced topics like security, privacy, responsible and sustainable AI, robust and generative AI, and the social impact of AI. By the end, you'll have a solid foundation and practical insights into both the technical and ethical dimensions of machine learning. +We explore the technical foundations of machine learning systems, the challenges of building and deploying these systems across the computing continuum, and the vast array of applications they enable. A unique aspect of this book is its function as a conduit to seminal scholarly works and academic research papers, aimed at enriching the reader's understanding and encouraging deeper exploration of the subject. This approach seeks to bridge the gap between pedagogical materials and cutting-edge research trends, offering a comprehensive guide that is in step with the evolving field of applied machine learning. -By the time you finish this book, we hope that you'll have a foundational understanding of machine learning and its applications. You'll also learn about real-world implementations of machine learning systems and gain practical experience through project-based labs and assignments. - -## Who Should Read This +To improve the learning experience, we have included a variety of supplementary materials. Throughout the book, you will find slides that summarize key concepts, videos that provide in-depth explanations and demonstrations, exercises that reinforce your understanding, and labs that offer hands-on experience with the tools and techniques discussed. These additional resources are designed to cater to different learning styles and help you gain a deeper, more practical understanding of the subject matter. -This book is tailored for individuals at various stages in their interaction with machine learning systems. It starts with the fundamentals and progresses to more advanced topics pertinent to the ML community and broader research areas. The most relevant audiences include: +## Topics Explored -* **Students in Computer Science and Electrical Engineering:** Senior and graduate students in these fields will find this book invaluable. 
It introduces the techniques used in designing and building ML systems, focusing on fundamentals rather than depth—typically the focus of classroom instruction. This book aims to provide the necessary background and context, enabling instructors to delve deeper into advanced topics. An important aspect is the end-to-end focus, often overlooked in traditional curricula. +This textbook offers a comprehensive exploration of various aspects of machine learning systems, covering the entire end-to-end workflow. Starting with foundational concepts, it progresses through essential areas such as data engineering, AI frameworks, and model training. -* **Systems Engineers:** For engineers, this book serves as a guide to understanding the challenges of intelligent applications, especially on resource-constrained ML platforms. It covers the conceptual framework and practical components that constitute an ML system, extending beyond specific areas you might specialize in at your job. -* **Researchers and Academics:** Researchers will find that this book addresses the unique challenges of running machine learning algorithms on diverse platforms. Efficiency is becoming increasingly important; understanding algorithms alone is not sufficient, as a deeper understanding of systems is necessary to build more efficient models. For researchers, the book cites seminal papers, guiding you towards foundational works that have shaped the field and drawing connections between various areas with significant implications for your work. +Readers will gain insights into optimizing models for efficiency, deploying AI across different hardware platforms, and benchmarking performance. The book also delves into advanced topics, including security, privacy, responsible and sustainable AI, robust AI, and generative AI. Additionally, it examines the social impact of AI, concluding with an emphasis on the positive contributions AI can make to society. ## Key Learning Outcomes Readers will acquire skills in training and deploying deep neural network models on various platforms, along with understanding the broader challenges involved in their design, development, and deployment. Specifically, after completing this book, learners will be able to: +::: {.callout-tip} + 1. Explain core concepts and their relevance to AI systems. 2. Describe the fundamental components and architecture of AI systems. @@ -50,6 +48,8 @@ Readers will acquire skills in training and deploying deep neural network models 10. Critically assess the ethical implications and societal impacts of AI systems. +::: + ## Prerequisites for Readers * **Basic Programming Skills:** We recommend that you have some prior programming experience, ideally in Python. A grasp of variables, data types, and control structures will make it easier to engage with the book. @@ -65,3 +65,128 @@ Readers will acquire skills in training and deploying deep neural network models * **Resource Availability:** For the hands-on aspects, you'll need a computer with Python and the relevant libraries installed. Optional access to development boards or specific hardware will also be beneficial for experimenting with machine learning model deployment.
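+For the hands-on aspects, a quick sanity check such as the sketch below can confirm that your environment is ready. This is a minimal illustration; the library names are common choices for this book's exercises, not a required list.
+
+```python
+# Minimal environment check (illustrative; adjust the library list as needed).
+import importlib
+import sys
+
+print(f"Python {sys.version.split()[0]}")
+
+for lib in ("numpy", "torch", "tensorflow"):
+    try:
+        module = importlib.import_module(lib)
+        print(f"{lib}: {getattr(module, '__version__', 'installed')}")
+    except ImportError:
+        print(f"{lib}: not installed")
+```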
By meeting these prerequisites, you'll be well-positioned to deepen your understanding of machine learning systems, engage in coding exercises, and even implement practical applications on various devices. + +## Who Should Read This + +This book is designed for individuals at different stages of their journey with machine learning systems, from beginners to those more advanced in the field. It introduces fundamental concepts and progresses to complex topics relevant to the machine learning community and broader research areas. The key audiences for this book include: + +* **Students in Computer Science and Electrical Engineering:** Senior and graduate students will find this book particularly valuable. It introduces the techniques essential for designing and building ML systems, focusing on foundational knowledge rather than the exhaustive depth typically reserved for classroom instruction. This book will provide the necessary background and context, enabling instructors to explore advanced topics more deeply. An essential feature is its end-to-end perspective, which is often overlooked in traditional curricula. + +* **Systems Engineers:** This book serves as a guide for engineers seeking to understand the complexities of intelligent systems and applications, particularly involving ML. It encompasses the conceptual frameworks and practical components that make up an ML system, extending beyond the specific areas you might encounter in your professional role. + +* **Researchers and Academics:** For researchers, this book addresses the distinct challenges of executing machine learning algorithms across diverse platforms. As efficiency gains importance, a robust understanding of systems, beyond algorithms alone, is crucial for developing more efficient models. The book references seminal papers, directing researchers to works that have influenced the field and establishing connections between various areas with significant implications for their research. + +## How to Navigate This Book + +To get the most out of this book, we recommend a structured learning approach that leverages the various resources provided. Each chapter includes slides, videos, exercises, and labs to cater to different learning styles and reinforce your understanding. + +1. **Fundamentals (Chapters 1-3):** Start by building a strong foundation with the initial chapters, which provide an introduction to AI and cover core topics like AI systems and deep learning. + +2. **Workflow (Chapters 4-6):** With that foundation, move on to the chapters focused on practical aspects of the AI model building process like workflows, data engineering, and frameworks. + +3. **Training (Chapters 7-10):** These chapters offer insights into effectively training AI models, including techniques for efficiency, optimizations, and acceleration. + +4. **Deployment (Chapters 11-13):** Learn about deploying AI on devices and monitoring their operation through methods like benchmarking, on-device learning, and MLOps. + +5. **Advanced Topics (Chapters 14-18):** Critically examine topics like security, privacy, ethics, sustainability, robustness, and generative AI. + +6. **Social Impact (Chapter 19):** Explore the positive applications and potential of AI for societal good. + +7. **Conclusion (Chapter 20):** Reflect on the key takeaways and future directions in AI systems. + +While the book is designed for progressive learning, we encourage an interconnected learning approach that allows you to navigate chapters based on your interests and needs.
Throughout the book, you'll find case studies and hands-on exercises that help you relate theory to real-world applications. We also recommend participating in forums and groups to engage in [discussions](https://github.com/harvard-edge/cs249r_book/discussions), debate concepts, and share insights with fellow learners. Regularly revisiting chapters can help reinforce your learning and offer new perspectives on the concepts covered. By adopting this structured yet flexible approach and actively engaging with the content and the community, you'll embark on a fulfilling and enriching learning experience that maximizes your understanding. + +## Chapter-by-Chapter Insights + +Here's a closer look at what each chapter covers. We have structured the book into six main sections: Fundamentals, Workflow, Training, Deployment, Advanced Topics, and Impact. These sections closely reflect the major components of a typical machine learning pipeline, from understanding the basic concepts to deploying and maintaining AI systems in real-world applications. By organizing the content in this manner, we aim to provide a logical progression that mirrors the actual process of developing and implementing AI systems. + +### Fundamentals + +In the Fundamentals section, we lay the groundwork for understanding AI. Rather than offering a thorough deep dive into the algorithms, we aim to introduce key concepts, provide an overview of machine learning systems, and survey the principles and algorithms of deep learning that power AI applications and their associated systems. This section equips you with the essential knowledge needed to grasp the subsequent chapters. + +1. **[Introduction:](./core/introduction/introduction.qmd)** This chapter sets the stage, providing an overview of AI and laying the groundwork for the chapters that follow. +2. **[ML Systems:](./core/ml_systems/ml_systems.qmd)** We introduce the basics of machine learning systems, the platforms where AI algorithms are widely applied. +3. **[Deep Learning Primer:](./core/dl_primer/dl_primer.qmd)** This chapter offers a brief introduction to the algorithms and principles that underpin AI applications in ML systems. + +### Workflow + +The Workflow section guides you through the practical aspects of building AI models. We break down the AI workflow, discuss data engineering best practices, and review popular AI frameworks. By the end of this section, you'll have a clear understanding of the steps involved in developing proficient AI applications and the tools available to streamline the process. + +4. **[AI Workflow:](./core/workflow/workflow.qmd)** This chapter breaks down the machine learning workflow, offering insights into the steps leading to proficient AI applications. +5. **[Data Engineering:](./core/data_engineering/data_engineering.qmd)** We focus on the importance of data in AI systems, discussing how to effectively manage and organize data. +6. **[AI Frameworks:](./core/frameworks/frameworks.qmd)** This chapter reviews different frameworks for developing machine learning models, guiding you in choosing the most suitable one for your projects. + +### Training + +In the Training section, we explore techniques for training efficient and reliable AI models. We cover strategies for achieving efficiency, model optimizations, and the role of specialized hardware in AI acceleration. This section empowers you with the knowledge to develop high-performing models that can be seamlessly integrated into AI systems. + +7. 
**[AI Training:](./core/training/training.qmd)** This chapter covers model training, exploring techniques for developing efficient and reliable models. +8. **[Efficient AI:](./core/efficient_ai/efficient_ai.qmd)** Here, we discuss strategies for achieving efficiency in AI applications, from computational resource optimization to performance enhancement. +9. **[Model Optimizations:](./core/optimizations/optimizations.qmd)** We explore various avenues for optimizing AI models for seamless integration into AI systems. +10. **[AI Acceleration:](./core/hw_acceleration/hw_acceleration.qmd)** We discuss the role of specialized hardware in enhancing the performance of AI systems. + +### Deployment + +The Deployment section focuses on the challenges and solutions for deploying AI models. We discuss benchmarking methods to evaluate AI system performance, techniques for on-device learning to improve efficiency and privacy, and the processes involved in ML operations. This section equips you with the skills to effectively deploy and maintain AI functionalities in AI systems. + +11. **[Benchmarking AI:](./core/benchmarking/benchmarking.qmd)** This chapter focuses on how to evaluate AI systems through systematic benchmarking methods. +12. **[On-Device Learning:](./core/ondevice_learning/ondevice_learning.qmd)** We explore techniques for localized learning, which enhances both efficiency and privacy. +13. **[ML Operations:](./core/ops/ops.qmd)** This chapter looks at the processes involved in the seamless integration, monitoring, and maintenance of AI functionalities. + +### Advanced Topics + +In the Advanced Topics section, we study the critical issues surrounding AI. We address privacy and security concerns, explore the ethical principles of responsible AI, discuss strategies for sustainable AI development, examine techniques for building robust AI models, and introduce the exciting field of generative AI. This section broadens your understanding of the complex landscape of AI and prepares you to navigate its challenges. + +14. **[Security & Privacy:](./core/privacy_security/privacy_security.qmd)** As AI becomes more ubiquitous, this chapter addresses the crucial aspects of privacy and security in AI systems. +15. **[Responsible AI:](./core/responsible_ai/responsible_ai.qmd)** We discuss the ethical principles guiding the responsible use of AI, focusing on fairness, accountability, and transparency. +16. **[Sustainable AI:](./core/sustainable_ai/sustainable_ai.qmd)** This chapter explores practices and strategies for sustainable AI, ensuring long-term viability and reduced environmental impact. +17. **[Robust AI:](./core/robust_ai/robust_ai.qmd)** We discuss techniques for developing reliable and robust AI models that can perform consistently across various conditions. +18. **[Generative AI:](./core/generative_ai/generative_ai.qmd)** This chapter explores the algorithms and techniques behind generative AI, opening avenues for innovation and creativity. + +### Social Impact + +The Impact section highlights the transformative potential of AI in various domains. We showcase real-world applications of TinyML in healthcare, agriculture, conservation, and other areas where AI is making a positive difference. This section inspires you to leverage the power of AI for societal good and to contribute to the development of impactful solutions. + +19. **[AI for Good:](./core/ai_for_good/ai_for_good.qmd)** We highlight positive applications of TinyML in areas like healthcare, agriculture, and conservation. 
+ +### Closing + +In the Closing section, we reflect on the key learnings from the book and look ahead to the future of AI. We synthesize the concepts covered, discuss emerging trends, and provide guidance on continuing your learning journey in this rapidly evolving field. This section leaves you with a comprehensive understanding of AI and the excitement to apply your knowledge in innovative ways. + +20. **[Conclusion:](./core/conclusion/conclusion.qmd)** The book concludes with a reflection on the key learnings and future directions in the field of AI. + +## Tailored Learning + +We understand that readers have diverse interests; some may wish to grasp the fundamentals, while others are eager to delve into advanced topics like hardware acceleration or AI ethics. To help you navigate the book more effectively, we've created a persona-based reading guide tailored to your specific interests and goals. This guide assists you in identifying the reader persona that best matches your interests. Each persona represents a distinct reader profile with specific objectives. By selecting the persona that resonates with you, you can focus on the chapters and sections most relevant to your needs. + ++------------------------+--------------------------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------------------------+ +| Persona | Description | Chapters | Focus | ++:=======================+:=========================================================================+:==============================================+:==========================================================================================================+ +| The TinyML Newbie | You are new to the field of TinyML and eager to learn the basics. | 1-3, 8, 9, 10, 12 | Understand the fundamentals, gain insights into efficient and optimized ML, | +| | | | and learn about on-device learning. | ++------------------------+--------------------------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------------------------+ +| The EdgeML Enthusiast | You have some TinyML knowledge and are interested in exploring | 1-3, 8, 9, 10, 12, 13 | Build a strong foundation, delve into the intricacies of efficient ML, | +| | the broader world of EdgeML. | | and explore the operational aspects of embedded systems. | ++------------------------+--------------------------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------------------------+ +| The Computer Visionary | You are fascinated by computer vision and its applications in TinyML | 1-3, 5, 8-10, 12, 13, 17, 20 | Start with the basics, explore data engineering, and study methods for optimizing ML | +| | and EdgeML. | | models. Learn about robustness and the future of ML systems. | ++------------------------+--------------------------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------------------------+ +| The Data Maestro | You are passionate about data and its crucial role in ML systems. 
| 1-5, 8-13 | Gain a comprehensive understanding of data's role in ML systems, explore the ML | +| | | | workflow, and dive into model optimization and deployment considerations. | ++------------------------+--------------------------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------------------------+ +| The Hardware Hero | You are excited about the hardware aspects of ML systems and how | 1-3, 6, 8-10, 12, 14, 17, 20 | Build a solid foundation in ML systems and frameworks, explore challenges of | +| | they impact model performance. | | optimizing models for efficiency, hardware-software co-design, and security aspects. | ++------------------------+--------------------------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------------------------+ +| The Sustainability | You are an advocate for sustainability and want to learn how to | 1-3, 8-10, 12, 15, 16, 20 | Begin with the fundamentals of ML systems and TinyML, explore model optimization | +| Champion | develop eco-friendly AI systems. | | techniques, and learn about responsible and sustainable AI practices. | ++------------------------+--------------------------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------------------------+ +| The AI Ethicist | You are concerned about the ethical implications of AI and want to | 1-3, 5, 7, 12, 14-16, 19, 20 | Gain insights into the ethical considerations surrounding AI, including fairness, | +| | ensure responsible development and deployment. | | privacy, sustainability, and responsible development practices. | ++------------------------+--------------------------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------------------------+ +| The Full-Stack ML | You are a seasoned ML expert and want to deepen your understanding | The entire book | Understand the end-to-end process of building and deploying ML systems, from data | +| Engineer | of the entire ML system stack. | | engineering and model optimization to hardware acceleration and ethical considerations. | ++------------------------+--------------------------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------------------------+ + +## Join the Community + +Learning in the fast-paced world of AI is a collaborative journey. We set out to nurture a vibrant community of learners, innovators, and contributors. As you explore the concepts and engage with the exercises, we encourage you to share your insights and experiences. Whether it's a novel approach, an interesting application, or a thought-provoking question, your contributions can enrich the learning ecosystem. Engage in discussions, offer and seek guidance, and collaborate on projects to foster a culture of mutual growth and learning. By sharing knowledge, you play an important role in fostering a globally connected, informed, and empowered community. 
diff --git a/contents/core/introduction/image.png b/contents/core/introduction/image.png new file mode 100644 index 00000000..309a9612 Binary files /dev/null and b/contents/core/introduction/image.png differ diff --git a/contents/core/introduction/images/png/alexnet_arch.png b/contents/core/introduction/images/png/alexnet_arch.png new file mode 100644 index 00000000..0ac147ac Binary files /dev/null and b/contents/core/introduction/images/png/alexnet_arch.png differ diff --git a/contents/core/introduction/images/png/book_pillars.png b/contents/core/introduction/images/png/book_pillars.png new file mode 100644 index 00000000..7949c0ca Binary files /dev/null and b/contents/core/introduction/images/png/book_pillars.png differ diff --git a/contents/core/introduction/images/png/farmbeats.png b/contents/core/introduction/images/png/farmbeats.png new file mode 100644 index 00000000..e22fc190 Binary files /dev/null and b/contents/core/introduction/images/png/farmbeats.png differ diff --git a/contents/core/introduction/images/png/hidden_debt.png b/contents/core/introduction/images/png/hidden_debt.png new file mode 100644 index 00000000..d171877b Binary files /dev/null and b/contents/core/introduction/images/png/hidden_debt.png differ diff --git a/contents/core/introduction/images/png/ml_lifecycle_overview.png b/contents/core/introduction/images/png/ml_lifecycle_overview.png new file mode 100644 index 00000000..53124511 Binary files /dev/null and b/contents/core/introduction/images/png/ml_lifecycle_overview.png differ diff --git a/contents/core/introduction/images/png/triangle.png b/contents/core/introduction/images/png/triangle.png new file mode 100644 index 00000000..3b2b9e56 Binary files /dev/null and b/contents/core/introduction/images/png/triangle.png differ diff --git a/contents/core/introduction/introduction.qmd b/contents/core/introduction/introduction.qmd index 79e0e122..67c16b18 100644 --- a/contents/core/introduction/introduction.qmd +++ b/contents/core/introduction/introduction.qmd @@ -6,156 +6,407 @@ bibliography: introduction.bib ![_DALL·E 3 Prompt: A detailed, rectangular, flat 2D illustration depicting a roadmap of a book's chapters on machine learning systems, set on a crisp, clean white background. The image features a winding road traveling through various symbolic landmarks. Each landmark represents a chapter topic: Introduction, ML Systems, Deep Learning, AI Workflow, Data Engineering, AI Frameworks, AI Training, Efficient AI, Model Optimizations, AI Acceleration, Benchmarking AI, On-Device Learning, Embedded AIOps, Security & Privacy, Responsible AI, Sustainable AI, AI for Good, Robust AI, Generative AI. The style is clean, modern, and flat, suitable for a technical book, with each landmark clearly labeled with its chapter title._](images/png/cover_introduction.png) -## Overview +## Why Machine Learning Systems Matter -In the early 1990s, [Mark Weiser](https://en.wikipedia.org/wiki/Mark_Weiser), a pioneering computer scientist, introduced the world to a revolutionary concept that would forever change how we interact with technology. This vision was succinctly captured in his seminal paper, "The Computer for the 21st Century" (see @fig-ubiquitous). Weiser envisioned a future where computing would be seamlessly integrated into our environments, becoming an invisible, integral part of daily life. +AI is everywhere. Consider your morning routine: You wake up to an AI-powered smart alarm that learned your sleep patterns. 
Your phone suggests your route to work, having learned from traffic patterns. During your commute, your music app automatically creates a playlist it thinks you'll enjoy. At work, your email client filters spam and prioritizes important messages. Throughout the day, your smartwatch monitors your activity, suggesting when to move or exercise. In the evening, your streaming service recommends shows you might like, while your smart home devices adjust lighting and temperature based on your learned preferences. + +But these everyday conveniences are just the beginning. AI is transforming our world in extraordinary ways. Today, AI systems detect early-stage cancers with unprecedented accuracy, predict and track extreme weather events to save lives, and accelerate drug discovery by simulating millions of molecular interactions. Autonomous vehicles navigate complex city streets while processing real-time sensor data from dozens of sources. Language models engage in sophisticated conversations, translate between hundreds of languages, and help scientists analyze vast research databases. In scientific laboratories, AI systems are making breakthrough discoveries---from predicting protein structures that unlock new medical treatments to identifying promising materials for next-generation solar cells and batteries. Even in creative fields, AI collaborates with artists and musicians to explore new forms of expression, pushing the boundaries of human creativity. + +This isn't science fiction---it's the reality of how artificial intelligence, specifically machine learning systems, has become woven into the fabric of our daily lives. In the early 1990s, [Mark Weiser](https://en.wikipedia.org/wiki/Mark_Weiser), a pioneering computer scientist, introduced the world to a revolutionary concept that would forever change how we interact with technology. This vision was succinctly captured in his seminal paper, "The Computer for the 21st Century" (see @fig-ubiquitous). Weiser envisioned a future where computing would be seamlessly integrated into our environments, becoming an invisible, integral part of daily life. ![Ubiquitous computing as envisioned by Mark Weiser.](images/png/21st_computer.png){#fig-ubiquitous width=50%} -He termed this concept "ubiquitous computing," promising a world where technology would serve us without demanding our constant attention or interaction. Fast forward to today, and we find ourselves on the cusp of realizing Weiser's vision, thanks to the advent and proliferation of machine learning systems. +He termed this concept "ubiquitous computing," promising a world where technology would serve us without demanding our constant attention or interaction. Today, we find ourselves living in Weiser's envisioned future, largely enabled by machine learning systems. The true essence of his vision—creating an intelligent environment that can anticipate our needs and act on our behalf—has become reality through the development and deployment of ML systems that span entire ecosystems, from powerful cloud data centers to edge devices to the tiniest IoT sensors. + +Yet most of us rarely think about the complex systems that make this possible. Behind each of these seemingly simple interactions lies a sophisticated infrastructure of data, algorithms, and computing resources working together. Understanding how these systems work—their capabilities, limitations, and requirements—has become increasingly critical as they become more integrated into our world. 
+ +To appreciate the magnitude of this transformation and the complexity of modern machine learning systems, we need to understand how we got here. The journey from early artificial intelligence to today's ubiquitous ML systems is a story of not just technological evolution, but of changing perspectives on what's possible and what's necessary to make AI practical and reliable. + +## The Evolution of AI + +The evolution of AI, depicted in the timeline shown in @fig-ai-timeline, highlights key milestones such as the development of the "perceptron"[^defn-perceptron] in 1957 by Frank Rosenblatt, a foundational element for modern neural networks. Imagine walking into a computer lab in 1965. You'd find room-sized mainframes running programs that could prove basic mathematical theorems or play simple games like tic-tac-toe. These early artificial intelligence systems, while groundbreaking for their time, were a far cry from today's machine learning systems that can detect cancer in medical images or understand human speech. The timeline shows the progression from early innovations like the ELIZA chatbot in 1966, to significant breakthroughs such as IBM's Deep Blue defeating chess champion Garry Kasparov in 1997. More recent advancements include the introduction of OpenAI's GPT-3 in 2020 and GPT-4 in 2023, demonstrating the dramatic evolution and increasing complexity of AI systems over the decades. + +[^defn-perceptron]: The first artificial neural network—a simple model that could learn to classify visual patterns, similar to a single neuron making a yes/no decision based on its inputs. + +![Milestones in AI from 1950 to 2020. Source: IEEE Spectrum](https://spectrum.ieee.org/media-library/a-chart-of-milestones-in-ai-from-1950-to-2020.png?id=27547255){#fig-ai-timeline} + +Let's explore how we got here. + +### Symbolic AI (1956-1974) + +The story of machine learning begins at the historic Dartmouth Conference in 1956, where pioneers like John McCarthy, Marvin Minsky, and Claude Shannon first coined the term "artificial intelligence." Their approach was based on a compelling idea: intelligence could be reduced to symbol manipulation. Consider Daniel Bobrow's STUDENT system from 1964, one of the first AI programs that could solve algebra word problems: + +::: {.callout-note} +### Example: STUDENT (1964) + +``` +Problem: "If the number of customers Tom gets is twice the +square of 20% of the number of advertisements he runs, and +the number of advertisements is 45, what is the number of +customers Tom gets?" + +STUDENT would: + +1. Parse the English text +2. Convert it to algebraic equations +3. Solve the equation: n = 2(0.2 × 45)² +4. Provide the answer: 162 customers +``` +::: + +Early AI programs like STUDENT suffered from a fundamental limitation: they could only handle inputs that exactly matched their pre-programmed patterns and rules. Imagine a language translator that only works when sentences follow perfect grammatical structure---even slight variations like changing word order, using synonyms, or natural speech patterns would cause such a system to fail. This "brittleness" meant that while these programs could appear intelligent when handling the very specific cases they were designed for, they would break down completely when faced with even minor variations or real-world complexity. 
This limitation wasn't just a technical inconvenience—it revealed a deeper problem with rule-based approaches to AI: they couldn't genuinely understand or generalize from their programming; they could only match and manipulate patterns exactly as specified. + +### Expert Systems (1970s-1980s) + +By the mid-1970s, researchers realized that general AI was too ambitious. Instead, they focused on capturing human expert knowledge in specific domains. MYCIN, developed at Stanford, was one of the first large-scale expert systems designed to diagnose blood infections: + +::: {.callout-note} +### Example: MYCIN (1976) +``` +Rule Example from MYCIN: +IF + The infection is primary-bacteremia + The site of the culture is one of the sterile sites + The suspected portal of entry is the gastrointestinal tract +THEN + There is suggestive evidence (0.7) that infection is bacteroid +``` +::: + +While MYCIN represented a major advance in medical AI with its 600 expert rules for diagnosing blood infections, it revealed fundamental challenges that still plague ML today. Getting domain knowledge from human experts and converting it into precise rules proved incredibly time-consuming and difficult—doctors often couldn't explain exactly how they made decisions. MYCIN struggled with uncertain or incomplete information, unlike human doctors who could make educated guesses. Perhaps most importantly, maintaining and updating the rule base became exponentially more complex as MYCIN grew—adding new rules often conflicted with existing ones, and medical knowledge itself kept evolving. These same challenges of knowledge capture, uncertainty handling, and maintenance remain central concerns in modern machine learning, even though we now use different technical approaches to address them. + +### Statistical Learning: A Paradigm Shift (1990s) + +The 1990s marked a radical transformation in artificial intelligence as the field moved away from hand-coded rules toward statistical learning approaches. This wasn't a simple choice—it was driven by three converging factors that made statistical methods both possible and powerful. The digital revolution meant massive amounts of data were suddenly available to train the algorithms. **Moore's Law**[^defn-mooreslaw] delivered the computational power needed to process this data effectively. And researchers developed new algorithms like Support Vector Machines and improved neural networks that could actually learn patterns from this data rather than following pre-programmed rules. This combination fundamentally changed how we built AI: instead of trying to encode human knowledge directly, we could now let machines discover patterns automatically from examples, leading to more robust and adaptable AI. + +[^defn-mooreslaw]: The observation made by Intel co-founder Gordon Moore in 1965 that the number of transistors on a microchip doubles approximately every two years, while the cost halves. This exponential growth in computing power has been a key driver of advances in machine learning, though the pace has begun to slow in recent years. 
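+To make this shift concrete, the sketch below implements a toy Naive Bayes spam scorer in Python. It is purely illustrative: the word counts and class priors are invented, and a real filter would treat tokenization, smoothing, and decision thresholds far more carefully.
+
+```python
+import math
+
+# Invented word counts from a tiny hypothetical corpus of labeled emails.
+spam_counts = {"viagra": 40, "winner": 30, "meeting": 5, "report": 2}
+ham_counts = {"viagra": 1, "winner": 2, "meeting": 60, "report": 50}
+total_spam = sum(spam_counts.values())
+total_ham = sum(ham_counts.values())
+vocab_size = len(set(spam_counts) | set(ham_counts))
+p_spam, p_ham = 0.4, 0.6  # assumed class priors
+
+def log_score(words, counts, total, prior):
+    # log P(class) + sum of log P(word|class), with add-one smoothing
+    # so that unseen words do not zero out the whole product.
+    score = math.log(prior)
+    for word in words:
+        score += math.log((counts.get(word, 0) + 1) / (total + vocab_size))
+    return score
+
+def classify(email):
+    words = email.lower().split()
+    spam = log_score(words, spam_counts, total_spam, p_spam)
+    ham = log_score(words, ham_counts, total_ham, p_ham)
+    return "spam" if spam > ham else "ham"
+
+print(classify("winner winner viagra"))      # -> spam
+print(classify("quarterly report meeting"))  # -> ham
+```
+
+Notice that nothing in the code hand-labels "viagra" or "winner" as suspicious; the model learns those associations from the counts. The example below contrasts this with the rule-based filters that preceded it.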
+ +Consider how email spam filtering evolved: + +::: {.callout-note} +### Example: Early Spam Detection Systems + +``` +Rule-based (1980s): +IF contains("viagra") OR contains("winner") THEN spam + +Statistical (1990s): +P(word|spam) = (frequency of word in spam emails) / (total words in spam emails) +Combined using Naive Bayes: +P(spam|email) ∝ P(spam) × ∏ P(word|spam) +``` +::: + +The move to statistical approaches fundamentally changed how we think about building AI by introducing three core concepts that remain important today. First, the quality and quantity of training data became as important as the algorithms themselves---AI could only learn patterns that were present in its training examples. Second, we needed rigorous ways to evaluate how well AI actually performed, leading to metrics that could measure success and compare different approaches. Third, we discovered an inherent tension between precision (being right when we make a prediction) and recall (catching all the cases we should find), forcing designers to make explicit trade-offs based on their application's needs. For example, a spam filter might tolerate some spam to avoid blocking important emails, while medical diagnosis might need to catch every potential case even if it means more false alarms. + +@tbl-ai-evolution-strengths encapsulates the evolutionary journey of AI approaches we have discussed so far, highlighting the key strengths and capabilities that emerged with each new paradigm. As we move from left to right across the table, we can observe several important trends. We will talk about shallow and deep learning next, but it is useful to understand the trade-offs between the approaches we have covered so far. + ++---------------------+--------------------------+--------------------------+--------------------------+-------------------------------+ +| Aspect | Symbolic AI | Expert Systems | Statistical Learning | Shallow / Deep Learning | ++:====================+:=========================+:=========================+:=========================+:==============================+ +| Key Strength | Logical reasoning | Domain expertise | Versatility | Pattern recognition | ++---------------------+--------------------------+--------------------------+--------------------------+-------------------------------+ +| Best Use Case | Well-defined, rule-based | Specific domain problems | Various structured data | Complex, unstructured data | +| | problems | | problems | problems | ++---------------------+--------------------------+--------------------------+--------------------------+-------------------------------+ +| Data Handling | Minimal data needed | Domain knowledge-based | Moderate data required | Large-scale data processing | ++---------------------+--------------------------+--------------------------+--------------------------+-------------------------------+ +| Adaptability | Fixed rules | Domain-specific | Adaptable to various | Highly adaptable to diverse | +| | | adaptability | domains | tasks | ++---------------------+--------------------------+--------------------------+--------------------------+-------------------------------+ +| Problem Complexity | Simple, logic-based | Complicated, domain- | Complex, structured | Highly complex, unstructured | +| | | specific | | | ++---------------------+--------------------------+--------------------------+--------------------------+-------------------------------+ + +: Evolution of AI - Key Positive Aspects {#tbl-ai-evolution-strengths .hover .striped} + +The table serves as a bridge 
between the early approaches we've discussed and the more recent developments in shallow and deep learning that we'll explore next. It sets the stage for understanding why certain approaches gained prominence in different eras and how each new paradigm built upon and addressed the limitations of its predecessors. Moreover, it illustrates how the strengths of earlier approaches continue to influence and enhance modern AI techniques, particularly in the era of foundation models. + +### Shallow Learning (2000s) + +The 2000s marked a fascinating period in machine learning history that we now call the "shallow learning" era. To understand why it's "shallow," imagine building a house: deep learning (which came later) is like having multiple construction crews working at different levels simultaneously, each crew learning from the work of crews below them. In contrast, shallow learning typically had just one or two levels of processing---like having just a foundation crew and a framing crew. + +During this time, several powerful algorithms dominated the machine learning landscape. Each brought unique strengths to different problems: Decision trees provided interpretable results by making choices much like a flowchart. K-nearest neighbors made predictions by finding similar examples in past data, like asking your most experienced neighbors for advice. Linear and logistic regression offered straightforward, interpretable models that worked well for many real-world problems. Support Vector Machines (SVMs) excelled at finding complex boundaries between categories using the "kernel trick"---imagine being able to untangle a bowl of spaghetti into straight lines by lifting it into a higher dimension. +These algorithms formed the foundation of practical machine learning: they had strong mathematical foundations (researchers could prove why they worked), they performed well even with limited data, they were computationally efficient, and they produced reliable, reproducible results. + +Consider a typical computer vision solution from 2005: + +::: {.callout-note} +### Example: Traditional Computer Vision Pipeline +``` +1. Manual Feature Extraction + - SIFT (Scale-Invariant Feature Transform) + - HOG (Histogram of Oriented Gradients) + - Gabor filters +2. Feature Selection/Engineering +3. "Shallow" Learning Model (e.g., SVM) +4. Post-processing +``` +::: + +What made this era distinct was its hybrid approach: human-engineered features combined with statistical learning. + +Take the example of face detection, where the Viola-Jones algorithm (2001) achieved real-time performance using simple rectangular features and a cascade of classifiers. This algorithm powered digital camera face detection for nearly a decade. + +### Deep Learning (2012-Present) + +While Support Vector Machines excelled at finding complex boundaries between categories using mathematical transformations, deep learning took a radically different approach inspired by the human brain's architecture. Deep learning is built from layers of artificial neurons, where each layer learns to transform its input data into increasingly abstract representations. Imagine processing an image of a cat: the first layer might learn to detect simple edges and contrasts, the next layer combines these into basic shapes and textures, another layer might recognize whiskers and pointy ears, and the final layers assemble these features into the concept of "cat." 
Unlike shallow learning methods that required humans to carefully engineer features, deep learning networks can automatically discover useful features directly from raw data. This ability to learn hierarchical representations—from simple to complex, concrete to abstract—is what makes deep learning "deep," and it turned out to be a remarkably powerful approach for handling complex, real-world data like images, speech, and text. + +In 2012, a deep neural network called AlexNet, shown in @fig-alexnet, achieved a breakthrough in the ImageNet competition that would transform the field of machine learning. The challenge was formidable: correctly classify 1.2 million high-resolution images into 1,000 different categories. While previous approaches struggled with error rates above 25%, AlexNet achieved a 15.3% error rate, dramatically outperforming all existing methods. + +![Deep neural network architecture for AlexNet.](./images/png/alexnet_arch.png){#fig-alexnet} + +The success of AlexNet wasn't just a technical achievement---it was a watershed moment that demonstrated the practical viability of deep learning. It showed that with sufficient data, computational power, and architectural innovations, neural networks could outperform hand-engineered features and shallow learning methods that had dominated the field for decades. This single result triggered an explosion of research and applications in deep learning that continues to this day. + +From this foundation, deep learning entered an era of unprecedented scale. By the late 2010s, companies like Google, Facebook, and OpenAI were training neural networks thousands of times larger than **AlexNet**[^defn-alexnet]. These massive models, often called "foundation models," took deep learning to new heights. GPT-3, released in 2020, contained 175 billion **parameters**[^defn-parameters]---imagine a student that could read through all of Wikipedia multiple times and learn patterns from every article. These models showed remarkable abilities: writing human-like text, engaging in conversation, generating images from descriptions, and even writing computer code. The key insight was simple but powerful: as we made neural networks bigger and fed them more data, they became capable of solving increasingly complex tasks. However, this scale brought unprecedented systems challenges: how do you efficiently train models that require thousands of GPUs working in parallel? How do you store and serve models that are hundreds of gigabytes in size? How do you handle the massive datasets needed for training? + +[^defn-alexnet]: A breakthrough deep neural network from 2012 that won the [ImageNet competition](https://www.image-net.org/challenges/LSVRC/) by a large margin and helped spark the deep learning revolution. + +[^defn-parameters]: Similar to how the brain's neural connections grow stronger as you learn a new skill, having more parameters generally means that the model can learn more complex patterns. + +The deep learning revolution of 2012 didn't emerge from nowhere---it was built on neural network research dating back to the 1950s. The story begins with Frank Rosenblatt's Perceptron in 1957, which captured the imagination of researchers by showing how a simple artificial neuron could learn to classify patterns. While it could only handle linearly separable problems—a limitation dramatically highlighted by Minsky and Papert's 1969 book "Perceptrons"—it introduced the fundamental concept of trainable neural networks. 
The 1980s brought further important breakthroughs: Rumelhart, Hinton, and Williams introduced backpropagation in 1986, providing a systematic way to train multi-layer networks, while Yann LeCun demonstrated its practical application in recognizing handwritten digits using **convolutional neural networks (CNNs)**[^defn-cnn]. + +[^defn-cnn]: A type of neural network specially designed for processing images, inspired by how the human visual system works. The "convolutional" part refers to how it scans images in small chunks, similar to how our eyes focus on different parts of a scene. + +:::{#vid-tl .callout-important} + +# Convolutional Network Demo from 1989 + +{{< video https://www.youtube.com/watch?v=FwFduRA_L6Q&ab_channel=YannLeCun >}} + +::: + +Yet these networks largely languished through the 1990s and 2000s, not because the ideas were wrong, but because they were ahead of their time---the field lacked three important ingredients: sufficient data to train complex networks, enough computational power to process this data, and the technical innovations needed to train very deep networks effectively. + +The field had to wait for the convergence of big data, better computing hardware, and algorithmic breakthroughs before deep learning's potential could be unlocked. This long gestation period helps explain why the 2012 ImageNet moment was less a sudden revolution and more the culmination of decades of accumulated research finally finding its moment. As we'll explore in the following sections, this evolution has led to two significant developments in the field. First, it has given rise to the field of machine learning systems engineering, a discipline that bridges the gap between theoretical advancements and practical implementation. Second, it has necessitated a more comprehensive definition of machine learning systems, one that encompasses not just algorithms, but also data and computing infrastructure. Today's challenges of scale echo many of the same fundamental questions about computation, data, and learning methods that researchers have grappled with since the field's inception, but now within a more complex and interconnected framework. + +## The Rise of ML Systems Engineering + +The story we've traced---from the early days of the Perceptron through the deep learning revolution---has largely been one of algorithmic breakthroughs. Each era brought new mathematical insights and modeling approaches that pushed the boundaries of what AI could achieve. But something important changed over the past decade: the success of AI systems became increasingly dependent not just on algorithmic innovations, but on sophisticated engineering. + +This shift mirrors the evolution of computer science and engineering in the late 1960s and early 1970s. During that period, as computing systems grew more complex, a new discipline emerged: Computer Engineering. This field bridged the gap between Electrical Engineering's hardware expertise and Computer Science's focus on algorithms and software. Computer Engineering arose because the challenges of designing and building complex computing systems required an integrated approach that neither discipline could fully address on its own. + +Today, we're witnessing a similar transition in the field of AI. 
While Computer Science continues to push the boundaries of ML algorithms and Electrical Engineering advances specialized AI hardware, neither discipline fully addresses the engineering principles needed to deploy, optimize, and sustain ML systems at scale. This gap highlights the need for a new discipline: Machine Learning Systems Engineering. While there is no established definition of this field today, it can be broadly defined as follows: + +> Machine Learning Systems Engineering is the discipline that focuses on the design, development, deployment, and maintenance of large-scale machine learning systems. It encompasses the entire lifecycle of ML applications, from data collection and preprocessing to model training, deployment, monitoring, and continuous improvement. MLSE integrates principles from software engineering, distributed systems, data engineering, and machine learning to create robust, scalable, and efficient AI systems that can operate reliably in real-world environments. + +Let's consider space exploration. While astronauts venture into new frontiers and explore the vast unknowns of the universe, their discoveries are only possible because of the complex engineering systems supporting them---the rockets that lift them into space, the life support systems that keep them alive, and the communication networks that keep them connected to Earth. Similarly, while AI researchers push the boundaries of what's possible with learning algorithms, their breakthroughs only become practical reality through careful systems engineering. Modern AI systems need robust infrastructure to collect and manage data, powerful computing systems to train models, and reliable deployment platforms to serve millions of users. + +This emergence of machine learning systems engineering as an important discipline reflects a broader reality: turning AI algorithms into real-world systems requires bridging the gap between theoretical possibilities and practical implementation. It's not enough to have a brilliant algorithm if you can't efficiently collect and process the data it needs, distribute its computation across hundreds of machines, serve it reliably to millions of users, or monitor its performance in production. + +Understanding this interplay between algorithms and engineering has become fundamental for modern AI practitioners. While researchers continue to push the boundaries of what's algorithmically possible, engineers are tackling the complex challenge of making these algorithms work reliably and efficiently in the real world. This brings us to a fundamental question: what exactly is a machine learning system, and what makes it different from traditional software systems? + +## Definition of an ML System + +There's no universally accepted, clear-cut textbook definition of a machine learning system. This ambiguity stems from the fact that different practitioners, researchers, and industries often refer to machine learning systems in varying contexts and with different scopes. Some might focus solely on the algorithmic aspects, while others might include the entire pipeline from data collection to model deployment. This loose usage of the term reflects the rapidly evolving and multidisciplinary nature of the field. + +Given this diversity of perspectives, it is important to establish a clear and comprehensive definition that encompasses all these aspects. 
In this textbook, we take a holistic approach to machine learning systems, considering not just the algorithms but also the entire ecosystem in which they operate. Therefore, we define a machine learning system as follows: + +> A machine learning system is an integrated computing system that consists of three essential elements: data, algorithms, and computing infrastructure, where data represents the input that controls the behavior of the algorithms that learn from that data, which in turn rely on underlying hardware and software infrastructure to execute the learning process (training) and/or the application of learned knowledge (inference or serving), and all together, these components enable the system to make predictions, generate content, or take actions based on learned patterns. + +The core of any machine learning system consists of three interrelated components, as illustrated in @fig-ai-triangle: Models/Algorithms, Data, and Computing Infrastructure. These components form a triangular dependency where each element fundamentally shapes the possibilities of the others. The model architecture dictates both the computational demands for training and inference, as well as the volume and structure of data required for effective learning. The data's scale and complexity influence what infrastructure is needed for storage and processing, while simultaneously determining which model architectures are feasible. The infrastructure capabilities establish practical limits on both model scale and data processing capacity, creating a framework within which the other components must operate. + +![Machine learning systems involve algorithms, data, and computation, all intertwined together.](images/png/triangle.png){#fig-ai-triangle} + +Each of these components serves a distinct but interconnected purpose: + +- **Algorithms:** Mathematical models and methods that learn patterns from data to make predictions or decisions. + +- **Data:** Processes and infrastructure for collecting, storing, processing, managing, and serving data for both training and inference. + +- **Computing:** Hardware and software infrastructure that enables efficient training, serving, and operation of models at scale. + +The interdependency of these components means no single element can function in isolation. The most sophisticated algorithm cannot learn without data or computing resources to run on. The largest datasets are useless without algorithms to extract patterns or infrastructure to process them. And the most powerful computing infrastructure serves no purpose without algorithms to execute or data to process. + +To illustrate these relationships, we can draw an analogy to space exploration. Algorithm developers are like astronauts---exploring new frontiers and making discoveries. Data science teams function like mission control specialists—ensuring the constant flow of critical information and resources needed to keep the mission running. Computing infrastructure engineers are like rocket engineers—designing and building the systems that make the mission possible. Just as a space mission requires the seamless integration of astronauts, mission control, and rocket systems, a machine learning system demands the careful orchestration of algorithms, data, and computing infrastructure. + +## The ML Systems Lifecycle + +Traditional software systems follow a predictable lifecycle where developers write explicit instructions for computers to execute. 
+
+## The ML Systems Lifecycle
+
+Traditional software systems follow a predictable lifecycle where developers write explicit instructions for computers to execute. These systems are built on decades of established software engineering practices. Version control systems maintain precise histories of code changes. Continuous integration and deployment pipelines automate testing and release processes. Static analysis tools measure code quality and identify potential issues. This infrastructure enables reliable development, testing, and deployment of software systems, following well-defined principles of software engineering.
+
+Machine learning systems represent a fundamental departure from this traditional paradigm. While traditional systems execute explicit programming logic, machine learning systems derive their behavior from patterns in data. This shift from code to data as the primary driver of system behavior introduces new complexities.
+
+As illustrated in @fig-ml_lifecycle_overview, the ML lifecycle consists of interconnected stages from data collection through model monitoring, with feedback loops for continuous improvement when performance degrades or models need enhancement.
+
+![The typical lifecycle of a machine learning system.](./images/png/ml_lifecycle_overview.png){#fig-ml_lifecycle_overview}
+
+Unlike source code, which changes only when developers modify it, data reflects the dynamic nature of the real world. Changes in data distributions can silently alter system behavior. Traditional software engineering tools, designed for deterministic code-based systems, prove insufficient for managing these data-dependent systems. For example, version control systems that excel at tracking discrete code changes struggle to manage large, evolving datasets. Testing frameworks designed for deterministic outputs must be adapted for probabilistic predictions. This data-dependent nature creates a more dynamic lifecycle, requiring continuous monitoring and adaptation to maintain system relevance as real-world data patterns evolve.
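+
+What might such monitoring look like in practice? The hypothetical sketch below compares each feature's live distribution against the training distribution with a two-sample Kolmogorov-Smirnov test. The synthetic data, the per-feature framing, and the significance threshold are all illustrative assumptions; production systems draw on a range of drift detectors.
+
+```python
+# A minimal drift-check sketch: flag features whose live distribution has
+# shifted away from the training distribution. The 0.01 threshold is an
+# illustrative choice, not a standard.
+import numpy as np
+from scipy.stats import ks_2samp
+
+def drifted_features(train: np.ndarray, live: np.ndarray, alpha: float = 0.01):
+    """Return indices of columns where a KS test rejects 'same distribution'."""
+    flagged = []
+    for j in range(train.shape[1]):
+        result = ks_2samp(train[:, j], live[:, j])
+        if result.pvalue < alpha:
+            flagged.append(j)
+    return flagged
+
+rng = np.random.default_rng(0)
+train = rng.normal(size=(5000, 3))
+live = train.copy()
+live[:, 2] += 0.5  # simulate a silent shift in one feature
+print(drifted_features(train, live))  # -> [2]
+```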
+
+Understanding the machine learning system lifecycle requires examining its distinct stages. Each stage presents unique requirements from both learning and infrastructure perspectives. This dual consideration---of learning needs and systems support---is critically important for building effective machine learning systems.
+
+However, the various stages of the ML lifecycle in production are not isolated; they are, in fact, deeply interconnected. This interconnectedness can create either virtuous or vicious cycles. In a virtuous cycle, high-quality data enables effective learning, robust infrastructure supports efficient processing, and well-engineered systems facilitate the collection of even better data. However, in a vicious cycle, poor data quality undermines learning, inadequate infrastructure hampers processing, and system limitations prevent the improvement of data collection---each problem compounds the others.
+
+## The Spectrum of ML Systems
+
+The complexity of managing machine learning systems becomes even more apparent when we consider the broad spectrum across which ML is deployed today. ML systems exist at vastly different scales and in diverse environments, each presenting unique challenges and constraints.
+
+At one end of the spectrum, we have cloud-based ML systems running in massive data centers. These systems, like large language models or recommendation engines, process petabytes of data and serve millions of users simultaneously. They can leverage virtually unlimited computing resources but must manage enormous operational complexity and costs.
+
+At the other end, we find TinyML systems running on microcontrollers and embedded devices. These systems must perform ML tasks with severe constraints on memory, computing power, and energy consumption. Imagine a smart home device, such as Alexa or Google Assistant, that must recognize voice commands using less power than an LED bulb, or a sensor that must detect anomalies while running on a battery for months or even years.
+
+Between these extremes, we find a rich variety of ML systems adapted for different contexts. Edge ML systems bring computation closer to data sources, reducing latency and bandwidth requirements while managing local computing resources. Mobile ML systems must balance sophisticated capabilities with battery life and processor limitations on smartphones and tablets. Enterprise ML systems often operate within specific business constraints, focusing on particular tasks while integrating with existing infrastructure. Some organizations employ hybrid approaches, distributing ML capabilities across multiple tiers to balance various requirements.
+
+## ML System Implications on the ML Lifecycle
+
+The diversity of ML systems across the spectrum represents a complex interplay of requirements, constraints, and trade-offs. These decisions fundamentally impact every stage of the ML lifecycle we discussed earlier, from data collection to continuous operation.
+
+Performance requirements often drive initial architectural decisions. Latency-sensitive applications, like autonomous vehicles or real-time fraud detection, might require edge or embedded architectures despite their resource constraints. Conversely, applications requiring massive computational power for training, such as large language models, naturally gravitate toward centralized cloud architectures. However, raw performance is just one consideration in a complex decision space.
+
+Resource management varies dramatically across architectures. Cloud systems must optimize for cost efficiency at scale---balancing expensive GPU clusters, storage systems, and network bandwidth. Edge systems face fixed resource limits and must carefully manage local compute and storage. Mobile and embedded systems operate under the strictest constraints, where every byte of memory and milliwatt of power matters. These resource considerations directly influence both model design and system architecture.
+
+Operational complexity increases with system distribution. While centralized cloud architectures benefit from mature deployment tools and managed services, edge and hybrid systems must handle the complexity of distributed system management. This complexity manifests throughout the ML lifecycle---from data collection and version control to model deployment and monitoring. As we discussed in our examination of technical debt, this operational complexity can compound over time if not carefully managed.
+
+Data considerations often introduce competing pressures. Privacy requirements or data sovereignty regulations might push toward edge or embedded architectures, while the need for large-scale training data might favor cloud approaches. The velocity and volume of data also influence architectural choices---real-time sensor data might require edge processing to manage bandwidth, while batch analytics might be better suited to cloud processing.
+
+Evolution and maintenance requirements must be considered from the start. Cloud architectures offer flexibility for system evolution but can incur significant ongoing costs. Edge and embedded systems might be harder to update but could offer lower operational overhead. The continuous cycle of ML systems we discussed earlier becomes particularly challenging in distributed architectures, where updating models and maintaining system health requires careful orchestration across multiple tiers.
+
+These trade-offs are rarely simple binary choices. Modern ML systems often adopt hybrid approaches, carefully balancing these considerations based on specific use cases and constraints. The key is understanding how these decisions will impact the system throughout its lifecycle, from initial development through continuous operation and evolution.
+
+### Emerging Trends
+
+We are just at the beginning. As machine learning systems continue to evolve, several key trends are reshaping the landscape of ML system design and deployment.
+
+The rise of agentic systems marks a profound evolution in ML systems. Traditional ML systems were primarily reactive---they made predictions or classifications based on input data. In contrast, agentic systems can take actions, learn from their outcomes, and adapt their behavior accordingly. These systems, exemplified by autonomous agents that can plan, reason, and execute complex tasks, introduce new architectural challenges. They require sophisticated frameworks for decision-making, safety constraints, and real-time interaction with their environment.
+
+Architectural evolution is being driven by new hardware and deployment patterns. Specialized AI accelerators are emerging across the spectrum---from powerful data center chips to efficient edge processors to tiny neural processing units in mobile devices. This heterogeneous computing landscape is enabling new architectural possibilities, such as dynamic model distribution across tiers based on computing capabilities and current conditions. The traditional boundaries between cloud, edge, and embedded systems are becoming increasingly fluid.
+
+Resource efficiency is gaining prominence as the environmental and economic costs of large-scale ML become more apparent. This has sparked innovation in model compression, efficient training techniques, and energy-aware computing. Future systems will likely need to balance the drive for more powerful models against growing sustainability concerns. This emphasis on efficiency is particularly relevant given our earlier discussion of technical debt and operational costs.
+
+System intelligence is moving toward more autonomous operation. Future ML systems will likely incorporate more sophisticated self-monitoring, automated resource management, and adaptive deployment strategies. This evolution builds upon the continuous cycle we discussed earlier, but with increased automation in handling data distribution shifts, model updates, and system optimization.
+
+Integration challenges are becoming more complex as ML systems interact with broader technology ecosystems. The need to integrate with existing software systems, handle diverse data sources, and operate across organizational boundaries is driving new approaches to system design. This integration complexity adds new dimensions to the technical debt considerations we explored earlier.
+
+These trends suggest that future ML systems will need to be increasingly adaptable and efficient while managing growing complexity. Understanding these directions is important for building systems that can evolve with the field while avoiding the accumulation of technical debt we discussed earlier.
+
+## Real-world Applications and Impact
+
+The ability to build and operationalize ML systems across various scales and environments has led to transformative changes across numerous sectors. This section examines how the theoretical concepts and practical considerations we have discussed manifest in tangible, impactful real-world applications.
+
+### Case Study: FarmBeats: Edge and Embedded ML for Agriculture
+
+FarmBeats, a project developed by Microsoft Research and shown in @fig-farmbeats-overview, represents a significant advancement in the application of machine learning to agriculture. This system aims to increase farm productivity and reduce costs by leveraging AI and IoT technologies. FarmBeats exemplifies how edge and embedded ML systems can be deployed in challenging, real-world environments to solve practical problems. By bringing ML capabilities directly to the farm, FarmBeats demonstrates the potential of distributed AI systems in transforming traditional industries.
+
+![Microsoft Farmbeats: AI, Edge & IoT for Agriculture.](./images/png/farmbeats.png){#fig-farmbeats-overview}
+
+**Data Aspects**
+
+The data ecosystem in FarmBeats is diverse and distributed. Sensors deployed across fields collect real-time data on soil moisture, temperature, and nutrient levels. Drones equipped with multispectral cameras capture high-resolution imagery of crops, providing insights into plant health and growth patterns. Weather stations contribute local climate data, while historical farming records offer context for long-term trends. The challenge lies not just in collecting this heterogeneous data, but in managing its flow from dispersed, often remote locations with limited connectivity. FarmBeats employs innovative data transmission techniques, such as using TV white spaces (unused broadcasting frequencies) to extend internet connectivity to far-flung sensors. This approach to data collection and transmission embodies the principles of edge computing we discussed earlier, where data processing begins at the source to reduce bandwidth requirements and enable real-time decision making.
+
+**Algorithm/Model Aspects**
+
+FarmBeats uses a variety of ML algorithms tailored to agricultural applications. For soil moisture prediction, it uses temporal neural networks that can capture the complex dynamics of water movement in soil. Computer vision algorithms process drone imagery to detect crop stress, pest infestations, and yield estimates. These models must be robust to noisy data and capable of operating with limited computational resources. Machine learning methods such as "transfer learning" allow models trained on data-rich farms to be adapted for use in areas with limited historical data. The system also incorporates ensemble methods that combine outputs from multiple algorithms to improve prediction accuracy and reliability. A key challenge FarmBeats addresses is model personalization---adapting general models to the specific conditions of individual farms, which may have unique soil compositions, microclimates, and farming practices.
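+
+The sketch below illustrates the transfer-learning idea in toy form: a model fit on a data-rich source farm becomes the starting point for a farm with only a handful of labeled readings. The data shapes, coefficients, and model choice are assumptions for illustration; this is not FarmBeats code.
+
+```python
+# Minimal transfer-learning sketch: warm-start a model for a data-poor
+# farm from one trained on a data-rich farm (synthetic, illustrative data).
+import numpy as np
+from sklearn.linear_model import SGDRegressor
+
+rng = np.random.default_rng(42)
+
+# Source farm: plentiful sensor history (features -> soil moisture).
+X_source = rng.normal(size=(10_000, 4))
+y_source = X_source @ np.array([0.5, -0.2, 0.1, 0.7]) + rng.normal(0, 0.1, 10_000)
+
+# Target farm: only 50 labeled readings, with slightly different dynamics.
+X_target = rng.normal(size=(50, 4))
+y_target = X_target @ np.array([0.6, -0.2, 0.1, 0.6]) + rng.normal(0, 0.1, 50)
+
+# Fit on the data-rich farm, then continue training on the small target
+# set with partial_fit instead of learning from scratch.
+model = SGDRegressor(random_state=0)
+model.fit(X_source, y_source)
+for _ in range(20):  # a few passes over the tiny target dataset
+    model.partial_fit(X_target, y_target)
+print("Target-farm R^2:", round(model.score(X_target, y_target), 3))
+```
+
+FarmBeats' actual models are far richer, but the principle is the same: knowledge learned where data is plentiful transfers to where it is scarce.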
+
+**Computing Infrastructure Aspects**
+
+FarmBeats exemplifies the edge computing paradigm we explored in our discussion of the ML system spectrum. At the lowest level, embedded ML models run directly on IoT devices and sensors, performing basic data filtering and anomaly detection. Edge devices, such as ruggedized field gateways, aggregate data from multiple sensors and run more complex models for local decision-making. These edge devices operate in challenging conditions, requiring robust hardware designs and efficient power management to function reliably in remote agricultural settings. The system employs a hierarchical architecture, with more computationally intensive tasks offloaded to on-premises servers or the cloud. This tiered approach allows FarmBeats to balance the need for real-time processing with the benefits of centralized data analysis and model training. The infrastructure also includes mechanisms for over-the-air model updates, ensuring that edge devices can receive improved models as more data becomes available and algorithms are refined.
-In the vision of ubiquitous computing [@weiser1991computer], the integration of processors into everyday objects is just one aspect of a larger paradigm shift. The true essence of this vision lies in creating an intelligent environment that can anticipate our needs and act on our behalf, enhancing our experiences without requiring explicit commands. To achieve this level of pervasive intelligence, it is crucial to develop and deploy machine learning systems that span the entire ecosystem, from the cloud to the edge and even to the tiniest IoT devices.
+
+**Impact and Future Implications**
+
-By distributing machine learning capabilities across the "computing continuum," from cloud to edge to embedded systems that surround us, we can harness the strengths of each layer while mitigating their limitations. The cloud, with its vast computational resources and storage capacity, is ideal for training complex models on large datasets and performing resource-intensive tasks. Edge devices, such as gateways and smartphones, can process data locally, enabling faster response times, improved privacy, and reduced bandwidth requirements. Finally, the tiniest IoT devices, equipped with machine learning capabilities, can make quick decisions based on sensor data, enabling highly responsive and efficient systems.
+FarmBeats demonstrates how ML systems can be effectively deployed in resource-constrained, real-world environments to drive significant improvements in traditional industries. By providing farmers with AI-driven insights, the system has shown potential to increase crop yields, reduce water usage, and optimize resource allocation. Looking forward, the FarmBeats approach could be extended to address global challenges in food security and sustainable agriculture. The success of this system also highlights the growing importance of edge and embedded ML in IoT applications, where bringing intelligence closer to the data source can lead to more responsive, efficient, and scalable solutions. As edge computing capabilities continue to advance, we can expect to see similar distributed ML architectures applied to other domains, from smart cities to environmental monitoring.
-This distributed intelligence is particularly crucial for applications that require real-time processing, such as autonomous vehicles, industrial automation, and smart healthcare. By processing data at the most appropriate layer of the computing continuum, we can ensure that decisions are made quickly and accurately, without relying on constant communication with a central server.
+
+### Case Study: AlphaFold: Large-Scale Scientific ML
+
-The migration of machine learning intelligence across the ecosystem also enables more personalized and context-aware experiences. By learning from user behavior and preferences at the edge, devices can adapt to individual needs without compromising privacy.
This localized intelligence can then be aggregated and refined in the cloud, creating a feedback loop that continuously improves the overall system. +AlphaFold, developed by DeepMind, represents a landmark achievement in the application of machine learning to complex scientific problems. This AI system is designed to predict the three-dimensional structure of proteins from their amino acid sequences, a challenge known as the "protein folding problem" that has puzzled scientists for decades. AlphaFold's success demonstrates how large-scale ML systems can accelerate scientific discovery and potentially revolutionize fields like structural biology and drug design. This case study exemplifies the use of advanced ML techniques and massive computational resources to tackle problems at the frontiers of science. -However, deploying machine learning systems across the computing continuum presents several challenges. Ensuring the interoperability and seamless integration of these systems requires standardized protocols and interfaces. Security and privacy concerns must also be addressed, as the distribution of intelligence across multiple layers increases the attack surface and the potential for data breaches. +**Data Aspects** -Furthermore, the varying computational capabilities and energy constraints of devices at different layers of the computing continuum necessitate the development of efficient and adaptable machine learning models. Techniques such as model compression, federated learning, and transfer learning can help address these challenges, enabling the deployment of intelligence across a wide range of devices. +The data underpinning AlphaFold's success is vast and multifaceted. The primary dataset is the Protein Data Bank (PDB), which contains the experimentally determined structures of over 180,000 proteins. This is complemented by databases of protein sequences, which number in the hundreds of millions. AlphaFold also utilizes evolutionary data in the form of multiple sequence alignments (MSAs), which provide insights into the conservation patterns of amino acids across related proteins. The challenge lies not just in the volume of data, but in its quality and representation. Experimental protein structures can contain errors or be incomplete, requiring sophisticated data cleaning and validation processes. Moreover, the representation of protein structures and sequences in a form amenable to machine learning is a significant challenge in itself. AlphaFold's data pipeline involves complex preprocessing steps to convert raw sequence and structural data into meaningful features that capture the physical and chemical properties relevant to protein folding. -As we move towards the realization of Weiser's vision of ubiquitous computing, the development and deployment of machine learning systems across the entire ecosystem will be critical. By leveraging the strengths of each layer of the computing continuum, we can create an intelligent environment that seamlessly integrates with our daily lives, anticipating our needs and enhancing our experiences in ways that were once unimaginable. As we continue to push the boundaries of what's possible with distributed machine learning, we inch closer to a future where technology becomes an invisible but integral part of our world. +**Algorithm/Model Aspects** -![Common applications of Machine Learning. 
Source: [EDUCBA](https://www.educba.com/applications-of-machine-learning/)](images/png/mlapplications.png){#fig-applications-of-ml}
+
+AlphaFold's algorithmic approach represents a tour de force in the application of deep learning to scientific problems. At its core, AlphaFold uses a novel neural network architecture that combines attention-based deep learning with techniques from computational biology. The model learns to predict inter-residue distances and torsion angles, which are then used to construct a full 3D protein structure. A key innovation is the use of "equivariant attention" layers that respect the symmetries inherent in protein structures. The learning process involves multiple stages, including initial "pretraining" on a large corpus of protein sequences, followed by fine-tuning on known structures. AlphaFold also incorporates domain knowledge in the form of physics-based constraints and scoring functions, creating a hybrid system that leverages both data-driven learning and scientific prior knowledge. The model's ability to generate accurate confidence estimates for its predictions is crucial, allowing researchers to assess the reliability of the predicted structures.
-This vision is already beginning to take shape, as illustrated by the common applications of AI surrounding us in our daily lives (see @fig-applications-of-ml). From healthcare and finance to transportation and entertainment, machine learning is transforming various sectors, making our interactions with technology more intuitive and personalized.
+
+**Computing Infrastructure Aspects**
+
-## What's Inside the Book
+The computational demands of AlphaFold epitomize the challenges of large-scale scientific ML systems. Training the model requires massive parallel computing resources, leveraging clusters of GPUs or TPUs (Tensor Processing Units) in a distributed computing environment. DeepMind utilized Google's cloud infrastructure, with the final version of AlphaFold trained on 128 TPUv3 cores for several weeks. The inference process, while less computationally intensive than training, still requires significant resources, especially when predicting structures for large proteins or processing many proteins in parallel. To make AlphaFold more accessible to the scientific community, DeepMind has collaborated with the European Bioinformatics Institute to create a [public database](https://alphafold.ebi.ac.uk/) of predicted protein structures, which itself represents a substantial computing and data management challenge. This infrastructure allows researchers worldwide to access AlphaFold's predictions without needing to run the model themselves, demonstrating how centralized, high-performance computing resources can be leveraged to democratize access to advanced ML capabilities.
-In this book, we will explore the technical foundations of ubiquitous machine learning systems, the challenges of building and deploying these systems across the computing continuum, and the vast array of applications they enable. A unique aspect of this book is its function as a conduit to seminal scholarly works and academic research papers, aimed at enriching the reader's understanding and encouraging deeper exploration of the subject. This approach seeks to bridge the gap between pedagogical materials and cutting-edge research trends, offering a comprehensive guide that is in step with the evolving field of applied machine learning.
+
+**Impact and Future Implications**
+
-To improve the learning experience, we have included a variety of supplementary materials.
Throughout the book, you will find slides that summarize key concepts, videos that provide in-depth explanations and demonstrations, exercises that reinforce your understanding, and labs that offer hands-on experience with the tools and techniques discussed. These additional resources are designed to cater to different learning styles and help you gain a deeper, more practical understanding of the subject matter. +AlphaFold's impact on structural biology has been profound, with the potential to accelerate research in areas ranging from fundamental biology to drug discovery. By providing accurate structural predictions for proteins that have resisted experimental methods, AlphaFold opens new avenues for understanding disease mechanisms and designing targeted therapies. The success of AlphaFold also serves as a powerful demonstration of how ML can be applied to other complex scientific problems, potentially leading to breakthroughs in fields like materials science or climate modeling. However, it also raises important questions about the role of AI in scientific discovery and the changing nature of scientific inquiry in the age of large-scale ML systems. As we look to the future, the AlphaFold approach suggests a new paradigm for scientific ML, where massive computational resources are combined with domain-specific knowledge to push the boundaries of human understanding. -We begin with the fundamentals, introducing key concepts in systems and machine learning, and providing a deep learning primer. We then guide you through the AI workflow, from data engineering to selecting the right AI frameworks. This workflow closely follows the lifecycle of a typical machine learning project, as illustrated in @fig-ml-lifecycle. +### Case Study: Autonomous Vehicles: Spanning the ML Spectrum -![Machine Learning project life cycle. Source:[Medium](https://ihsanulpro.medium.com/complete-machine-learning-project-flowchart-explained-0f55e52b9381)](images/png/mlprojectlifecycle.png){#fig-ml-lifecycle} +Waymo, a subsidiary of Alphabet Inc., stands at the forefront of autonomous vehicle technology, representing one of the most ambitious applications of machine learning systems to date. Evolving from the Google Self-Driving Car Project initiated in 2009, Waymo's approach to autonomous driving exemplifies how ML systems can span the entire spectrum from embedded systems to cloud infrastructure. This case study demonstrates the practical implementation of complex ML systems in a safety-critical, real-world environment, integrating real-time decision-making with long-term learning and adaptation. -The training section covers efficient AI training techniques, model optimizations, and AI acceleration using specialized hardware. Deployment is addressed next, with chapters on benchmarking AI, distributed learning, and ML operations. Advanced topics like security, privacy, responsible AI, sustainable AI, robust AI, and generative AI are then explored in depth. The book concludes by highlighting the positive impact of AI and its potential for good. +**Data Aspects** -## How to Navigate This Book +The data ecosystem underpinning Waymo's technology is vast and dynamic. Each vehicle serves as a roving data center, its sensor suite—comprising LiDAR, radar, and high-resolution cameras—generating approximately one terabyte of data per hour of driving. 
This real-world data is complemented by an even more extensive simulated dataset, with Waymo's vehicles having traversed over 20 billion miles in simulation and more than 20 million miles on public roads. The challenge lies not just in the volume of data, but in its heterogeneity and the need for real-time processing. Waymo must handle both structured (e.g., GPS coordinates) and unstructured data (e.g., camera images) simultaneously. The data pipeline spans from edge processing on the vehicle itself to massive cloud-based storage and processing systems. Sophisticated data cleaning and validation processes are crucial, given the safety-critical nature of the application. Moreover, the representation of the vehicle's environment in a form amenable to machine learning presents significant challenges, requiring complex preprocessing to convert raw sensor data into meaningful features that capture the dynamics of traffic scenarios. -To get the most out of this book, we recommend a structured learning approach that leverages the various resources provided. Each chapter includes slides, videos, exercises, and labs to cater to different learning styles and reinforce your understanding. +**Algorithm/Model Aspects** -1. **Fundamentals (Chapters 1-3):** Start by building a strong foundation with the initial chapters, which provide an introduction to AI and cover core topics like AI systems and deep learning. +Waymo's ML stack represents a sophisticated ensemble of algorithms tailored to the multifaceted challenge of autonomous driving. The perception system employs deep learning techniques, including convolutional neural networks, to process visual data for object detection and tracking. Prediction models, crucial for anticipating the behavior of other road users, leverage recurrent neural networks to understand temporal sequences. Waymo has developed custom ML models like VectorNet for predicting vehicle trajectories. The planning and decision-making systems may incorporate reinforcement learning or imitation learning techniques to navigate complex traffic scenarios. A key innovation in Waymo's approach is the integration of these diverse models into a coherent system capable of real-time operation. The ML models must also be interpretable to some degree, as understanding the reasoning behind a vehicle's decisions is crucial for safety and regulatory compliance. Waymo's learning process involves continuous refinement based on real-world driving experiences and extensive simulation, creating a feedback loop that constantly improves the system's performance. -2. **Workflow (Chapters 4-6):** With that foundation, move on to the chapters focused on practical aspects of the AI model building process like workflows, data engineering, and frameworks. +**Computing Infrastructure Aspects** -3. **Training (Chapters 7-10):** These chapters offer insights into effectively training AI models, including techniques for efficiency, optimizations, and acceleration. +The computing infrastructure supporting Waymo's autonomous vehicles epitomizes the challenges of deploying ML systems across the full spectrum from edge to cloud. Each vehicle is equipped with a custom-designed compute platform capable of processing sensor data and making decisions in real-time, often leveraging specialized hardware like GPUs or custom AI accelerators. 
This edge computing is complemented by extensive use of cloud infrastructure, leveraging the power of Google's data centers for training models, running large-scale simulations, and performing fleet-wide learning. The connectivity between these tiers is crucial, with vehicles requiring reliable, high-bandwidth communication for real-time updates and data uploading. Waymo's infrastructure must be designed for robustness and fault tolerance, ensuring safe operation even in the face of hardware failures or network disruptions. The scale of Waymo's operation presents significant challenges in data management, model deployment, and system monitoring across a geographically distributed fleet of vehicles. -4. **Deployment (Chapters 11-13):** Learn about deploying AI on devices and monitoring the operationalization through methods like benchmarking, on-device learning, and MLOps. +**Impact and Future Implications** -5. **Advanced Topics (Chapters 14-18):** Critically examine topics like security, privacy, ethics, sustainability, robustness, and generative AI. +Waymo's impact extends beyond technological advancement, potentially revolutionizing transportation, urban planning, and numerous aspects of daily life. The launch of Waymo One, a commercial ride-hailing service using autonomous vehicles in Phoenix, Arizona, represents a significant milestone in the practical deployment of AI systems in safety-critical applications. Waymo's progress has broader implications for the development of robust, real-world AI systems, driving innovations in sensor technology, edge computing, and AI safety that have applications far beyond the automotive industry. However, it also raises important questions about liability, ethics, and the interaction between AI systems and human society. As Waymo continues to expand its operations and explore applications in trucking and last-mile delivery, it serves as a crucial test bed for advanced ML systems, driving progress in areas such as continual learning, robust perception, and human-AI interaction. The Waymo case study underscores both the tremendous potential of ML systems to transform industries and the complex challenges involved in deploying AI in the real world. -6. **Social Impact (Chapter 19):** Explore the positive applications and potential of AI for societal good. +## Challenges and Considerations -7. **Conclusion (Chapter 20):** Reflect on the key takeaways and future directions in AI systems. +Building and deploying machine learning systems presents unique challenges that go beyond traditional software development. These challenges help explain why creating effective ML systems is about more than just choosing the right algorithm or collecting enough data. Let's explore the key areas where ML practitioners face significant hurdles. -While the book is designed for progressive learning, we encourage an interconnected learning approach that allows you to navigate chapters based on your interests and needs. Throughout the book, you'll find case studies and hands-on exercises that help you relate theory to real-world applications. We also recommend participating in forums and groups to engage in [discussions](https://github.com/harvard-edge/cs249r_book/discussions), debate concepts, and share insights with fellow learners. Regularly revisiting chapters can help reinforce your learning and offer new perspectives on the concepts covered. 
By adopting this structured yet flexible approach and actively engaging with the content and the community, you'll embark on a fulfilling and enriching learning experience that maximizes your understanding.
+
+### Data Challenges
+
-## Chapter-by-Chapter Insights
+The foundation of any ML system is its data, and managing this data introduces several fundamental challenges. First, there's the basic question of data quality: real-world data is often messy and inconsistent. Imagine a healthcare application that needs to process patient records from different hospitals. Each hospital might record information differently, use different units of measurement, or have different standards for what data to collect. Some records might have missing information, while others might contain errors or inconsistencies that need to be cleaned up before the data can be useful.
-Here's a closer look at what each chapter covers. We have structured the book into six main sections: Fundamentals, Workflow, Training, Deployment, Advanced Topics, and Impact. These sections closely reflect the major components of a typical machine learning pipeline, from understanding the basic concepts to deploying and maintaining AI systems in real-world applications. By organizing the content in this manner, we aim to provide a logical progression that mirrors the actual process of developing and implementing AI systems.
+
+As ML systems grow, they often need to handle increasingly large amounts of data. A video streaming service like Netflix, for example, needs to process billions of viewer interactions to power its recommendation system. This scale introduces new challenges in how to store, process, and manage such large datasets efficiently.
-### Fundamentals
+
+Another critical challenge is how data changes over time. This phenomenon, known as "data drift," occurs when the patterns in new data begin to differ from the patterns the system originally learned from. For example, many predictive models struggled during the COVID-19 pandemic because consumer behavior changed so dramatically that historical patterns became less relevant. ML systems need ways to detect when this happens and adapt accordingly.
-In the Fundamentals section, we lay the groundwork for understanding AI. This is far from being a thorough deep dive into the algorithms, but we aim to introduce key concepts, provide an overview of machine learning systems, and dive into the principles and algorithms of deep learning that power AI applications in their associated systems. This section equips you with the essential knowledge needed to grasp the subsequent chapters.
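+
+To ground the data-quality point, here is a small, hypothetical sketch of the kind of cleanup the hospital example calls for: harmonizing a measurement recorded in different units and dropping incomplete rows. The column names and unit conventions are assumptions for the example.
+
+```python
+# Minimal data-cleaning sketch (illustrative): harmonize patient weights
+# recorded by different hospitals in different units.
+import pandas as pd
+
+records = pd.DataFrame({
+    "hospital": ["A", "A", "B", "C"],
+    "weight": [154.0, 176.0, 70.0, None],  # hospital A records pounds
+    "unit": ["lb", "lb", "kg", "kg"],
+})
+
+LB_TO_KG = 0.453592
+records["weight_kg"] = records["weight"].where(
+    records["unit"] == "kg", records["weight"] * LB_TO_KG
+)
+records = records.dropna(subset=["weight_kg"])  # drop incomplete rows
+print(records[["hospital", "weight_kg"]])
+```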
+
+### Model Challenges
+
-1. **[Introduction:](../introduction/introduction.qmd)** This chapter sets the stage, providing an overview of AI and laying the groundwork for the chapters that follow.
-2. **[ML Systems:](../ml_systems/ml_systems.qmd)** We introduce the basics of machine learning systems, the platforms where AI algorithms are widely applied.
-3. **[Deep Learning Primer:](../dl_primer/dl_primer.qmd)** This chapter offers a brief introduction to the algorithms and principles that underpin AI applications in ML systems.
+Creating and maintaining the ML models themselves presents another set of challenges. Modern ML models, particularly in deep learning, can be extremely complex. Consider a language model like GPT-3, which has 175 billion parameters (the individual settings the model learns during training). This complexity creates practical challenges: these models require enormous computing power to train and run, making it difficult to deploy them in situations with limited resources, like on mobile phones or IoT devices.
-### Workflow
+
+Training these models effectively is itself a significant challenge. Unlike traditional programming where we write explicit instructions, ML models learn from examples. This learning process involves many choices: How should we structure the model? How long should we train it? How can we tell if it's learning the right things? Making these decisions often requires both technical expertise and considerable trial and error.
-The Workflow section guides you through the practical aspects of building AI models. We break down the AI workflow, discuss data engineering best practices, and review popular AI frameworks. By the end of this section, you'll have a clear understanding of the steps involved in developing proficient AI applications and the tools available to streamline the process.
+
+A particularly important challenge is ensuring that models work well in real-world conditions. A model might perform excellently on its training data but fail when faced with slightly different situations in the real world. This gap between training performance and real-world performance is a central challenge in machine learning, especially for critical applications like autonomous vehicles or medical diagnosis systems.
-4. **[AI Workflow:](../workflow/workflow.qmd)** This chapter breaks down the machine learning workflow, offering insights into the steps leading to proficient AI applications.
-5. **[Data Engineering:](../data_engineering/data_engineering.qmd)** We focus on the importance of data in AI systems, discussing how to effectively manage and organize data.
-6. **[AI Frameworks:](../frameworks/frameworks.qmd)** This chapter reviews different frameworks for developing machine learning models, guiding you in choosing the most suitable one for your projects.
+
+### System Challenges
+
-### Training
+Getting ML systems to work reliably in the real world introduces its own set of challenges. Unlike traditional software that follows fixed rules, ML systems need to handle uncertainty and variability in their inputs and outputs. They also typically need both training systems (for learning from data) and serving systems (for making predictions), each with different requirements and constraints.
-In the Training section, we explore techniques for training efficient and reliable AI models. We cover strategies for achieving efficiency, model optimizations, and the role of specialized hardware in AI acceleration. This section empowers you with the knowledge to develop high-performing models that can be seamlessly integrated into AI systems.
+
+Consider a company building a speech recognition system. They need infrastructure to collect and store audio data, systems to train models on this data, and then separate systems to actually process users' speech in real-time. Each part of this pipeline needs to work reliably and efficiently, and all the parts need to work together seamlessly.
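+
+A minimal sketch of that training/serving split appears below: an offline job fits a model and saves it as an artifact, and a separate serving function loads the artifact to answer individual requests. The file path, dataset, and model are placeholder assumptions, not a production design.
+
+```python
+# Minimal sketch of the training/serving split (illustrative placeholders).
+import joblib
+from sklearn.datasets import load_iris
+from sklearn.ensemble import RandomForestClassifier
+
+MODEL_PATH = "model.joblib"  # hypothetical artifact location
+
+def train() -> None:
+    """Offline training system: learn from historical data, save an artifact."""
+    X, y = load_iris(return_X_y=True)
+    model = RandomForestClassifier(random_state=0).fit(X, y)
+    joblib.dump(model, MODEL_PATH)
+
+def serve(features: list) -> int:
+    """Online serving system: load the artifact and answer one request."""
+    model = joblib.load(MODEL_PATH)  # in practice, loaded once and cached
+    return int(model.predict([features])[0])
+
+if __name__ == "__main__":
+    train()
+    print(serve([5.1, 3.5, 1.4, 0.2]))  # -> predicted class index
+```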
-7. **[AI Training:](../training/training.qmd)** This chapter explores model training, exploring techniques for developing efficient and reliable models.
-8. **[Efficient AI:](../efficient_ai/efficient_ai.qmd)** Here, we discuss strategies for achieving efficiency in AI applications, from computational resource optimization to performance enhancement.
-9. **[Model Optimizations:](../optimizations/optimizations.qmd)** We explore various avenues for optimizing AI models for seamless integration into AI systems.
-10. **[AI Acceleration:](../hw_acceleration/hw_acceleration.qmd)** We discuss the role of specialized hardware in enhancing the performance of AI systems.
+
+These systems also need constant monitoring and updating. How do we know if the system is working correctly? How do we update models without interrupting service? How do we handle errors or unexpected inputs? These operational challenges become particularly complex when ML systems are serving millions of users.
+
-### Deployment
+### Ethical and Social Considerations
+
-The Deployment section focuses on the challenges and solutions for deploying AI models. We discuss benchmarking methods to evaluate AI system performance, techniques for on-device learning to improve efficiency and privacy, and the processes involved in ML operations. This section equips you with the skills to effectively deploy and maintain AI functionalities in AI systems.
+As ML systems become more prevalent in our daily lives, their broader impacts on society become increasingly important to consider. One major concern is fairness: ML systems can sometimes learn to make decisions that discriminate against certain groups of people. This often happens unintentionally, as the systems pick up biases present in their training data. For example, a job application screening system might inadvertently learn to favor certain demographics if those groups were historically more likely to be hired.
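+
+One simple, widely used first check for this concern is to compare selection rates across groups (demographic parity), as the hypothetical sketch below does on synthetic decisions. The groups, data, and interpretation are illustrative assumptions; a large gap warrants scrutiny rather than proving discrimination.
+
+```python
+# Minimal fairness-check sketch: compare positive-decision rates across
+# groups (demographic parity). Synthetic, illustrative data.
+import numpy as np
+
+def selection_rates(decisions: np.ndarray, groups: np.ndarray) -> dict:
+    """Fraction of positive decisions (1s) for each group label."""
+    return {g: float(decisions[groups == g].mean()) for g in np.unique(groups)}
+
+decisions = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])  # e.g., interview offers
+groups = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])
+
+rates = selection_rates(decisions, groups)
+gap = max(rates.values()) - min(rates.values())
+print(rates, "demographic parity gap:", gap)
+```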
-11. **[Benchmarking AI:](../benchmarking/benchmarking.qmd)** This chapter focuses on how to evaluate AI systems through systematic benchmarking methods.
-12. **[On-Device Learning:](../ondevice_learning/ondevice_learning.qmd)** We explore techniques for localized learning, which enhances both efficiency and privacy.
-13. **[ML Operations:](../ops/ops.qmd)** This chapter looks at the processes involved in the seamless integration, monitoring, and maintenance of AI functionalities.
+
+Another crucial consideration is transparency. Many modern ML models, particularly deep learning models, work as "black boxes": while they can make predictions, it's often difficult to understand how they arrived at their decisions. This becomes particularly problematic when ML systems are making important decisions about people's lives, such as in healthcare or financial services.
+
-### Advanced Topics
+Privacy is also a major concern. ML systems often need large amounts of data to work effectively, but this data might contain sensitive personal information. How do we balance the need for data with the need to protect individual privacy? How do we ensure that models don't inadvertently memorize and reveal private information?
+
-In the Advanced Topics section, We will study the critical issues surrounding AI. We address privacy and security concerns, explore the ethical principles of responsible AI, discuss strategies for sustainable AI development, examine techniques for building robust AI models, and introduce the exciting field of generative AI. This section broadens your understanding of the complex landscape of AI and prepares you to navigate its challenges.
+These challenges aren't merely technical problems to be solved, but ongoing considerations that shape how we approach ML system design and deployment. Throughout this book, we'll explore these challenges in detail and examine strategies for addressing them effectively.
+
-14. **[Security & Privacy:](../privacy_security/privacy_security.qmd)** As AI becomes more ubiquitous, this chapter addresses the crucial aspects of privacy and security in AI systems.
-15. **[Responsible AI:](../responsible_ai/responsible_ai.qmd)** We discuss the ethical principles guiding the responsible use of AI, focusing on fairness, accountability, and transparency.
-16. **[Sustainable AI:](../sustainable_ai/sustainable_ai.qmd)** This chapter explores practices and strategies for sustainable AI, ensuring long-term viability and reduced environmental impact.
-17. **[Robust AI:](../robust_ai/robust_ai.qmd)** We discuss techniques for developing reliable and robust AI models that can perform consistently across various conditions.
-18. **[Generative AI:](../generative_ai/generative_ai.qmd)** This chapter explores the algorithms and techniques behind generative AI, opening avenues for innovation and creativity.
+
+## Future Directions
+
-### Social Impact
+As we look to the future of machine learning systems, several exciting trends are shaping the field. These developments promise to both solve existing challenges and open new possibilities for what ML systems can achieve.
-The Impact section highlights the transformative potential of AI in various domains. We showcase real-world applications of TinyML in healthcare, agriculture, conservation, and other areas where AI is making a positive difference. This section inspires you to leverage the power of AI for societal good and to contribute to the development of impactful solutions.
+
+One of the most significant trends is the democratization of AI technology. Just as personal computers transformed computing from specialized mainframes to everyday tools, ML systems are becoming more accessible to developers and organizations of all sizes. Cloud providers now offer pre-trained models and automated ML platforms that reduce the expertise needed to deploy AI solutions. This democratization is enabling new applications across industries, from small businesses using AI for customer service to researchers applying ML to previously intractable problems.
-19. **[AI for Good:](../ai_for_good/ai_for_good.qmd)** We highlight positive applications of TinyML in areas like healthcare, agriculture, and conservation.
+
+As concerns about computational costs and environmental impact grow, there's an increasing focus on making ML systems more efficient. Researchers are developing new techniques for training models with less data and computing power. Innovation in specialized hardware, from improved GPUs to custom AI chips, is making ML systems faster and more energy-efficient. These advances could make sophisticated AI capabilities available on more devices, from smartphones to IoT sensors.
-### Closing
+
+Perhaps the most transformative trend is the development of more autonomous ML systems that can adapt and improve themselves. These systems are beginning to handle their own maintenance tasks: detecting when they need retraining, automatically finding and correcting errors, and optimizing their own performance. This automation could dramatically reduce the operational overhead of running ML systems while improving their reliability.
-In the Closing section, we reflect on the key learnings from the book and look ahead to the future of AI. We synthesize the concepts covered, discuss emerging trends, and provide guidance on continuing your learning journey in this rapidly evolving field.
This section leaves you with a comprehensive understanding of AI and the excitement to apply your knowledge in innovative ways. +While these trends are promising, it's important to recognize the field's limitations. Creating truly artificial general intelligence remains a distant goal. Current ML systems excel at specific tasks but lack the flexibility and understanding that humans take for granted. Challenges around bias, transparency, and privacy continue to require careful consideration. As ML systems become more prevalent, addressing these limitations while leveraging new capabilities will be crucial. -20. **[Conclusion:](../conclusion/conclusion.qmd)** The book concludes with a reflection on the key learnings and future directions in the field of AI. +## Learning Path and Book Structure -### Tailored Learning +This book is designed to guide you from understanding the fundamentals of ML systems to effectively designing and implementing them. To address the complexities and challenges of Machine Learning Systems engineering, we've organized the content around five fundamental pillars that encompass the lifecycle of ML systems. These pillars provide a framework for understanding, developing, and maintaining robust ML systems. -We understand that readers have diverse interests; some may wish to grasp the fundamentals, while others are eager to delve into advanced topics like hardware acceleration or AI ethics. To help you navigate the book more effectively, we've created a persona-based reading guide tailored to your specific interests and goals. This guide assists you in identifying the reader persona that best matches your interests. Each persona represents a distinct reader profile with specific objectives. By selecting the persona that resonates with you, you can focus on the chapters and sections most relevant to your needs. +![Overview of the five fundamental system pillars of Machine Learning Systems engineering.](images/png/book_pillars.png){#fig-pillars} -+------------------------+--------------------------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------------------------+ -| Persona | Description | Relevant Chapters or Sections | Focus | -+:=======================+:=========================================================================+:==============================================+:==========================================================================================================+ -| The TinyML Newbie | You are new to the field of TinyML and eager to learn the basics. | Chapters 1-3, 8, 9, 10, 12 | Understand the fundamentals, gain insights into efficient and optimized ML, | -| | | | and learn about on-device learning. | -+------------------------+--------------------------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------------------------+ -| The EdgeML Enthusiast | You have some TinyML knowledge and are interested in exploring | Chapters 1-3, 8, 9, 10, 12, 13 | Build a strong foundation, delve into the intricacies of efficient ML, | -| | the broader world of EdgeML. | | and explore the operational aspects of embedded systems. 
| -+------------------------+--------------------------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------------------------+ -| The Computer Visionary | You are fascinated by computer vision and its applications in TinyML | Chapters 1-3, 5, 8-10, 12, 13, 17, 20 | Start with the basics, explore data engineering, and study methods for optimizing ML | -| | and EdgeML. | | models. Learn about robustness and the future of ML systems. | -+------------------------+--------------------------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------------------------+ -| The Data Maestro | You are passionate about data and its crucial role in ML systems. | Chapters 1-5, 8-13 | Gain a comprehensive understanding of data's role in ML systems, explore the ML | -| | | | workflow, and dive into model optimization and deployment considerations. | -+------------------------+--------------------------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------------------------+ -| The Hardware Hero | You are excited about the hardware aspects of ML systems and how | Chapters 1-3, 6, 8-10, 12, 14, 17, 20 | Build a solid foundation in ML systems and frameworks, explore challenges of | -| | they impact model performance. | | optimizing models for efficiency, hardware-software co-design, and security aspects. | -+------------------------+--------------------------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------------------------+ -| The Sustainability | You are an advocate for sustainability and want to learn how to | Chapters 1-3, 8-10, 12, 15, 16, 20 | Begin with the fundamentals of ML systems and TinyML, explore model optimization | -| Champion | develop eco-friendly AI systems. | | techniques, and learn about responsible and sustainable AI practices. | -+------------------------+--------------------------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------------------------+ -| The AI Ethicist | You are concerned about the ethical implications of AI and want to | Chapters 1-3, 5, 7, 12, 14-16, 19, 20 | Gain insights into the ethical considerations surrounding AI, including fairness, | -| | ensure responsible development and deployment. | | privacy, sustainability, and responsible development practices. | -+------------------------+--------------------------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------------------------+ -| The Full-Stack ML | You are a seasoned ML expert and want to deepen your understanding | The entire book | Understand the end-to-end process of building and deploying ML systems, from data | -| Engineer | of the entire ML system stack. | | engineering and model optimization to hardware acceleration and ethical considerations. 
| -+------------------------+--------------------------------------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------------------------+ +As illustrated in Figure @fig-pillars, the five pillars central to the framework are: +- **Data**: Emphasizing data engineering and foundational principles critical to how AI operates in relation to data. +- **Training**: Exploring the methodologies for AI training, focusing on efficiency, optimization, and acceleration techniques to enhance model performance. +- **Deployment**: Encompassing benchmarks, on-device learning strategies, and machine learning operations to ensure effective model application. +- **Operations**: Highlighting the maintenance challenges unique to machine learning systems, which require specialized approaches distinct from traditional engineering systems. +- **Ethics & Governance**: Addressing concerns such as security, privacy, responsible AI practices, and the broader societal implications of AI technologies. -## Join the Community +Each pillar represents a critical phase in the lifecycle of ML systems and is composed of foundational elements that build upon each other. This structure ensures a comprehensive understanding of MLSE, from basic principles to advanced applications and ethical considerations. -Learning in the fast-paced world of AI is a collaborative journey. We set out to nurture a vibrant community of learners, innovators, and contributors. As you explore the concepts and engage with the exercises, we encourage you to share your insights and experiences. Whether it's a novel approach, an interesting application, or a thought-provoking question, your contributions can enrich the learning ecosystem. Engage in discussions, offer and seek guidance, and collaborate on projects to foster a culture of mutual growth and learning. By sharing knowledge, you play an important role in fostering a globally connected, informed, and empowered community. +For more detailed information about the book's overview, contents, learning outcomes, target audience, prerequisites, and navigation guide, please refer to the [About the Book](../../about.qmd) section. There, you'll also find valuable details about our learning community and how to maximize your experience with this resource. \ No newline at end of file