
We begin by discussing data collection: Where do we source data, and how do we gather it? Options range from scraping the web, accessing APIs, utilizing sensors and IoT devices, to conducting surveys and gathering user input. These methods reflect real-world practices. Next, we delve into data labeling, including considerations for human involvement. We’ll discuss the trade-offs and limitations of human labeling and explore emerging methods for automated labeling. Following that, we’ll address data cleaning and preprocessing, a crucial yet frequently undervalued step in preparing raw data for AI model training. Data augmentation comes next, a strategy for enhancing limited datasets by generating synthetic samples. This is particularly pertinent for embedded systems, as many use cases don’t have extensive data repositories readily available for curation. Synthetic data generation emerges as a viable alternative, though it comes with its own set of advantages and disadvantages. We’ll also touch upon dataset versioning, emphasizing the importance of tracking data modifications over time. Data is ever-evolving; hence, it’s imperative to devise strategies for managing and storing expansive datasets. By the end of this section, you’ll possess a comprehensive understanding of the entire data pipeline, from collection to storage, essential for operationalizing AI systems. Let’s embark on this journey!

## Problem Definition

In many domains of machine learning, while sophisticated algorithms take center stage, the fundamental importance of data quality is often overlooked. This neglect gives rise to [“Data Cascades”](https://research.google/pubs/pub49953/) — events where lapses in data quality compound, leading to negative downstream consequences such as flawed predictions, project terminations, and even potential harm to communities.

![A visual representation of the stages in the machine learning pipeline and the potential pitfalls, illustrating how data quality lapses can lead to cascading negative consequences throughout the process.](images/data_engineering_cascades.png)

This level of accuracy and robustness hinges on the availability of data, the quality of that data, the ability to label it correctly, and transparency of the data for the end user—all before the data is used to train the model. But it all begins with a clear understanding of the problem statement or definition.

Generally, in ML, problem definition has a few key steps:

1. Identifying the problem clearly

2. Setting clear objectives

3. Establishing success benchmarks

4. Engaging with stakeholders and end users

5. Understanding the constraints and limitations of the deployment environment

6. Finally, conducting the data collection.

Laying a solid foundation for a project is essential for its trajectory and eventual success. Central to this foundation is first identifying a clear problem, such as ensuring that voice commands in voice assistance systems are recognized consistently across varying environments. Clear objectives, like creating representative datasets for diverse scenarios, provide a unified direction. Benchmarks, such as system accuracy in keyword detection, offer measurable outcomes to gauge progress. Engaging with stakeholders, from end-users to investors, provides invaluable insights and ensures alignment with market needs. Additionally, when delving into areas like voice assistance, understanding platform constraints is pivotal. Embedded systems, such as microcontrollers, come with inherent limitations in processing power, memory, and energy efficiency. Recognizing these limitations ensures that functionalities, like keyword detection, are tailored to operate optimally, balancing performance with resource conservation.

In this context, using KWS as an example, we can break each of the steps out as follows:
7. **Iterative Feedback and Refinement:**
   Once a prototype KWS system is developed, it's crucial to test it in real-world scenarios, gather feedback, and iteratively refine the model. This ensures that the system remains aligned with the defined problem and objectives. This matters because deployment scenarios evolve over time.
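
To make these steps concrete, the sketch below expresses a hypothetical KWS problem definition as a small, machine-readable specification. Every name and value in it is an illustrative assumption, not a prescription.

```python
# A hypothetical, illustrative problem definition for a KWS project;
# every value below is an assumption for the sake of example.
kws_problem_definition = {
    "problem": "Detect the wake word 'hello' on a battery-powered device",
    "objectives": [
        "Recognize the keyword across accents and noisy environments",
        "Run fully on-device, without cloud connectivity",
    ],
    "benchmarks": {
        "min_true_accept_rate": 0.95,     # detect >= 95% of spoken keywords
        "max_false_accepts_per_hour": 1,  # at most one spurious wake per hour
        "max_latency_ms": 200,
    },
    "stakeholders": ["end users", "hardware team", "product owners"],
    "platform_constraints": {
        "ram_kb": 256,        # a typical microcontroller memory budget
        "flash_mb": 1,
        "power_budget_mw": 50,
    },
    "data_plan": "Collect and label diverse keyword and non-keyword audio",
}

print(kws_problem_definition["benchmarks"])
```

Writing the definition down in this form makes the objectives, benchmarks, and constraints explicit artifacts that stakeholders can review before any data is collected.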

## Data Sourcing

The quality and diversity of the data gathered are important for developing accurate and robust AI systems. Sourcing high-quality training data requires careful consideration of the objectives, resources, and ethical implications. Data can be obtained from various sources depending on the needs of the project:

### Pre-Existing Datasets

Platforms like [Kaggle](https://www.kaggle.com/) and [UCI Machine Learning Repository](https://archive.ics.uci.edu/) provide a convenient starting point. Pre-existing datasets are a valuable resource for researchers, developers, and businesses alike. One of their primary advantages is cost-efficiency. Creating a dataset from scratch can be both time-consuming and expensive, so having access to ready-made data can save significant resources. Moreover, many of these datasets, like [ImageNet](https://www.image-net.org/), have become standard benchmarks in the machine learning community, allowing for consistent performance comparisons across different models and algorithms. This availability of data means that experiments can be started immediately without any delays associated with data collection and preprocessing. In a fast-moving field like ML, this expediency is important.
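
As a quick illustration of this expediency, a standard benchmark can often be pulled down with a few lines of Python. The sketch below assumes the Hugging Face `datasets` package is installed and that the Speech Commands corpus, a common keyword-spotting benchmark, is hosted under the `google/speech_commands` identifier.

```python
# A minimal sketch, assuming the Hugging Face `datasets` package is
# installed and that Speech Commands is hosted under this identifier.
from datasets import load_dataset

# Pull down a standard keyword-spotting benchmark rather than
# collecting and labeling audio from scratch.
speech = load_dataset("google/speech_commands", "v0.02", split="train")

print(speech)                               # number of clips and features
print(speech.features["label"].names[:10])  # first few keyword classes
```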

The quality assurance that comes with popular pre-existing datasets is important to consider because several datasets have errors in them. For instance, [the ImageNet dataset was found to have over 6.4% errors](https://arxiv.org/abs/2103.14749). Given their widespread use, any errors or biases in these datasets are often identified and rectified by the community. This assurance is especially beneficial for students and newcomers to the field, as they can focus on learning and experimentation without worrying about data integrity. The supporting documentation that often accompanies existing datasets is also invaluable, though it generally applies only to widely used datasets. Good documentation provides insights into the data collection process and variable definitions, and sometimes even offers baseline model performances. This information not only aids understanding but also promotes reproducibility in research, a cornerstone of scientific integrity; indeed, there is an ongoing effort to [improve reproducibility in machine learning systems](https://arxiv.org/abs/2003.12206). When other researchers have access to the same data, they can validate findings, test new hypotheses, or apply different methodologies, allowing us to build on each other's work more rapidly.

While platforms like Kaggle and UCI Machine Learning Repository are invaluable resources, it's essential to understand the context in which the data was collected. Researchers should be wary of potential overfitting when using popular datasets, as multiple models might have been trained on them, leading to inflated performance metrics. Sometimes these [datasets do not reflect real-world data](https://venturebeat.com/uncategorized/3-big-problems-with-datasets-in-ai-and-machine-learning/).

In addition, bias, validity, and reproducibility issues may exist in these datasets, and awareness of these problems has grown in recent years. Furthermore, using the same dataset to train different models, as shown in the figure below, can sometimes create misalignment, where the models do not accurately reflect the real world.

![Training different models from the same dataset. Neural network icons (from left to right, by: Becris; Freepik; Freepik; Paul J; SBTS2018)](images/data_engineering/dataset_myopia.png)

### Web Scraping


Web scraping can also collect structured data like stock prices, weather data, or product information for analytical applications. Once data is scraped, it is essential to store it in a structured manner, often using databases or data warehouses. Proper data management ensures the usability of the scraped data for future analysis and applications.

However, while web scraping offers numerous advantages, there are significant limitations and ethical considerations to bear in mind. Not all websites permit scraping, and violating these restrictions can lead to legal repercussions. It is also unethical and potentially illegal to scrape copyrighted material or private communications. Ethical web scraping mandates adherence to a website's 'robots.txt' file, which outlines the sections of the site that can be accessed and scraped by automated bots.

To deter automated scraping, many websites implement rate limits. If a bot sends too many requests in a short period, it might be temporarily blocked, restricting the speed of data access. Additionally, the dynamic nature of web content means that data scraped at different intervals might lack consistency, posing challenges for longitudinal studies. That said, emerging techniques such as [Web Navigation](https://arxiv.org/abs/1812.09195), in which machine learning agents automatically navigate a website, can help access such dynamic content.
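
To sketch what respectful scraping might look like in practice, the example below checks `robots.txt` before fetching each page and rate-limits itself between requests. The site, paths, and user-agent string are placeholders, and the third-party `requests` and `beautifulsoup4` packages are assumed to be installed.

```python
# A minimal sketch of polite web scraping; the site, paths, and
# user-agent below are placeholders.
import time
import urllib.robotparser

import requests
from bs4 import BeautifulSoup

BASE = "https://example.com"
robots = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
robots.read()  # fetch and parse the site's crawling rules

for path in ["/catalog?page=1", "/catalog?page=2"]:
    url = BASE + path
    if not robots.can_fetch("my-research-bot", url):
        continue  # respect robots.txt exclusions
    response = requests.get(url, headers={"User-Agent": "my-research-bot"}, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    headings = [h.get_text(strip=True) for h in soup.find_all("h2")]
    print(url, headings)
    time.sleep(2)  # self-imposed rate limit between requests
```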

For niche subjects, the volume of pertinent data available for scraping might be limited. For example, while scraping for common topics like images of cats and dogs might yield abundant data, searching for rare medical conditions might not be as fruitful. Moreover, the data obtained through scraping is often unstructured and noisy, necessitating thorough preprocessing and cleaning. It is crucial to understand that not all scraped data will be of high quality or accuracy. Employing verification methods, such as cross-referencing with alternate data sources, can enhance data reliability.

Privacy concerns arise when scraping personal data, emphasizing the need for anonymization. Therefore, it is paramount to adhere to a website's Terms of Service, confine data collection to public domains, and ensure the anonymity of any personal data acquired.
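
One possible, deliberately simplified way to pseudonymize identifiers before storage is a salted one-way hash, as sketched below. Note that hashing alone is pseudonymization rather than full anonymization, and real deployments need careful key management and legal review.

```python
# A minimal sketch of pseudonymizing identifiers with a salted one-way
# hash; this is illustrative, not a complete privacy solution.
import hashlib
import os

SALT = os.urandom(16)  # keep secret, and store it separately from the data

def pseudonymize(identifier: str) -> str:
    """Replace a raw identifier with a salted, irreversible hash."""
    return hashlib.sha256(SALT + identifier.encode("utf-8")).hexdigest()

record = {"user": pseudonymize("alice@example.com"), "comment": "Great post!"}
print(record)
```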

While web scraping can be a scalable method to amass large training datasets for AI systems, its applicability is confined to specific data types. For example, sourcing inertial measurement unit (IMU) data for gesture recognition is not straightforward through web scraping. At most, one might be able to scrape an existing dataset.

Web scraping can yield inconsistent or inaccurate data. For example, the photo below shows up when you search 'traffic light' on Google Images. It is an image from 1914 that shows outdated traffic lights, which are also barely discernible because of the image's poor quality.

![The first traffic lights were installed in 1914, and a Google search for the keywords 'traffic light' may yield results related to them. This can be problematic for web-scraped datasets, as it pollutes the dataset with inapplicable data samples. Source: [Vox](https://www.vox.com/2015/8/5/9097713/when-was-the-first-traffic-light-installed)](images/data_engineering/1914_traffic.jpeg)

### Crowdsourcing


- *Merges:* Merges help to integrate changes from different branches while maintaining the integrity of the data.

With data version control in place, we are able to track the changes as shown below, reproduce previous results by reverting to older versions, and collaborate safely by branching off and isolating the changes.


![Similar to code versioning, data versioning can help us track changes and roll back dataset updates.](images/data_engineering/data_version_ctrl.png)

**Popular Data Version Control Systems**


**[[Git LFS]{.underline}](https://git-lfs.com/):** It is useful for data version control on smaller-sized datasets. It uses Git's built-in branching and merging features, but is limited in tracking metrics, reverting to previous versions, or integrating with data lakes.
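
As a concrete illustration of retrieving a specific dataset version, the sketch below uses DVC's Python API. It assumes the `dvc` package is installed and that the repository shown tracks its data with DVC; the repository URL, file path, and tag are hypothetical.

```python
# A minimal sketch, assuming the `dvc` package is installed and that the
# (hypothetical) repository below tracks its data with DVC.
import dvc.api

# Read the file exactly as it existed at the Git tag "v1.0", regardless
# of what the current working copy contains.
with dvc.api.open(
    "data/keywords.csv",                         # hypothetical tracked file
    repo="https://github.com/example/kws-data",  # hypothetical repository
    rev="v1.0",                                  # a Git tag, branch, or commit
) as f:
    print(f.readline())
```

Pinning reads to a tag like this is what makes experiments reproducible: the same code plus the same `rev` yields the same data.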

## Optimizing Data for Embedded AI

Creators working on embedded systems may have unusual priorities when cleaning their datasets. On the one hand, models may be developed for unusually specific use cases, requiring heavy filtering of datasets. While a general-purpose natural language model may be capable of turning any speech to text, a model for an embedded system may be focused on a single limited task, such as detecting a keyword. As a result, creators may aggressively filter out large amounts of data because it does not address the task of interest. Additionally, an embedded AI system may be tied to specific hardware devices or environments. For example, a video model may need to process images from a single type of camera, which will only be mounted on doorbells in residential neighborhoods. In this scenario, creators may discard images if they came from a different kind of camera, showed the wrong type of scenery, or were taken from the wrong height or angle.
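
As a minimal illustration of this kind of task- and hardware-specific filtering, the sketch below drops samples whose metadata does not match the doorbell scenario; all fields and values are hypothetical.

```python
# A minimal sketch of task- and hardware-specific filtering; all
# metadata fields and values are hypothetical.
images = [
    {"file": "img_001.jpg", "camera": "doorbell_v2", "scene": "residential"},
    {"file": "img_002.jpg", "camera": "phone",       "scene": "residential"},
    {"file": "img_003.jpg", "camera": "doorbell_v2", "scene": "highway"},
]

# Discard anything from the wrong camera or the wrong kind of scenery.
kept = [
    img for img in images
    if img["camera"] == "doorbell_v2" and img["scene"] == "residential"
]
print(f"kept {len(kept)} of {len(images)} images")
```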

On the other hand, embedded AI systems are often expected to provide especially accurate performance in unpredictable real-world settings. This may lead creators to design datasets specifically to represent variations in potential inputs and promote model robustness. As a result, they may define a narrow scope for their project but then aim for deep coverage within those bounds. For example, creators of the doorbell model mentioned above might try to cover variations in the data arising from differences in hardware, environmental conditions, and the people or objects being recorded.

As described above, creators may consider crowdsourcing or synthetically generating data to include these different kinds of variations.

## Data Transparency

By providing clear, detailed documentation, creators can help developers understand how best to use their datasets. Several groups have suggested standardized documentation formats for datasets, such as Data Cards (@Pushkarna_Zaldivar_Kjartansson_2022), datasheets (@Gebru_Morgenstern_Vecchione_Vaughan_Wallach_III_Crawford_2021), data statements (@Bender_Friedman_2018), or Data Nutrition Labels (@Holland_Hosny_Newman_Joseph_Chmielinski_2020). When releasing a dataset, creators may describe what kinds of data they collected, how they collected and labeled it, and what kinds of use cases may be a good or poor fit for the dataset. Quantitatively, it may be appropriate to provide a breakdown of how well the dataset represents different groups (e.g. different gender groups, different cameras).

Below is an example of a data card for a computer vision dataset. It includes basic information about the dataset and guidance on how it should and should not be used, including known biases.

![Example of a data card. Source: (@Pushkarna_Zaldivar_Kjartansson_2022)](images/data_engineering/data_card.png)
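
To give a flavor of what such documentation can look like in machine-readable form, below is a small, hypothetical data card written as plain Python; the fields loosely follow the spirit of the formats above rather than any single standard.

```python
# A hypothetical, minimal data card; the fields are illustrative and do
# not follow any single standard exactly.
import json

data_card = {
    "name": "doorbell-scenes-v1",
    "summary": "Doorbell-camera images collected for person detection",
    "collection": "Captured by volunteer households, with consent",
    "labeling": "Bounding boxes drawn by trained annotators",
    "intended_uses": ["on-device person detection"],
    "out_of_scope_uses": ["face recognition", "tracking individuals"],
    "known_biases": "Mostly daytime images from suburban neighborhoods",
    "representation": {"day": 0.82, "night": 0.18},  # share of samples
}

print(json.dumps(data_card, indent=2))  # render for a README or report
```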

Keeping track of data provenance—essentially the origins and the journey of each data point through the data pipeline—is not merely a good practice but an essential requirement for data quality. Data provenance contributes significantly to the transparency of machine learning systems. Transparent systems make it easier to scrutinize data points, enabling better identification and rectification of errors, biases, or inconsistencies. For instance, if a ML model trained on medical data is underperforming in particular areas, tracing back the data provenance can help identify whether the issue is with the data collection methods, the demographic groups represented in the data, or other factors. This level of transparency doesn’t just help in debugging the system but also plays a crucial role in enhancing the overall data quality. By improving the reliability and credibility of the dataset, data provenance also enhances the model’s performance and its acceptability among end-users.
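
One possible, hypothetical way to operationalize provenance is to attach a small lineage record to each data artifact and append to it at every pipeline stage, as sketched below; the stages and fields are illustrative.

```python
# A minimal sketch of per-artifact provenance tracking; the stages and
# fields are hypothetical.
import hashlib
from datetime import datetime, timezone

def fingerprint(raw: bytes) -> str:
    """Content hash of the raw data, so later modification is detectable."""
    return hashlib.sha256(raw).hexdigest()

record = {
    "source": "https://example.com/clinic-a/export.csv",  # hypothetical origin
    "sha256": fingerprint(b"...raw bytes of the file..."),
    "lineage": [],
}

def log_stage(rec: dict, stage: str) -> None:
    """Append a timestamped entry for each pipeline step that touches the data."""
    rec["lineage"].append({
        "stage": stage,
        "at": datetime.now(timezone.utc).isoformat(),
    })

log_stage(record, "collected")
log_stage(record, "de-identified")
log_stage(record, "normalized units")
print(record["lineage"])
```

With records like this, an underperforming model can be traced back to the exact collection source and transformations behind its training data.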

