Updated data_engineering.qmd
mpstewart1 committed Oct 10, 2023
1 parent fa31424 commit 105fe77
Showing 1 changed file with 1 addition and 142 deletions.
143 changes: 1 addition & 142 deletions data_engineering.qmd
@@ -383,151 +383,10 @@ In some cases, certain portions of a dataset may need to be removed or obscured

Data collectors and providers need to be able to de-identify or filter out any proprietary, licensed, confidential, or regulated information as needed. In some cases, users may explicitly request that their data be removed.

For instance, below is an example notice sent to people who downloaded the Common Voice dataset after account holders requested that their voice clips be deleted:

+-----------------------------------------------------------------------+
| Thank you for downloading the Common Voice dataset. Account holders |
| are free to request deletion of their voice clips at any time. We |
| action this on our side for all future releases and are legally |
| obligated to inform those who have downloaded a historic release so |
| that they can also take action. |
| |
| You are receiving this message because one or more account holders |
| have requested that their voice clips be deleted. Their clips are |
| part of the dataset that you downloaded and are associated with the |
| hashed IDs listed below. Please delete them from your downloads in |
| order to fulfill your third party data privacy obligations. |
| |
| Thank you for your timely completion. |
| |
| - 4497f1df0c6c4e647fa4354ad07a40075cc95a210dafce49ce0c35cd252 |
| e4ec0fad1034e0cc3af869499e6f60ce315fe600ee2e9188722de906f909a21e0ee57 |
| |
| - 97a8f0a1df086bd5f76343f5f4a511ae39ec98256a0ca48de5c54bc5771 |
| d8c8e32283a11056147624903e9a3ac93416524f19ce0f9789ce7eef2262785cf3af7 |
| |
| - 969ea94ac5e20bdd7a098747f5dc2f6d203f6b659c0c3b6257dc790dc34 |
| d27ac3f2fafb3910f1ec8d7ebea38c120d4b51688047e352baa957cc35f0f5c69b112 |
| |
| - 6b5460779f644ad39deffeab6edf939547f206596089d554984abff3d36 |
| a4ecc06e66870958e62299221c09af8cd82864c626708371d72297eaea5955d8e46a9 |
| |
| - 33275ff207a27708bd1187ff950888da592cac507e01e922c4b9a07d3f6 |
| c2c3fe2ade429958c3702294f446bfbad8c4ebfefebc9e157d358ccc6fcf5275e7564 |
+=======================================================================+
+-----------------------------------------------------------------------+
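Acting on such a notice amounts to filtering a local copy of the dataset by the listed hashed IDs. The following is a minimal sketch, not the official Common Voice tooling: it assumes the download includes a tab-separated manifest with `client_id` and `path` columns, and the manifest contents, the short IDs, and the `remove_deleted_clips` helper are made up for illustration.

```python
import csv
import io


def remove_deleted_clips(tsv_text, deleted_ids):
    """Split manifest rows into kept rows and paths of clips to delete.

    Assumes each row has a `client_id` (hashed speaker ID) and a `path`
    (audio file name); both column names are assumptions for this sketch.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept, removed = [], []
    for row in reader:
        if row["client_id"] in deleted_ids:
            # The audio file itself should also be deleted from disk.
            removed.append(row["path"])
        else:
            kept.append(row)
    return kept, removed


# Tiny made-up manifest; real hashed IDs are far longer.
manifest = (
    "client_id\tpath\tsentence\n"
    "aaa111\tclip_001.mp3\thello world\n"
    "bbb222\tclip_002.mp3\tgood morning\n"
    "aaa111\tclip_003.mp3\tthank you\n"
)
deleted_ids = {"aaa111"}  # hypothetical IDs copied from a deletion notice

kept_rows, removed_paths = remove_deleted_clips(manifest, deleted_ids)
print(len(kept_rows), removed_paths)
```

Keeping the deleted IDs in a set makes each lookup constant time, which matters when a notice lists many IDs against a large manifest.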

The ability to update a dataset by removing data enables dataset creators to uphold legal and ethical obligations around data usage and privacy. However, data removal has important limitations. Some models may already have been trained on the dataset, and there is no clear or known way to eliminate a particular data sample's effect from a trained network; there is no erase mechanism. This raises the question: should the model be retrained from scratch each time a sample is removed? That is a costly option. Once data has been used to train a model, simply removing it from the original dataset may not fully eliminate[^8]^,^[^9]^,^[^10] its impact on the model's behavior. New research is needed on the effects of data removal on already-trained models and on whether full retraining is necessary to avoid retaining artifacts of deleted data. This is an important consideration when balancing data licensing obligations against efficiency and practicality in an evolving, deployed ML system.
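The gap between deleting a sample from the dataset and deleting its influence from a model can be seen in a deliberately tiny illustration. This is a toy sketch under stated assumptions, not a real training pipeline: the "model" is simply the mean of the training values, and the `train` function and the numbers are invented for this example.

```python
def train(samples):
    """'Train' a toy model: the model is just the mean of the training values."""
    return sum(samples) / len(samples)


dataset = [1.0, 2.0, 3.0, 10.0]
deployed_model = train(dataset)      # learned from all four samples

# Honor a deletion request by removing the sample from the dataset only.
dataset.remove(10.0)

# The deployed model was not touched, so it still reflects the deleted sample.
stale_model = deployed_model

# Retraining from scratch on the updated dataset removes that influence.
retrained_model = train(dataset)

print(stale_model, retrained_model)
```

For a real neural network the retraining step is the expensive part, which is why the cost of honoring each deletion request this way is a concern.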

Dataset licensing is a multifaceted domain at the intersection of technology, ethics, and law. Understanding these intricacies is essential for anyone building datasets as part of data engineering.

## Conclusion

Data is the fundamental building block of AI systems. Without quality data, even the most advanced machine learning algorithms will fail. Data engineering encompasses the end-to-end process of collecting, storing, processing and managing data to fuel the development of machine learning models. It begins with clearly defining the core problem and objectives, which guides effective data collection. Data can be sourced from diverse means including existing datasets, web scraping, crowdsourcing and synthetic data generation. Each approach involves tradeoffs between factors like cost, speed, privacy and specificity. Once data is collected, thoughtful labeling through manual or AI-assisted annotation enables the creation of high-quality training datasets. Proper storage in databases, warehouses or lakes facilitates easy access and analysis. Metadata provides contextual details about the data. Data processing transforms raw data into a clean, consistent format ready for machine learning model development. Throughout this pipeline, transparency through documentation and provenance tracking is crucial for ethics, auditability and reproducibility. Data licensing protocols also govern legal data access and use. Key challenges in data engineering include privacy risks, representation gaps, legal restrictions around proprietary data, and the need to balance competing constraints like speed versus quality. By thoughtfully engineering high-quality training data, machine learning practitioners can develop accurate, robust and responsible AI systems, including for embedded and tinyML applications.


## Helpful References

1. [3 big problems with datasets in AI and machine learning](https://venturebeat.com/uncategorized/3-big-problems-with-datasets-in-ai-and-machine-learning/)

2. [Common Voice: A Massively-Multilingual Speech Corpus](https://arxiv.org/abs/1912.06670)

3. [Data Engineering for Everyone](https://arxiv.org/abs/2102.11447)

4. [DataPerf: Benchmarks for Data-Centric AI Development](https://arxiv.org/abs/2207.10062)

5. [Deep Spoken Keyword Spotting: An Overview](https://arxiv.org/abs/2111.10592)

6. ["Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI](https://research.google/pubs/pub49953/)

7. [Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program)](https://arxiv.org/abs/2003.12206)

8. [LabelMe](https://people.csail.mit.edu/torralba/publications/labelmeApplications.pdf)

9. [Model Cards for Model Reporting](https://arxiv.org/abs/1810.03993)

10. [Multilingual Spoken Words Corpus](https://openreview.net/pdf?id=c20jiJ5K2H)

11. [OpenImages](https://storage.googleapis.com/openimages/web/index.html)

12. [Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks](https://arxiv.org/abs/2103.14749)

13. [Small-footprint keyword spotting using deep neural networks](https://ieeexplore.ieee.org/abstract/document/6854370?casa_token=XD6SL8Um1Y0AAAAA:ZxqFThJWLlwDrl1IA374t_YzEvwHNNR-pTWiWV9pyr85rsl-ZZ5BpkElyHo91d3_l8yU0IVIgg)

14. [SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition](https://arxiv.org/abs/1904.08779)

[^1]: Janssen, Marijn, et al. \"Data governance: Organizing data for
trustworthy Artificial Intelligence.\" *Government Information
Quarterly* 37.3 (2020): 101493.

[^2]: Abdul, Zrar Kh, and Abdulbasit K. Al-Talabani. \"Mel Frequency
Cepstral Coefficient and its applications: A Review.\" *IEEE Access*
(2022).

[^3]: Vasuki, A., and P. T. Vanathi. \"A review of vector quantization
techniques.\" *IEEE Potentials* 25.4 (2006): 39-47.

[^4]: Rybakov, Oleg, et al. \"Streaming keyword spotting on mobile
devices.\" *arXiv preprint arXiv:2005.06720* (2020).

[^5]: <https://ieeexplore.ieee.org/document/9153560>

[^6]: Birhane, Abeba, and Vinay Uday Prabhu. \"Large image datasets: A
pyrrhic win for computer vision?.\" *2021 IEEE Winter Conference on
Applications of Computer Vision (WACV)*. IEEE, 2021.

[^7]: Sonnenburg, Sören, et al. \"The need for open source software in
machine learning.\" *Journal of Machine Learning Research* 8 (2007): 2443-2466.

[^8]: Ginart, Antonio, et al. \"Making AI forget you: Data deletion in
machine learning.\" *Advances in Neural Information Processing
Systems* 32 (2019).

[^9]: Sekhari, Ayush, et al. \"Remember what you want to forget:
Algorithms for machine unlearning.\" *Advances in Neural Information
Processing Systems* 34 (2021): 18075-18086.

[^10]: Guo, Chuan, et al. \"Certified data removal from machine learning
models.\" *arXiv preprint arXiv:1911.03030* (2019).

<!--
[^1]: @inproceedings{dwork2006differential,
title={Differential privacy},
author={Dwork, Cynthia},
booktitle={International colloquium on automata, languages, and programming},
pages={1--12},
year={2006},
organization={Springer}
}
[^2]: @inproceedings{buolamwini2018gender,
title={Gender shades: Intersectional accuracy disparities in commercial gender classification},
author={Buolamwini, Joy and Gebru, Timnit},
booktitle={Conference on fairness, accountability and transparency},
pages={77--91},
year={2018},
organization={PMLR}
}
[^3]: @article{donders2006gentle,
title={A gentle introduction to imputation of missing values},
author={Donders, A Rogier T and Van Der Heijden, Geert JMG and Stijnen, Theo and Moons, Karel GM},
journal={Journal of clinical epidemiology},
volume={59},
number={10},
pages={1087--1091},
year={2006},
publisher={Elsevier}
}
-->