Updated data_engineering.qmd

harvard-edge · Oct 10, 2023 · 105fe77
1 parent fa31424
Showing 1 changed file with 1 addition and 142 deletions.
diff --git a/data_engineering.qmd b/data_engineering.qmd
@@ -383,151 +383,10 @@ In some cases, certain portions of a dataset may need to be removed or obscured
 
 Data collectors and providers need to be able to take appropriate measures to de-identify or filter out any proprietary, licensed, confidential, or regulated information as needed. In some cases, the users may explicitly request that their data be removed.
 
-For instance, below is an example request from Common Voice users to remove their information:
-
-+-----------------------------------------------------------------------+
-| Thank you for downloading the Common Voice dataset. Account holders   |
-| are free to request deletion of their voice clips at any time. We     |
-| action this on our side for all future releases and are legally       |
-| obligated to inform those who have downloaded a historic release so   |
-| that they can also take action.                                       |
-|                                                                       |
-| You are receiving this message because one or more account holders    |
-| have requested that their voice clips be deleted. Their clips are     |
-| part of the dataset that you downloaded and are associated with the   |
-| hashed IDs listed below. Please delete them from your downloads in    |
-| order to fulfill your third party data privacy obligations.           |
-|                                                                       |
-| Thank you for your timely completion.                                 |
-|                                                                       |
-| -   4497f1df0c6c4e647fa4354ad07a40075cc95a210dafce49ce0c35cd252       |
-| e4ec0fad1034e0cc3af869499e6f60ce315fe600ee2e9188722de906f909a21e0ee57 |
-|                                                                       |
-| -   97a8f0a1df086bd5f76343f5f4a511ae39ec98256a0ca48de5c54bc5771       |
-| d8c8e32283a11056147624903e9a3ac93416524f19ce0f9789ce7eef2262785cf3af7 |
-|                                                                       |
-| -   969ea94ac5e20bdd7a098747f5dc2f6d203f6b659c0c3b6257dc790dc34       |
-| d27ac3f2fafb3910f1ec8d7ebea38c120d4b51688047e352baa957cc35f0f5c69b112 |
-|                                                                       |
-| -   6b5460779f644ad39deffeab6edf939547f206596089d554984abff3d36       |
-| a4ecc06e66870958e62299221c09af8cd82864c626708371d72297eaea5955d8e46a9 |
-|                                                                       |
-| -   33275ff207a27708bd1187ff950888da592cac507e01e922c4b9a07d3f6       |
-| c2c3fe2ade429958c3702294f446bfbad8c4ebfefebc9e157d358ccc6fcf5275e7564 |
-+=======================================================================+
-+-----------------------------------------------------------------------+
-
 Being able to remove data from a dataset lets its creators uphold legal and ethical obligations around data usage and privacy. However, removal has important limitations. Some models may already have been trained on the dataset, and there is no clear, general way to eliminate a particular data sample's effect from a trained network. This raises the question of whether the model should be retrained from scratch each time a sample is removed, which is a costly option. Once data has been used to train a model, simply removing it from the original dataset may not fully eliminate[^8]^,^[^9]^,^[^10] its impact on the model's behavior. New research is needed on the effects of data removal on already-trained models and on whether full retraining is necessary to avoid retaining artifacts of deleted data. This is an important consideration when balancing data licensing obligations with efficiency and practicality in an evolving, deployed ML system.
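
As a rough illustration of the dataset-side half of this problem, the sketch below filters a metadata table against a list of hashed contributor IDs from a deletion request. It is a minimal sketch under stated assumptions, not a description of any provider's actual tooling: the column names (`client_id`, `path`, `sentence`), the tab-separated layout, the shortened IDs, and the helper `drop_deleted_clips` are all hypothetical.

```python
import csv
import io


def drop_deleted_clips(rows, deleted_ids, id_field="client_id"):
    """Split rows into (kept, removed) based on a set of hashed contributor IDs."""
    # Normalize IDs so stray whitespace or letter case does not hide a match.
    deleted = {d.strip().lower() for d in deleted_ids}
    kept, removed = [], []
    for row in rows:
        if row[id_field].strip().lower() in deleted:
            removed.append(row)
        else:
            kept.append(row)
    return kept, removed


# Hypothetical metadata in tab-separated form; the IDs are shortened placeholders.
SAMPLE_TSV = (
    "client_id\tpath\tsentence\n"
    "aaa111\tclip_0001.mp3\thello world\n"
    "bbb222\tclip_0002.mp3\tgood morning\n"
    "aaa111\tclip_0003.mp3\tsee you soon\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
kept, removed = drop_deleted_clips(rows, deleted_ids=["AAA111 "])

print("kept:", [r["path"] for r in kept])
print("removed:", [r["path"] for r in removed])
```

The `removed` list identifies which audio files would also need to be deleted from local storage. Note that this only updates the dataset itself; any model already trained on the removed clips is unaffected by the filter, which is exactly the limitation described above.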
 
 Dataset licensing is a multifaceted domain intersecting technology, ethics, and law. As the world around us evolves, understanding these intricacies becomes paramount for anyone building datasets during data engineering.
 
 ## Conclusion
 
-Data is the fundamental building block of AI systems. Without quality data, even the most advanced machine learning algorithms will fail. Data engineering encompasses the end-to-end process of collecting, storing, processing and managing data to fuel the development of machine learning models. It begins with clearly defining the core problem and objectives, which guides effective data collection. Data can be sourced from diverse means including existing datasets, web scraping, crowdsourcing and synthetic data generation. Each approach involves tradeoffs between factors like cost, speed, privacy and specificity. Once data is collected, thoughtful labeling through manual or AI-assisted annotation enables the creation of high-quality training datasets. Proper storage in databases, warehouses or lakes facilitates easy access and analysis. Metadata provides contextual details about the data. Data processing transforms raw data into a clean, consistent format ready for machine learning model development. Throughout this pipeline, transparency through documentation and provenance tracking is crucial for ethics, auditability and reproducibility. Data licensing protocols also govern legal data access and use. Key challenges in data engineering include privacy risks, representation gaps, legal restrictions around proprietary data, and the need to balance competing constraints like speed versus quality. By thoughtfully engineering high-quality training data, machine learning practitioners can develop accurate, robust and responsible AI systems, including for embedded and tinyML applications.
-
-
-## Helpful References
-1\. \[3 big problems with datasets in AI and machine
-learning\](https://venturebeat.com/uncategorized/3-big-problems-with-datasets-in-ai-and-machine-learning/)
-
-2\. \[Common Voice: A Massively-Multilingual Speech
-Corpus\](https://arxiv.org/abs/1912.06670)
-
-3\. \[Data Engineering for Everyone\](https://arxiv.org/abs/2102.11447)
-
-4\. \[DataPerf: Benchmarks for Data-Centric AI
-Development\](https://arxiv.org/abs/2207.10062)
-
-5\. \[Deep Spoken Keyword Spotting: An
-Overview\](https://arxiv.org/abs/2111.10592)
-
-6\. \["Everyone wants to do the model work, not the data work": Data
-Cascades in High-Stakes AI\](https://research.google/pubs/pub49953/)
-
-7\. \[Improving Reproducibility in Machine Learning Research (A Report
-from the NeurIPS 2019 Reproducibility
-Program)\](https://arxiv.org/abs/2003.12206)
-
-8\.
-\[LabelMe\](https://people.csail.mit.edu/torralba/publications/labelmeApplications.pdf)
-
-9\. \[Model Cards for Model
-Reporting\](https://arxiv.org/abs/1810.03993)
-
-10\. \[Multilingual Spoken Words
-Corpus\](https://openreview.net/pdf?id=c20jiJ5K2H)
-
-11\.
-\[OpenImages\](https://storage.googleapis.com/openimages/web/index.html)
-
-12\. \[Pervasive Label Errors in Test Sets Destabilize Machine Learning
-Benchmarks\](https://arxiv.org/abs/2103.14749)
-
-13\. \[Small-footprint keyword spotting using deep neural
-networks\](https://ieeexplore.ieee.org/abstract/document/6854370?casa_token=XD6SL8Um1Y0AAAAA:ZxqFThJWLlwDrl1IA374t_YzEvwHNNR-pTWiWV9pyr85rsl-ZZ5BpkElyHo91d3_l8yU0IVIgg)
-
-14\. \[SpecAugment: A Simple Data Augmentation Method for Automatic
-Speech Recognition\](https://arxiv.org/abs/1904.08779)
-
-[^1]: Janssen, Marijn, et al. \"Data governance: Organizing data for
-    trustworthy Artificial Intelligence.\" *Government Information
-    Quarterly* 37.3 (2020): 101493.
-
-[^2]: Abdul, Zrar Kh, and Abdulbasit K. Al-Talabani. \"Mel Frequency
-    Cepstral Coefficient and its applications: A Review.\" *IEEE Access*
-    (2022).
-
-[^3]: Vasuki, A., and P. T. Vanathi. \"A review of vector quantization
-    techniques.\" *IEEE Potentials* 25.4 (2006): 39-47.
-
-[^4]: Rybakov, Oleg, et al. \"Streaming keyword spotting on mobile
-    devices.\" *arXiv preprint arXiv:2005.06720* (2020).
-
-[^5]: <https://ieeexplore.ieee.org/document/9153560>
-
-[^6]: Birhane, Abeba, and Vinay Uday Prabhu. \"Large image datasets: A
-    pyrrhic win for computer vision?.\" *2021 IEEE Winter Conference on
-    Applications of Computer Vision (WACV)*. IEEE, 2021.
-
-[^7]: Sonnenburg, Soren, et al. \"The need for open source software in
-    machine learning.\" (2007): 2443-2466.
-
-[^8]: Ginart, Antonio, et al. \"Making ai forget you: Data deletion in
-    machine learning.\" *Advances in neural information processing
-    systems* 32 (2019).
-
-[^9]: Sekhari, Ayush, et al. \"Remember what you want to forget:
-    Algorithms for machine unlearning.\" *Advances in Neural Information
-    Processing Systems* 34 (2021): 18075-18086.
-
-[^10]: Guo, Chuan, et al. \"Certified data removal from machine learning
-    models.\" *arXiv preprint arXiv:1911.03030* (2019).
-
-<!--
-[^1]: @inproceedings{dwork2006differential,
-      title={Differential privacy},
-      author={Dwork, Cynthia},
-      booktitle={International colloquium on automata, languages, and programming},
-      pages={1--12},
-      year={2006},
-      organization={Springer}
-}
-[^2]: @inproceedings{buolamwini2018gender,
-title={Gender shades: Intersectional accuracy disparities in commercial gender classification},
-author={Buolamwini, Joy and Gebru, Timnit},
-booktitle={Conference on fairness, accountability and transparency},
-pages={77--91},
-year={2018},
-organization={PMLR}
-}
-[^3]: @article{donders2006gentle,
-title={A gentle introduction to imputation of missing values},
-author={Donders, A Rogier T and Van Der Heijden, Geert JMG and Stijnen, Theo and Moons, Karel GM},
-journal={Journal of clinical epidemiology},
-volume={59},
-number={10},
-pages={1087--1091},
-year={2006},
-publisher={Elsevier}
-}
--->
+Data is the fundamental building block of AI systems. Without quality data, even the most advanced machine learning algorithms will fail. Data engineering encompasses the end-to-end process of collecting, storing, processing, and managing data to fuel the development of machine learning models. It begins with clearly defining the core problem and objectives, which guides effective data collection. Data can be sourced in several ways, including existing datasets, web scraping, crowdsourcing, and synthetic data generation. Each approach involves tradeoffs among factors like cost, speed, privacy, and specificity. Once data is collected, thoughtful labeling through manual or AI-assisted annotation enables the creation of high-quality training datasets. Proper storage in databases, warehouses, or lakes facilitates easy access and analysis. Metadata provides contextual details about the data. Data processing transforms raw data into a clean, consistent format ready for machine learning model development. Throughout this pipeline, transparency through documentation and provenance tracking is crucial for ethics, auditability, and reproducibility. Data licensing protocols also govern legal data access and use. Key challenges in data engineering include privacy risks, representation gaps, legal restrictions around proprietary data, and the need to balance competing constraints like speed versus quality. By thoughtfully engineering high-quality training data, machine learning practitioners can develop accurate, robust, and responsible AI systems, including for embedded and tinyML applications.