Commit

add copyrights attribution to image description
eliasab16 committed Nov 25, 2023
1 parent a627102 commit 5821050
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion data_engineering.qmd
@@ -370,7 +370,7 @@ As described above, creators may consider crowdsourcing or synthetically generat

By providing clear, detailed documentation, creators can help developers understand how best to use their datasets. Several groups have suggested standardized documentation formats for datasets, such as Data Cards (@Pushkarna_Zaldivar_Kjartansson_2022), datasheets (@Gebru_Morgenstern_Vecchione_Vaughan_Wallach_III_Crawford_2021), data statements (@Bender_Friedman_2018), or Data Nutrition Labels (@Holland_Hosny_Newman_Joseph_Chmielinski_2020). When releasing a dataset, creators may describe what kinds of data they collected, how they collected and labeled it, and what kinds of use cases may be a good or poor fit for the dataset. Quantitatively, it may be appropriate to provide a breakdown of how well the dataset represents different groups (e.g. different gender groups, different cameras).

-![This is an example of a data card for a computer vision dataset. It includes some basic information about the dataset and instructions on how to use or not to use the dataset, including known biases.](images/data_engineering/data_card.png)
+![This is an example of a data card for a computer vision dataset. It includes some basic information about the dataset and instructions on how to use or not to use the dataset, including known biases. Copyrights: (@Pushkarna_Zaldivar_Kjartansson_2022)](images/data_engineering/data_card.png)

Keeping track of data provenance—essentially the origins and the journey of each data point through the data pipeline—is not merely a good practice but an essential requirement for data quality. Data provenance contributes significantly to the transparency of machine learning systems. Transparent systems make it easier to scrutinize data points, enabling better identification and rectification of errors, biases, or inconsistencies. For instance, if a ML model trained on medical data is underperforming in particular areas, tracing back the data provenance can help identify whether the issue is with the data collection methods, the demographic groups represented in the data, or other factors. This level of transparency doesn’t just help in debugging the system but also plays a crucial role in enhancing the overall data quality. By improving the reliability and credibility of the dataset, data provenance also enhances the model’s performance and its acceptability among end-users.


