Automate parts of metadata update comparison #1564

Open
melange396 opened this issue Dec 5, 2024 · 0 comments
Labels
code health, data quality, devops, documentation, enhancement

Comments

@melange396
Collaborator

The CSV files (derived from a Google spreadsheet) that hold important semantic metadata about our signals and sources are getting quite large. Their content has also been edited a lot recently, due to work on the signal documentation app. Together, that means diffs are likely to be bigger and more frequent when updates happen. To make such changes easier to review, create automated summaries of:

  • any added or removed columns (by name)
  • counts of rows added, removed, modified, and unchanged (ignoring new or deleted columns, where applicable)

Row comparisons should be keyed by source+signal rather than by row number/position, so that the results are not affected by any row reordering.
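As a starting point, the keyed comparison described above could look something like the sketch below. It is only an illustration under several assumptions: the key column names `source` and `signal` are placeholders (the real CSV headers may differ), the function names are made up for this example, the row counts are interpreted as added, removed, modified, and unchanged, and duplicate keys are resolved by letting the last row win.

```python
import csv
import io


def read_keyed_rows(csv_text, key_columns):
    """Parse CSV text into (column names, {key tuple: row dict}).

    If two rows share the same key, the later row silently replaces the
    earlier one; a real implementation may want to report that instead.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = list(reader.fieldnames or [])
    rows = {}
    for row in reader:
        key = tuple(row[col] for col in key_columns)
        rows[key] = row
    return columns, rows


def summarize_csv_diff(old_text, new_text, key_columns=("source", "signal")):
    """Summarize column and row changes between two versions of a CSV file.

    `key_columns` holds hypothetical header names; adjust them to the
    actual headers of the metadata files.
    """
    old_cols, old_rows = read_keyed_rows(old_text, key_columns)
    new_cols, new_rows = read_keyed_rows(new_text, key_columns)

    # Only columns present in both versions take part in the row comparison,
    # so an added or removed column does not mark every row as modified.
    shared_cols = [c for c in old_cols if c in new_cols]

    # Rows are matched by key, not by position, so reordering has no effect.
    shared_keys = old_rows.keys() & new_rows.keys()
    modified = sum(
        1
        for key in shared_keys
        if any(old_rows[key][c] != new_rows[key][c] for c in shared_cols)
    )

    return {
        "added_columns": [c for c in new_cols if c not in old_cols],
        "removed_columns": [c for c in old_cols if c not in new_cols],
        "rows_added": len(new_rows.keys() - old_rows.keys()),
        "rows_removed": len(old_rows.keys() - new_rows.keys()),
        "rows_modified": modified,
        "rows_unchanged": len(shared_keys) - modified,
    }


if __name__ == "__main__":
    # Made-up sample data: one column added, rows reordered, one row edited,
    # one row removed, and one row added.
    old = "source,signal,name\nsrc_a,sig_1,One\nsrc_a,sig_2,Two\nsrc_b,sig_1,Three\n"
    new = (
        "source,signal,name,units\n"
        "src_b,sig_1,Three,count\n"
        "src_a,sig_1,Uno,count\n"
        "src_c,sig_9,New,count\n"
    )
    print(summarize_csv_diff(old, new))
```

This version uses only the standard library to keep it easy to drop into an existing script; a pandas-based version would work just as well if pandas is already available in the environment that runs the generation step.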

The CSV files can be found at:

Their generation is kicked off by a GH action and performed by code in https://github.com/cmu-delphi/delphi-epidata/blob/dev/tasks.py.

Ideally, the summary text should be added to the body of the PR produced by the GH action.
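One possible way to produce that summary text is sketched below. It assumes the comparison step yields a plain dict with the hypothetical keys `added_columns`, `removed_columns`, `rows_added`, `rows_removed`, `rows_modified`, and `rows_unchanged`; the function name and the file name `signals.csv` are placeholders too. How the resulting text actually reaches the PR body depends on how the GH action creates the PR, which is not covered here.

```python
def format_summary(file_name, summary):
    """Render a diff summary dict as Markdown text for a PR description.

    `summary` is assumed to contain the hypothetical keys used below; they
    are not taken from any existing code in the repository.
    """

    def column_list(names):
        # Show column names as inline code, or "none" for an empty list.
        return ", ".join(f"`{name}`" for name in names) if names else "none"

    lines = [
        f"### Changes in `{file_name}`",
        "",
        f"- Added columns: {column_list(summary['added_columns'])}",
        f"- Removed columns: {column_list(summary['removed_columns'])}",
        f"- Rows added: {summary['rows_added']}",
        f"- Rows removed: {summary['rows_removed']}",
        f"- Rows modified: {summary['rows_modified']}",
        f"- Rows unchanged: {summary['rows_unchanged']}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    example = {
        "added_columns": ["units"],
        "removed_columns": [],
        "rows_added": 1,
        "rows_removed": 1,
        "rows_modified": 1,
        "rows_unchanged": 1,
    }
    print(format_summary("signals.csv", example))
```

Printing the text to stdout keeps the sketch independent of any particular workflow; the action could capture that output or write it to a file that the PR-creation step reads.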

There is some very rudimentary comparison code that might be useful as a starting point in #1546 (comment).

@melange396 added the enhancement, documentation, code health, devops, and data quality labels on Dec 5, 2024