feat: update mlflow-related metadata models #12174

Merged
merged 22 commits into from
Dec 24, 2024

Conversation

yoonhyejin
Collaborator

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@yoonhyejin yoonhyejin changed the title update models feat: update mlflow-related metadata models Dec 19, 2024
Collaborator

@jjoyce0510 jjoyce0510 left a comment

We should add a name field to mlModelGroup

Collaborator

@jjoyce0510 jjoyce0510 left a comment

ML Model Mapper + ML Model Group Mapper need to be updated to map the field "name"

Collaborator

@jjoyce0510 jjoyce0510 left a comment

Biggest modeling thing tripping me up:

Data Process Instance + Data Job Entities already have an "outputs" lineage model. Simply extending them to include ML Model + ML Model Group is sufficient to render the lineage downstream (for DataJobs) and to show outputs of a Data Process Instance.

By starting to use the new training jobs field more extensively, we are introducing a large inconsistency in the model: a divergence between how lineage is specified for Datasets and for AI Models that is sure to confuse our end users going forward.

I'd personally prefer to consider deprecating the trainingJobs field altogether and moving towards a model where the DataJob specifies its outputs (as it already does with Datasets).

In the interim, we can support an either/or situation, at least between Data Job and ML Model / ML Model Group, because that convention has traditionally already existed. But for Data Process Instance, I'd like to avoid introducing a new convention. Otherwise, we'll be forced to update the rendering of a Data Process Instance's "outputs" to fetch multiple different relationship types.
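
To illustrate the concern, here is a minimal Python sketch. It uses hypothetical stand-in classes and made-up URN strings, not DataHub's actual aspects, SDK, or GraphQL resolvers. It only shows why a second convention forces the "outputs" lookup to consult more than one relationship type.

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Hypothetical, simplified stand-ins for the metadata discussed above.
# These are NOT DataHub's real classes or field layouts.
@dataclass
class ProcessInstanceOutput:
    outputs: List[str] = field(default_factory=list)  # URNs the run produced


@dataclass
class MLModelProperties:
    training_jobs: List[str] = field(default_factory=list)  # URNs of runs that trained it


def outputs_single_convention(
    run_urn: str, run_outputs: Dict[str, ProcessInstanceOutput]
) -> List[str]:
    """Run-declared outputs: one lookup covers datasets and models alike."""
    return sorted(run_outputs[run_urn].outputs)


def outputs_mixed_convention(
    run_urn: str,
    run_outputs: Dict[str, ProcessInstanceOutput],
    model_props: Dict[str, MLModelProperties],
) -> List[str]:
    """Mixed conventions: also scan each model's training_jobs back-reference."""
    declared = set(run_outputs.get(run_urn, ProcessInstanceOutput()).outputs)
    back_refs = {
        model_urn
        for model_urn, props in model_props.items()
        if run_urn in props.training_jobs
    }
    return sorted(declared | back_refs)


# Illustrative identifiers only.
RUN = "urn:li:dataProcessInstance:run-1"
DATASET = "urn:li:dataset:features"
MODEL = "urn:li:mlModel:churn"

# Single convention: the run lists both the dataset and the model as outputs.
single = outputs_single_convention(
    RUN, {RUN: ProcessInstanceOutput([DATASET, MODEL])}
)

# Mixed convention: the dataset is a declared output, but the model points
# back at the run through its training_jobs field.
mixed = outputs_mixed_convention(
    RUN,
    {RUN: ProcessInstanceOutput([DATASET])},
    {MODEL: MLModelProperties([RUN])},
)
```

Both calls yield the same two URNs, but the mixed version has to merge two relationship types to get there, which is the extra rendering work described above.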

[Screenshot: 2024-12-19, 5:02:40 PM]

jjoyce0510 approved these changes Dec 23, 2024
@yoonhyejin yoonhyejin merged commit 047644b into datahub-project:master Dec 24, 2024
80 of 81 checks passed
2 participants