feat: update mlflow-related metadata models #12174
metadata-models/src/main/pegasus/com/linkedin/dataprocess/DataProcessInstanceOutput.pdl
Outdated
metadata-ingestion/tests/unit/stateful_ingestion/state/test_stateful_ingestion.py
Outdated
We should add a name field to mlModelGroup
ML Model Mapper + ML Model Group Mapper need to be updated to map the field "name"
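As a rough illustration of what "mapping the field" means here, the sketch below copies an optional `name` from a stored properties record into the response shape. It is written in Python purely for brevity and is not the project's mapper code; `GroupProperties` and `map_group_properties` are hypothetical names, and the assumption is that `name` is optional so older records without it still map cleanly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroupProperties:
    """Hypothetical stand-in for the stored ML Model Group properties."""
    name: Optional[str] = None
    description: Optional[str] = None


def map_group_properties(props: GroupProperties) -> dict:
    """Copy fields into the response shape, including the new `name` field.

    `name` is passed through as-is; records written before the field
    existed simply yield None (an assumption made for this sketch).
    """
    return {
        "name": props.name,
        "description": props.description,
    }


with_name = map_group_properties(GroupProperties(name="churn-models"))
without_name = map_group_properties(GroupProperties(description="legacy group"))
```

The same pass-through would apply to the ML Model mapper.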
metadata-models/src/main/pegasus/com/linkedin/dataprocess/DataProcessInstanceProperties.pdl
Outdated
metadata-models/src/main/pegasus/com/linkedin/dataprocess/DataProcessInstanceInput.pdl
Outdated
metadata-models/src/main/pegasus/com/linkedin/ml/metadata/MLModelProperties.pdl
Outdated
metadata-models/src/main/pegasus/com/linkedin/ml/metadata/MLModelProperties.pdl
Biggest modeling thing tripping me up:
Data Process Instance + Data Job entities already have an "outputs" lineage model. Simply extending them to include ML Model + ML Model Group is sufficient to render the lineage downstream (for DataJobs) and to show the outputs of a Data Process Instance.
By starting to use the new training jobs field more extensively, we are introducing a large inconsistency in the model: a divergence between how lineage is specified for Datasets and for AI Models that is sure to confuse our end users going forward.
I'd personally prefer to consider deprecating the trainingJobs field altogether and moving towards a model where the DataJob specifies its outputs (as it already does with Datasets).
In the interim, we can support an either/or situation, at least between Data Job + ML Model / ML Model Group, because that convention already exists. But for Data Process Instance, I'd like to avoid introducing a new convention. Otherwise, we'll be forced to update the "outputs" rendering for Data Process Instance to fetch multiple different relationship types.
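To make the rendering concern concrete, here is a minimal Python sketch over a toy edge list. Everything in it is an assumption for illustration: the `Edge` type, the relationship labels `Produces` and `TrainedBy`, and the simplified URNs are placeholders, not the actual graph service API or stored values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str
    relationship: str
    target: str


# Hypothetical relationship labels, used only for this sketch.
PRODUCES = "Produces"      # run -> output, declared on the run's outputs
TRAINED_BY = "TrainedBy"   # model -> run, declared on the model's trainingJobs


def outputs_single_convention(edges, run_urn):
    """One outgoing relationship type is enough when the run declares all outputs."""
    return sorted(
        e.target for e in edges
        if e.source == run_urn and e.relationship == PRODUCES
    )


def outputs_mixed_convention(edges, run_urn):
    """With trainingJobs in play, the renderer must also follow incoming edges."""
    declared = {
        e.target for e in edges
        if e.source == run_urn and e.relationship == PRODUCES
    }
    trained = {
        e.source for e in edges
        if e.target == run_urn and e.relationship == TRAINED_BY
    }
    return sorted(declared | trained)


# Simplified, made-up URNs.
RUN = "urn:li:dataProcessInstance:run-1"
DATASET = "urn:li:dataset:features"
MODEL = "urn:li:mlModel:churn"

# Convention A: the run lists both the dataset and the model as outputs.
unified = [Edge(RUN, PRODUCES, DATASET), Edge(RUN, PRODUCES, MODEL)]
# Convention B: the model points back at the run through trainingJobs.
mixed = [Edge(RUN, PRODUCES, DATASET), Edge(MODEL, TRAINED_BY, RUN)]
```

Under convention B, `outputs_single_convention(mixed, RUN)` misses the model entirely, so every consumer of "outputs" has to adopt the second lookup.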
Checklist