-
Notifications
You must be signed in to change notification settings - Fork 552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce responsibilities of DataModel #1098
Conversation
b865bed
to
c46ffb1
Compare
This is the start of fixing #1088 first, remove responsibilities of DataModel from labeler and core second, simplify how DataModel works internally third, prep for making it easier to make next changes backwards compatible by making the pickling shcema better versioned in api.py @fgregg there's some definitely good stuff in here (the first and second stuff), but I'm not sure if the third step is gonna be wasted effort. I thought I'd get a review from you at this point before I sink more time into this just o make sure I'm headed in the right direction. I can split this PR into three separete PRs too if you want. Thanks for the review! |
c46ffb1
to
3da6afe
Compare
It's not needed in there, so should just pass what is needed. This is prep for further refactoring of removing datamodel more
We don't use it after the initial construction, so don't store it
We only need the abstract requirement of a Featurizer function.
It is copied in DedupeDisagreementLearner, so I'm assuming that's a good idea. There's surely no downside that I can see, so might as well DEFINITELY avoid a foot gun
It isn't used publicly anywhere, and it adds more API surface area to worry about. This is still backwards compatible, as self._len is still saved/restored byt pickle the same way, now we just access it directly instead of going via __len__().
We iterate through the input more than once, so an Iterable is insufficient.
Before, we never checked for this, so when we called list.index() we just got back the first instance. I swapped us over to a lookup table because that makes more sense, but I wanted to also add this check because if someone had duplicate names, this would quitely change the behavior underneath someone: before it gave the first index, now it gives the last.
These variables 1. you can't currently create them because they inherit from Variable, not FieldVariable, so they don't appear in the datamodel.VARIABLE_CLASSES lookup table 2. You SHOULDN'T be able to instantiate these variables directly from a variable definition
Before the flow was: 1. create all the primary variables EXCEPT for the interactions 2. Expand those. 3. go back through and NOW create the InteractionVariables I found this confusing. We parse the var definitions twice in separate places, and InteractionVariables are a special case in the first pass. The other point of this is to make it so that there is one list of Variable instances which is the single source of truth. Once we do this then we can turn all the other private instance variables of DataModel into @functools.cached_property's, which will make it much easier to ensure pickle compatibility in the future, because only one variable needs to get saved and restored
This will make the init method cleaner and allow us to more easily add versioning. _load_settings will read the settings file, and for future versions of the class, will parse and convert the loaded data structures to the canonical in-memory representation that the class expects. See following commits
Deduplicate the one exception we make. Also now instead of just eating the catchall exception, we raise from it so that you have slightly more insight into its cause
3da6afe
to
83a2e89
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally, these all look like good code cleanups. I think it would be good to split into three PRs!
|
||
only_custom = all(isinstance(v, (CustomType, InteractionType)) for v in variables) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
variable name should change. maybe no_blocking_variables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Responded in followup pr #1102
return variables | ||
|
||
|
||
def _expand_higher_variables(variables: Iterable[Variable]) -> list[Variable]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could probably do away with this method if we had the base variable type have a higher_vars attribute set to an empty list or tuple
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no that won't quite work. it still feels like some of this should maybe be the responsible for the variable class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It definitely feels gross, there needs to be some more official API. Trying not to deal with that here.
No description provided.