-
-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor Expr
class
#158
base: main
Are you sure you want to change the base?
Refactor Expr
class
#158
Conversation
# Conflicts: # dask_expr/expr.py # dask_expr/io/parquet.py # dask_expr/reductions.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments
dask_expr/expr.py
Outdated
@@ -33,8 +19,7 @@ | |||
class Expr: | |||
"""Primary class for all Expressions | |||
|
|||
This mostly includes Dask protocols and various Pandas-like method | |||
definitions to make us look more like a DataFrame. | |||
This mostly includes Dask protocols. | |||
""" | |||
|
|||
commutative = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to this PR, but we can remove commutativve / associative
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are left over from the matchpy days
|
||
@property | ||
def npartitions(self): | ||
raise NotImplementedError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe npartitions should move over? Not a huge deal I guess if collections like scalar or delayed return 1 though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked this with the dask/dask implementation and for stuff like bag this is given through init, that's why I ordered it like this
|
||
def _task(self, index: int): | ||
assert index == 0 | ||
return self.value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe Literal stays? I wonder if we can remove divisions/meta and still have things work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can take a look as a follow-up if ok?
dask_expr/frameexpr.py
Outdated
@@ -0,0 +1,1375 @@ | |||
from __future__ import annotations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm not a fan of the name frameexpr.py
. Maybe frame.py
, maybe dataframe/expr.py
? Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am ok with frame.py and Frame(Expr)
dask_expr/frameexpr.py
Outdated
""" | ||
|
||
def __hash__(self): | ||
return hash(self._name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious about your thoughts on where hash
should live. Is this because _name
may not be on all expressions? (maybe it should?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very weird, we have a lot of tests failing if I add this only to Expr, but it was hard to investigate with all the other changes as well, that's why I put it here as well for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might need to move over __eq__
as well. Python has weird rules around hash and eq together.
def __getattr__(self, key): | ||
try: | ||
return object.__getattribute__(self, key) | ||
except AttributeError as err: | ||
# Allow operands to be accessed as attributes | ||
# as long as the keys are not already reserved | ||
# by existing methods/properties | ||
_parameters = type(self)._parameters | ||
if key in _parameters: | ||
idx = _parameters.index(key) | ||
return self.operands[idx] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably we want this top bit in Expr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh good point, moved up
dask_expr/frameexpr.py
Outdated
def _simplify_down(self): | ||
return | ||
|
||
def _simplify_up(self, parent): | ||
return | ||
|
||
def optimize(self, **kwargs): | ||
return optimize(self, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should maybe be in Expr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see, optimize is dataframe specific. The simplify methods then at least should probably be moved up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The simplify steps live in both actually, missed deleting them here
dask_expr/frameexpr.py
Outdated
@normalize_token.register(FrameExpr) | ||
def normalize_expression(expr): | ||
return expr._name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be in expr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved
dask_expr/frameexpr.py
Outdated
no_default = "__no_default__" | ||
|
||
|
||
class FrameExpr(Expr): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe?
class Frame(Expr)
Or is that weird for scalars? (maybe scalars shouldn't be Frames?)
In general I dislike.
class FooBar(Bar)
It always feels needlessly wordy to me. This is subjective though and I'm happy to be overruled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed
I'm curious if we can make Scalar inherit from Expr rather than FrameExpr. This will likely stress things like Maybe that's follow-up work though if we want to keep things low-friction. |
OK, so let's imagine that we make it so that scalars don't inherit from FrameExpr, but just Expr. And then we want to add a scalar to a series: scalar + series First, we check class Add(Expr):
...
class Add(FrameExpr, expr.Add):
... This is starting to get a litlte weird. Maybe it's ok though. Part of me thinks "we should just use one I bring this up because it's possible that while this path seems generally pretty good now, that it leads us down some odd multi-class hell. This is some small motivation to not merge this yet until we've looked a bit farther down this path. (not pushing for that hard though). @rjzamora when you wake up you might want to take a look at this PR. |
This is a first suggestion. There is still stuff that needs to be done as follow up (e.g. adapting new_collection for example), but this will be a pain to synch, so want to keep scope as small as possible