Make meta calculation for merge more efficient #284
Open
phofl wants to merge 7 commits into dask:main from phofl:merge_meta
Commits:
782645a  Merge projection selects too many columns (phofl)
7a7802d  Make meta calculcation for merge more efficient (phofl)
080735e  Merge branch 'merge_projection' into merge_meta (phofl)
a1d7e07  Update (phofl)
ce15534  Add custom constructor (phofl)
aabc2ef  Fix commit issue (phofl)
0bc4143  Merge remote-tracking branch 'upstream/main' into merge_meta (phofl)
Conversation:
I think I would strongly prefer if we use a global key-value cache in the same way we cache dataset info for parquet. In fact, we should probably formalize this caching approach to avoid repeating the same kind of logic in multiple places.

It seems like a unique meta depends on a token like...

If `self.left._meta` or `self.right._meta` were to change (due to column projection), we would need to recalculate `meta` anyway. However, if the `Merge` object was responsible for pushing down the column projection, we could always update the cache within the simplify logic (since we would already know how the meta needs to change).
---
I don't see a global need for this yet. The slowdown in merge goes back to the nonempty meta objects, not the actual computation on empty objects.

Some of the operations in Lower have side effects, which makes adjusting the meta objects of left and right bothersome and complicated.

I am open to adjusting the implementation if we run into this in more places, but as long as we need it only for merge I'd prefer this solution, since it keeps the complexity contained here.
---
FWIW, I am also not too big a fan of relying on meta in hashes; there are too many things in pandas that might mutate it unexpectedly, which would break this.
---
To clarify, I don't really mind if we implement stand-alone caching logic in `_merge.py` for now. The thing I'm unsure about in this PR is that we are overriding `__init__` so that we can effectively cache `_meta` without adding `_meta` as an official operand.

It may be the case that this is exactly how we should be attacking this problem in `Merge` (and maybe everywhere). For example, maybe we will eventually have a special `known_meta=` kwarg in `Expr`, which all expression objects could leverage. However, since it is not a proper operand, this mechanism feels a bit confusing and fragile to me.

> I don't think I understand your point here. Either way we are effectively caching the output of `_meta`, no?

I don't see how this is any different for `_precomputed_meta`? In any case where you are confident defining `_precomputed_meta`, you could also just add the "lowered" object to the global cache before returning it.
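As an illustrative sketch of the mechanism under discussion (the `Expr` stub stands in for the dask-expr expression base class; this is an assumed rendering of the pattern, not the PR's actual code):

```python
class Expr:
    # Minimal stand-in for the dask-expr expression base class.
    def __init__(self, *operands):
        self.operands = list(operands)


class Merge(Expr):
    def __init__(self, *operands, _precomputed_meta=None):
        super().__init__(*operands)
        # Deliberately NOT an operand: invisible to tokenization and to
        # substitute_parameters. That is both the point (it does not
        # change the expression's identity) and the fragility noted
        # above (any rebuild of the expression silently drops it).
        self._precomputed_meta = _precomputed_meta

    @property
    def _meta(self):
        if self._precomputed_meta is not None:
            return self._precomputed_meta
        raise NotImplementedError("expensive meta computation goes here")
```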
---
Interesting. I'd say that should be a fundamental concern for dask-expr then. What would be the most reliable way to hash the schema?
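One possible direction (an assumption, not a settled answer): tokenize only the schema-level pieces of the meta rather than the object itself, so value-level pandas quirks cannot leak into the hash:

```python
import pandas as pd
from dask.base import tokenize


def schema_token(meta: pd.DataFrame) -> str:
    # Hash only what defines the schema: column names, column dtypes,
    # and the index dtype; never the (empty) data itself.
    return tokenize(
        tuple(meta.columns),
        tuple(str(dt) for dt in meta.dtypes),
        str(meta.index.dtype),
    )
```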
---
Yes, but the nature of the slowdown makes me think that we need it only in merge and not in other places as of now. I am open to rewriting this here as well if that turns out differently.

pandas has some caveats that might change your dtype in meta but not on the actual DataFrame. Relying on the initial meta seems safer to me. But this might also be totally wrong.

@mrocklin and I chatted offline and landed on this solution. One motivating factor was the last part here: #284 (comment)
---
I don't have a good answer for that. Meta is still the best bet, but it has some limitations. This will get more stable in the future since we are deprecating all of these caveats at the moment.
---
Okay, I think you are talking about the decision not to write general `Expr`-wide caching code here, which is all good with me. I was only thinking about the decision to use `_precomputed_meta` instead of a simple k/v cache.

Possible problems with the k/v cache approach:

Possible problems with `_precomputed_meta`:

- a `substitute_parameters` call will drop the information, even if you aren't changing information that is relevant to meta

Right, I agree that it would be a mistake to make `_precomputed_meta` a proper operand.
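To make that bullet concrete, a simplified sketch (not dask-expr's actual implementation) of why a parameter substitution drops non-operand state:

```python
def substitute_parameters(expr, substitutions):
    # Simplified: rebuild the expression from its (possibly substituted)
    # named operands. Anything stored outside `operands`, such as a
    # `_precomputed_meta` attribute set on the old instance, is not
    # carried over, even when no substituted parameter affects meta.
    new_operands = [
        substitutions.get(name, operand)
        for name, operand in zip(expr._parameters, expr.operands)
    ]
    return type(expr)(*new_operands)
```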
---
We would also have to patch the meta of left and right in the HashJoin layer, because that adds a column.

The `substitute_parameters` issue is annoying; we could override it, but that's not great either.
---
Sorry for adding all this discussion. I do get that you are focusing on the `Merge`-specific `meta` issue at hand.

I'm just doing my best to keep the big picture in mind: it seems like we are going to keep running into cases where we would benefit from caching information outside the `Expr` object itself. Therefore, it would be nice if we could design a formal system where a collection of different caches can be managed in one place.

That said, I definitely don't think we need to do something like that right now.
---

Right, `HashJoinP2P._meta_cache_token` does need to drop the hash columns.