Add how="leftanti" support for cudf-backed merge #1073
base: main
Thanks @charlesbluca - Seems reasonable to add "leftanti" support given that the necessary logic is pretty simple, and the legacy dask.dataframe API supports it.
df2 = df2.rename(columns={"aa": "dd"})
assert_eq(
    df1.merge(df2, how="leftanti", left_on="aa", right_on="dd"),
    pdf1[~pdf1.aa.isin(pdf2.aa)],
Seems like we could just do this in merge_chunk for pandas data to support how="leftanti" for cpu as well.
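To make the suggestion concrete, here is a rough sketch of the `isin`-based filter from the test above, applied to plain pandas frames. The helper name `leftanti_chunk` and the sample data are hypothetical and not taken from this PR or from dask's `merge_chunk`; the sketch also assumes a single key column on each side.

```python
import pandas as pd


def leftanti_chunk(left, right, left_on, right_on):
    """Hypothetical helper: keep the rows of `left` whose key has no match in `right`.

    Assumes a single key column on each side; multi-column keys would need
    a different approach (for example, a merge with indicator=True).
    """
    # Boolean mask marking left rows whose key also appears on the right
    matched = left[left_on].isin(right[right_on])
    # A left anti join keeps only the unmatched left rows, with left columns only
    return left[~matched]


# Illustrative data only
pdf1 = pd.DataFrame({"aa": [1, 2, 3, 4], "bb": ["w", "x", "y", "z"]})
pdf2 = pd.DataFrame({"dd": [2, 4, 6], "cc": [0.1, 0.2, 0.3]})

result = leftanti_chunk(pdf1, pdf2, left_on="aa", right_on="dd")
```

Only the left-hand columns survive, which matches what the test expects from `pdf1[~pdf1.aa.isin(pdf2.aa)]`.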
Yeah good point, can look into this a bit more
Pushed some commits to dask/dask#11150 that, in conjunction with this PR, should unblock left anti/semi joins on CPU
Looks like we should be unblocked to support left anti joins when dataframe.backend="cudf", similar to the case in legacy Dask dataframe: https://github.com/dask/dask/blob/df4de6ea53054790b09006c8ea68ef8725d39025/dask/dataframe/multi.py#L565
Note that, like the legacy code, we'll fail somewhere down in the computation stack if we try this on CPU - not sure if it makes sense to check the backend if how="leftanti" and eagerly raise a NotImplementedError if dataframe.backend != "cudf".
cc @rjzamora
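For reference, the eager check floated above could look roughly like the following. This is a sketch only: the function name `check_leftanti_backend` is hypothetical, and the backend is passed in as a plain string rather than read from the Dask config, so nothing here reflects the actual code in this PR.

```python
def check_leftanti_backend(how, backend):
    """Hypothetical guard: fail early for left anti joins on non-cudf backends.

    `backend` stands in for the value of the dataframe.backend setting.
    """
    if how == "leftanti" and backend != "cudf":
        raise NotImplementedError(
            'how="leftanti" is only supported when dataframe.backend="cudf"; '
            f"got backend={backend!r}"
        )


# Supported combinations pass silently
check_leftanti_backend("leftanti", "cudf")
check_leftanti_backend("inner", "pandas")
```

Raising at graph-construction time would surface the limitation immediately instead of deep in the computation stack, at the cost of needing to be removed once CPU support lands.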