-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mergeduplicates keyword to handle makeunique=false #3366
base: main
Are you sure you want to change the base?
Conversation
Response to #2243 |
Thank you for the proposal. It requires a design discussion before reviewing the implementation. The options I see as alternatives to your design proposals are (based on my opinion and some short discussion with @nalimilan):
@nalimilan - could you please comment on what you prefer? (or maybe some other option). In particular I do not think we need the |
I like the idea of adding a function option to this, but ‘makeunique’ seemed the wrong keyword to capture this functionality, it’s not explanatory anymore, which is why I moved it to ‘dupcol’. That said I’d be fine with ‘makeunique’ being either Bool or a function. That makes the signature simpler and would be more flexible. My implementation would easily adapt to that if that’s where you want to go.
|
I agree it seems simpler to use the existing |
OK - so the change would be to allow passing a
Thank you! |
Literally |
I preferred Do you see some concrete example of a useful callable that is not a function? Note that this function requires 2 positional arguments to be passed (and typically, non-function callables accept only one anyway - of course, this is not a 100% correct rule either, but just an intuition) |
No concrete example, just a general style, but DataFrames.jl has its own style so I'm not too concerned about it. And I guess using |
One argument for a new keyword is that currently, That said, I was recently wishing for this functionality, so 👍 from me! Side note - maybe needs a different issue, but could also be an argument for reserving
It took me a while to sort out which columns were actually duplicated (this was in a table with thousands of columns - I know, not ideal) I would have liked to be able to do something like |
This is a good point. Maybe indeed we should add both, i.e. And another kwarg for merging columns (which, if set, would disable |
@leei - why have you closed this PR? I thought we had a conclusion that this is a valuable addition (just the design should be a bit different). |
Strange. Must have happened automatically when I moved the commit to a new branch. Not sure how to resolve this since I want to preserve the discussion but that previous pull is a bit wrongheaded. I have a new PR that contains different changes that overload the makeunique kw in the way described above. I still have that commit around but I like the "overload makeunique" solution best that takes three kinds of args:
The currently allowed keywords are:
Should I just add a new PR with the updated patch, restore the commit on main that this PR is based on, or both? |
… columns in joins and DataFrame constructors
The alternative implementation is #3373 |
An alternative is to locally REBASE the PR and then force-push it to GitHub. Having said that what I would propose (in respect for your time) is that we first discuss the final design and then this design is implemented. This is the standard procedure that we always follow in DataFrames.jl (because there is usually a lot of discussion around every design decision). Also if we have a design decision it does not mean that it would have to be "in full" implemented in one PR (it could, but it does not have to, as ). In particular what @kescobo commented on is that it would be good to mentally separate two operations:
On the other hand maybe indeed what you propose is good. How I understand it is the following. We have only
If this is the proposal maybe we could extend it with two extra options (non-conflicting with what you proposed):
@leei, @nalimilan, @kescobo, @jariji - can you please comment on this. So that we have a clear vision where we go before moving forward with the implementation? Thank you! |
Can somebody give a motivating example for when I would want |
This is my opinion, very lightly held. That is, I think both approaches are improvements, and neither is so obviously correct / better today I would argue strongly for it. To "make (cols) unique" implies that you will end up with multiple columns. Also, I can see situations where it might be useful to be able to use a function other than
I'm not totally clear on what the arguments and expected outputs for the function versions are, but my first instinct is:
|
My original proposal was to replace `makeunique` with `dupcol` , which indicates the action when faced with duplicate columns as a result of the join. Logical, simple and can easily handle both keyword and Function arguments.
Big con: Very disruptive of the existing API, since it extends the `makeunique` functionality and until `makeunique` is removed it’s possible to provide contradictory instructions e.g. `makeunique=true` and `dupcol=:update`.
While I agree that retaining `makeunique` and extending it to take `Bool`, `Symbol` and `Function` arguments is aesthetically dodgy, it’s much less disruptive of the existing API. If there was a simple way to deprecate such a universally used kw and replace it with one that’s better named, I’d be all for it.
As far as the particular keyword choices?
If I’ve got new data coming (e.g. from an external API call, say asking for records changed since the last update time) and I don’t know which are new and which are updates then a join with `makeunique=:update` is exactly what I need. In fact for that case I’d probably also want `outerjoin!`
`makeunique=:ignore` is also pretty obvious to me in cases where I’m planning to ignore the duplicated columns afterward, esp. when I’m exploring calculations, hit an error from duplicated columns, and want to see my results before deciding what I really want to do.
To me there are a few options. Bite the bullet and introduce a new kw (`dupcol` or other name) with deprecation warnings for `makeunique` (i.e. this PR), or override `makeunique` and tolerate the aesthetics (i.e. the other PR).
It is of course possible to introduce the new semantics on `makeunique` and then add a subsequent change to the kw name.
Kevin’s is of course another take, since it doubles down on `makeunique`’s current meaning and retains it, while allowing for a `dupcol`-like extension. To me having a single kw for “what do I do with duplicates” is a better option and adding a second kw that handles “how do I generate new column names” makes more sense.
And I do apologize for just going ahead and coding this w/o the design discussion. Different repos have different cultures. I need the functionality and created the fork for my own purposes (see the `:update` example above)
|
FWIW - I don't think anyone has a problem with this - it's mostly out of knowledge that your time is precious too, and no one wants you to do a bunch of coding that ends up getting tossed in the bin because folks disagree with the design. If you needed / wanted this anyway, I don't think there's anything wrong with providing a concrete example as an opening salvo, as long as you won't be offended if the consensus lands in a different place (and it doesn't seem like you will be so 👍 ) |
It's a good point that we may want to allow passing a function to |
100% agreed and this is what I meant.
Let me summarize the requirements we have. The options we need to support are
Now the question is how to handle this with keyword arguments. We have the following options:
So my opinion is that I would not do point 3. For me doing 1 or 2 is OK. I have a slight preference for point 1 (all in |
I'm not a fan of the |
Let me propose a (4) that is perhaps the worst of all worlds 😅. It's kind of merging @bkamins 1/3, but addresses @nalimilan 's concern
|
This is something I think we should not do as the meaning of this kwarg would change depending on other kwarg. This is something that we should avoid as writing Given @nalimilan comment let me propose the following rules:
Note that it is crucial that the function accepts more than 2 arguments as in some cases we can have data that introduce more than 2 duplicate columns. Also when implementing
Fortunately for What do you think? |
Makes sense to me 👍 |
behaviour to match bkamins comment in JuliaData#3366 Is now used to pass a Function to handle cases where makequnique=false by combining those values (passed as parameters) into a returned result.
Just updated this PR to conform to @bkamins proposal above, |
|
||
Horizontally concatenate data frames. | ||
|
||
If `makeunique=false` (the default) column names of passed objects must be unique. | ||
If `makeunique=true` then duplicate column names will be suffixed | ||
with `_i` (`i` starting at 1 for the first duplicate). | ||
|
||
If `makeunique=false` and `mergeduplicates` is a Function then duplicate column names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If `makeunique=false` and `mergeduplicates` is a Function then duplicate column names | |
If `makeunique=false` and `mergeduplicates` is a `Function` then duplicate column names |
will be combined by this function with the column named overwritten by the results of | ||
the function on all values from the duplicated column(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not clear what this sentence means. Probably a native speaker would be better to say how to fix it.
What I can say is how this will work mergeduplicates(mergeduplicates(x,y),z)
where x
, y
and z
are consecutive duplicates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I say "ouch". I am very much a native English speaker. That said I've struggled with this wording to make it clearer... more struggle then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would the following be clearer?
If
makeunique=false
andmergeduplicates
is aFunction
then duplicate columns
will be combined by invoking the function with all values from those columns.
e.g.mergeduplicates=coalesce
will use the first non-missing value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The crucial issue is, as I have commented, that the behavior or mergeduplicates
will differ depending on the function in which it is used (unless we change the current implementation).
In functions like insertcols
or DataFrame
it will pass all duplicate columns as consecutive arguments to a SINGLE CALL to mergeduplicates
function.
In functions like hcat
or *join
it will perform a recursive call taking at most two duplicate columns at a time.
|
||
julia> df3.A === df1.A | ||
true | ||
``` | ||
""" | ||
function Base.hcat(df::AbstractDataFrame; makeunique::Bool=false, copycols::Bool=true) | ||
function Base.hcat(df::AbstractDataFrame; makeunique::Bool=false, mergeduplicates=nothing, copycols::Bool=true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
function Base.hcat(df::AbstractDataFrame; makeunique::Bool=false, mergeduplicates=nothing, copycols::Bool=true) | |
function Base.hcat(df::AbstractDataFrame; makeunique::Bool=false, mergeduplicates::Union{Nothing, Function}=nothing, copycols::Bool=true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm taking it that you're suggesting that mergeduplicates
be typed everywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
insert!(_columns(dfp), col_ind, item_new) | ||
else | ||
# Just update without adding to index | ||
merge = get(mergecolumns, name, (dfp=dfp, cols=[])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you need dfp
for here?
I have started looking at this PR and left some initial general comments (but they would apply to all code not only the parts that I have commented). However, having seen the amount of code affected I would really prefer if you split the PR into several PRs changing individual functions (e.g. It would greatly help me with the review (it is really hard to review large PRs). This is especially crucial as I need to make sure that we have an appropriate test coverage for every function changed. Would it be OK with you to make such a change? (we can keep this PR as a working reference, and these new PRs could be done as a cut-out from this one) |
I’d like to but... this functionality dives deep down into the basic initialization functionality of DataFrame and Index. It might be possible to break it down into chunks that reflect the top-level changes from a test perspective. I’ll see if I can do that easily.
Lee Iverson
604-290-1798
…On Sep 25, 2023 at 2:08 PM -0700, Bogumił Kamiński ***@***.***>, wrote:
I have started looking at this PR and left some initial general comments (but they would apply to all code not only the parts that I have commented).
However, having seen the amount of code affected I would really prefer if you split the PR into several PRs changing individual functions (e.g. hcat separately, DataFrame separately, insertcols+insertcols! separately).
It would greatly help me with the review (it is really hard to review large PRs). This is especially crucial as I need to make sure that we have an appropriate test coverage for every function changed. Would it be OK with you to make such a change? (we can keep this PR as a working reference, and these new PRs could be done as a cut-out from this one)
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
makeunique::Bool=false, copycols::Bool=true) | ||
u = add_names(index(df1), index(df2), makeunique=makeunique) | ||
makeunique::Bool=false, mergeduplicates=nothing, copycols::Bool=true) | ||
u = add_names(index(df1), index(df2), makeunique=true, mergeduplicates=mergeduplicates) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this implementation looks incorrect. Where do you use the mergeduplicates
if passed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's presence is necessary to prevent the error throw in add_names
.
That said, I don't see any use of merge
or merge!
in the codebase, so the change was for completeness, since add_names
gets used in hcat!
and needs to be sensitive to mergeduplicates
there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I do not understand why. add_names
works on index
so it cannot take into account mergeduplicates
anyway. Why and how do you think it should?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear - I saw implementation of add_name
that you have changed. I just feel that add_names
should not handle the mergeduplicates
kwarg because it makes the design non-obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe to further clarify my thinking. We need to make sure that the design we use is simple. The reason is that this package is large and it is important to make sure that future developers have an easy way to understand how everything works. That is the reason I recommend taking "small steps", as this ensures that these small steps have clean design.
@@ -128,8 +138,8 @@ function Base.push!(x::Index, nm::Symbol) | |||
return x | |||
end | |||
|
|||
function Base.merge!(x::Index, y::AbstractIndex; makeunique::Bool=false) | |||
adds = add_names(x, y, makeunique=makeunique) | |||
function Base.merge!(x::Index, y::AbstractIndex; makeunique::Bool=false, mergeduplicates=nothing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function for sure does not need mergeduplicates
kwarg as it does not take a data frame as a source
That is one of my points (I have added comments). The functionality should not affect E.g. I think starting with |
OK, I've finally found the time to revise the implementation to cover the cases you've raised and break it down into a series of commits that are relatively self-contained. It is on a different branch in my repo though. Thid PR won't seem to let me change the branch. Is that because it's in review? In any case, the revised PR is on my updateindex branch. |
No, but you can rename and then force push, I think |
It is OK to open a new PR. Then in this PR we keep history (if needed). And once things are finalized the historical PR can be just closed. |
The new PR tracking after all of this discussion is #3401 |
Add the
dupcol
keyword In all constructors and joins that allowmakeunique=true
to generate new column names when there are duplicates among the constructed or joined columns. Allows for extensible actions to be taken when these duplicate columns are detected::error
throws an error when duplicate column names are given or created (equivalent tomakeunique=false
, the default),:makeunique
creates new column names for the duplicate columns using the same approach asmakeunique=true
), and:update
which coalesces values by updating the values from the first duplicate column with non-missing values from each subsequent duplicate column name.N.B. This approach is extensible to other possible actions such as
:ignore
or:overwrite
if so desired.Also warns that
makeunique=true
as deprecated but maintains its functionality.