Encoding ordinal variables #613
Comments
Thanks for the suggestion! I am not sure I understand what the output of the encoder should be. Could you give us an example?
For example, if there is a column taking possible values
i.e. a mean calculated by grouping over all rows whose value in the column being encoded is <= a threshold (so the calculation for a value of 2 would also involve rows with a value of 1), instead of a mean calculated by grouping over each value separately.
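To make the grouping concrete, here is a minimal plain-Python sketch of the cumulative mean described above. The function name `cumulative_mean_encoding` and the toy data are hypothetical, chosen only for illustration; this is not an existing feature_engine API.

```python
def cumulative_mean_encoding(values, target):
    """Map each level c to the mean of `target` over rows where value <= c."""
    # Per-level sums and counts of the target.
    sums, counts = {}, {}
    for v, t in zip(values, target):
        sums[v] = sums.get(v, 0.0) + t
        counts[v] = counts.get(v, 0) + 1

    # Walk the levels in order, carrying running totals so that each level
    # also includes every row belonging to a lower level.
    encoding = {}
    running_sum, running_count = 0.0, 0
    for level in sorted(sums):
        running_sum += sums[level]
        running_count += counts[level]
        encoding[level] = running_sum / running_count
    return encoding


# Hypothetical toy data: an ordinal rating and a binary target.
ratings = [1, 1, 2, 2, 3]
target = [0, 1, 1, 1, 0]

mapping = cumulative_mean_encoding(ratings, target)
print(mapping)  # {1: 0.5, 2: 0.75, 3: 0.6}
```

A per-value mean encoder would give `{1: 0.5, 2: 1.0, 3: 0.0}` on the same data. The cumulative version pools lower levels into each group, which is where the smoothing comes from.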
Thank you.
I second this. This type of encoding is very useful for linear modeling especially. It has an averaging effect on ordinal variables that is much more stable than simple one-hot encoding. @solegalli if I get a pull request together along with examples of how it is beneficial, is this something the team would consider merging?
Thanks for joining this discussion. Yes, we tend to be quite open towards new functionality. I've never heard of or read about this type of encoding. Is there an article that you could link for more info? Or is this something that you do in practice, perhaps a common practice in some industry? To make it meaningful for potential users, we would have to add, besides the functionality, a good user guide with examples of how to use this class and explanations of what constitutes a good use case for this type of encoding. You seem to have it covered though, because you mention examples of how this would be beneficial. So go for it! I look forward to the PR :)
Couldn't this be accomplished by using ArbitraryDiscretiser followed by MeanEncoder?
No, because it'd require overlaps between rows.
Oftentimes, one wants to build linear models that have ordinal variables as features (e.g. "rate on a scale from 1 to 5 ..."). One might treat these as numerical or categorical, but either choice loses some information.
It would be nice to have ordinal versions of some typical categorical encoders, such as mean/frequency encoders that would do the grouping by a condition `x<=c` instead of `x==c`.
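As a rough illustration of the difference between the two grouping conditions, the sketch below computes a frequency encoding both ways in plain Python. The helper `frequency_encoding` and its `ordinal` flag are hypothetical names for this example only and do not correspond to any existing encoder class.

```python
from collections import Counter


def frequency_encoding(values, ordinal=False):
    """Share of rows per level, grouped by x == c or, if ordinal, by x <= c."""
    counts = Counter(values)
    n = len(values)
    encoding, running = {}, 0
    for level in sorted(counts):
        if ordinal:
            # x <= c: accumulate the counts of all lower levels as well.
            running += counts[level]
        else:
            # x == c: only rows at exactly this level.
            running = counts[level]
        encoding[level] = running / n
    return encoding


# Hypothetical toy data: ratings on a 1 to 3 scale.
ratings = [1, 1, 2, 2, 3]

categorical = frequency_encoding(ratings)
ordinal = frequency_encoding(ratings, ordinal=True)
print(categorical)  # {1: 0.4, 2: 0.4, 3: 0.2}
print(ordinal)      # {1: 0.4, 2: 0.8, 3: 1.0}
```

With `x<=c` the encoded values are monotone in the level, so the ordering of the original variable is preserved, whereas the `x==c` version maps levels 1 and 2 to the same value here.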