Enable the OneHotEncoder to be able to drop categories #339
It's a nice little change, @27pchrisl, thanks! I'll have to give some thought to whether it's the best way to accomplish the end goal.
/**
 * @param string|array $drop The categories to drop during encoding
 */
public function __construct($drop = [])
Although I think it's nice to be liberal with the type of this argument, it's tradition to prefer strict types unless a looser type is necessary, and in this case it isn't.
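As a rough illustration of the strict-typing suggestion above, the constructor could accept only an array of strings. This is a minimal sketch, not the code from the pull request: the class name `DropOptionSketch` and its `drop()` accessor are hypothetical, and it assumes PHP 7.4 or later for the typed property.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for the encoder, showing only the $drop argument.
final class DropOptionSketch
{
    /** @var string[] */
    private array $drop;

    /**
     * @param string[] $drop The categories to drop during encoding
     */
    public function __construct(array $drop = [])
    {
        // PHP cannot express "array of strings" in a type hint, so check by hand.
        foreach ($drop as $category) {
            if (!is_string($category)) {
                throw new InvalidArgumentException('Categories must be strings.');
            }
        }

        // Remove duplicates and reindex from zero.
        $this->drop = array_values(array_unique($drop));
    }

    /**
     * @return string[]
     */
    public function drop(): array
    {
        return $this->drop;
    }
}
```

A caller who wants to drop a single category would then pass `['?']` rather than the bare string `'?'`.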
Thanks Andrew!
Hey @27pchrisl, I'm interested to know if you've thought of other approaches, for example filtering specific categories from the dataset before one-hot encoding it. Would a "CategoryDropper" Transformer allow for the same outcome when paired with OneHotEncoder but also serve other useful purposes? I get that you'd have to replace the category with something (perhaps a missing data placeholder, e.g. '?'), so it's not really "dropping" the category, but maybe this could be handled by making OneHotEncoder "missing data aware" so that it ignores those data. I think if we can rule out there being a better alternative than handling the "dropping" of categories in the OneHotEncoder, then this is a go.

Also, I'm just a tiny bit concerned about there being no discrimination between feature columns here, for instance if the same set of categories were used to describe different features. You wouldn't have control over which columns to operate on; it would always be all of them. This is not a deal-breaker for me though, just something we would want to make special note of in the documentation.
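To make the alternative concrete, here is a minimal sketch of the masking step a "CategoryDropper"-style transformer might perform, with per-column control over which categories are replaced. It is plain PHP and does not use any Rubix ML interface; the function name `maskCategories` and its parameters are hypothetical, and the '?' default is only the placeholder mentioned above.

```php
<?php

declare(strict_types=1);

/**
 * Replace selected categories with a placeholder, column by column.
 *
 * @param array<int, array<int, string>> $samples      Rows of categorical features
 * @param array<int, string[]>           $dropByColumn Column offset => categories to mask
 * @param string                         $placeholder  Value written in place of a masked category
 * @return array<int, array<int, string>>
 */
function maskCategories(array $samples, array $dropByColumn, string $placeholder = '?'): array
{
    foreach ($samples as &$sample) {
        foreach ($dropByColumn as $column => $categories) {
            // Only touch the listed columns, so a category shared by two
            // features can be masked in one and kept in the other.
            if (isset($sample[$column]) && in_array($sample[$column], $categories, true)) {
                $sample[$column] = $placeholder;
            }
        }
    }
    unset($sample); // break the reference left over from the loop

    return $samples;
}
```

The encoder would still need to treat the placeholder as "no category" for the combination to match the behavior proposed in this pull request.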
Hi @andrewdalpino, yep, I agree that if you have a feature where many categories should be not hot, the author should transform that outside of the encoder.

I took inspiration for the signature from scikit-learn, which probably isn't the best source since Python libraries tend to really overload their parameters.

I'm using a very sparse dataset (CRM data), so I definitely need the capability for a none-hot category to prevent the model thinking the absence of a category is a category in itself. Absence represents poor quality data rather than a deliberate choice. My goal was to have the model ignore the feature in that case.
Hi,
I've been working with a sparse dataset in which my '?' category should really be represented as none of the generated features being hot when using the OneHotEncoder. This contribution adds this as a backwards-compatible option to the encoder.
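The intended behavior can be sketched on a single feature column. This is an illustration only, not the encoder's actual implementation: the function `oneHotWithDrop` is hypothetical and ignores everything the real Transformer does around fitting and multiple columns.

```php
<?php

declare(strict_types=1);

/**
 * One-hot encode a single categorical column, emitting no feature for
 * any category listed in $drop.
 *
 * @param string[] $values The column's values, one per sample
 * @param string[] $drop   Categories that should produce an all-zero row
 * @return array<int, int[]>
 */
function oneHotWithDrop(array $values, array $drop = []): array
{
    // Distinct categories in order of first appearance, minus the dropped ones.
    $categories = array_values(array_diff(array_unique($values), $drop));

    $encoded = [];

    foreach ($values as $value) {
        $row = [];

        foreach ($categories as $category) {
            $row[] = $value === $category ? 1 : 0;
        }

        // A dropped category matches nothing, so its row is all zeros.
        $encoded[] = $row;
    }

    return $encoded;
}
```

With the default empty `$drop`, every category gets its own feature, which is what keeps the option backwards compatible. Passing `['?']` removes the '?' feature, so samples with a missing value are encoded with no feature hot.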