Enable the OneHotEncoder to be able to drop categories #339

27pchrisl · 2024-05-27T11:27:44Z

Hi,

I've been working with a sparse dataset in which my '?' category should really be represented by none of the generated features being hot when using the OneHotEncoder.

This contribution adds this as a backwards-compatible option to the encoder.
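To illustrate the intended behaviour, here is a minimal standalone sketch rather than the code in this PR. The function name `oneHotWithDrop` and its signature are assumptions made for the example. A dropped category gets no feature of its own, so its row encodes as all zeros.

```php
<?php

/**
 * One-hot encode a single column of categories. Categories listed in $drop
 * get no feature of their own, so they encode as all zeros.
 * Hypothetical helper for illustration, not the library's implementation.
 *
 * @param list<string> $column
 * @param list<string> $drop
 * @return list<list<int>>
 */
function oneHotWithDrop(array $column, array $drop = []): array
{
    // Distinct categories in order of first appearance, minus the dropped ones.
    $categories = array_values(array_diff(array_unique($column), $drop));

    // Map each kept category to its offset in the encoded vector.
    $offsets = array_flip($categories);

    $encoded = [];

    foreach ($column as $value) {
        $vector = array_fill(0, count($categories), 0);

        // A dropped category has no offset, so its vector stays all zeros.
        if (isset($offsets[$value])) {
            $vector[$offsets[$value]] = 1;
        }

        $encoded[] = $vector;
    }

    return $encoded;
}

$rows = oneHotWithDrop(['email', '?', 'phone', 'email'], ['?']);
```

With an empty `$drop` the function behaves like a plain one-hot encoder, which is what keeps the option backwards-compatible.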

andrewdalpino

It's a nice little change @27pchrisl, thanks! I'll have to give some thought as to whether it's the best way to accomplish the end goal.

andrewdalpino · 2024-06-30T19:31:35Z

src/Transformers/OneHotEncoder.php

+    /**
+     * @param string|array $drop The categories to drop during encoding
+     */
+    public function __construct($drop = [])


Although I think it's nice to be liberal with the type of this argument, it's tradition to prefer strict types unless a looser type is necessary, and in this case it isn't.
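As a rough sketch of what the stricter signature could look like, assuming the argument is narrowed to an array only: the class name `DropAwareEncoder` and the `drop()` accessor are hypothetical stand-ins, not the actual class in this PR.

```php
<?php

/**
 * Hypothetical stand-in for the encoder, showing only the constructor.
 */
class DropAwareEncoder
{
    /**
     * The categories to drop during encoding.
     *
     * @var list<string>
     */
    protected array $drop;

    /**
     * @param list<string> $drop The categories to drop during encoding
     */
    public function __construct(array $drop = [])
    {
        // Remove duplicates and reindex so the list is predictable.
        $this->drop = array_values(array_unique($drop));
    }

    /**
     * @return list<string>
     */
    public function drop(): array
    {
        return $this->drop;
    }
}

$encoder = new DropAwareEncoder(['?']);
```

A single category is then passed as `['?']`, and passing a bare string raises a `TypeError` instead of being silently accepted.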

27pchrisl · 2024-07-02T13:35:46Z

Thanks Andrew!

andrewdalpino · 2024-07-09T01:34:18Z

Hey @27pchrisl, I'm interested to know if you've thought of other approaches, for example filtering specific categories from the dataset before one-hot encoding it. Would a "CategoryDropper" Transformer allow for the same outcome when paired with OneHotEncoder but also serve other useful purposes? I get that you'd have to replace the category with something (perhaps a missing data placeholder, e.g. '?'), so it's not really "dropping" the category, but maybe this could be handled by making OneHotEncoder "missing data aware" so that it ignores those values.
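A rough sketch of that two-step alternative, in plain PHP rather than as real Transformers. Both function names are hypothetical, and the '?' placeholder is just the example value mentioned above.

```php
<?php

/**
 * Step 1: replace every listed category with a missing-data placeholder.
 * Hypothetical sketch of the "CategoryDropper" idea.
 *
 * @param list<string> $column
 * @param list<string> $categories
 * @return list<string>
 */
function replaceWithPlaceholder(array $column, array $categories, string $placeholder = '?'): array
{
    return array_map(
        static function ($value) use ($categories, $placeholder) {
            return in_array($value, $categories, true) ? $placeholder : $value;
        },
        $column
    );
}

/**
 * Step 2: one-hot encode a column, treating the placeholder as missing
 * data that gets no feature of its own (hypothetical "missing data aware" encoder).
 *
 * @param list<string> $column
 * @return list<list<int>>
 */
function encodeIgnoringMissing(array $column, string $placeholder = '?'): array
{
    // Distinct non-placeholder categories in order of first appearance.
    $categories = array_values(array_unique(array_filter(
        $column,
        static function ($value) use ($placeholder) {
            return $value !== $placeholder;
        }
    )));

    $encoded = [];

    foreach ($column as $value) {
        // The placeholder matches no category, so it encodes as all zeros.
        $encoded[] = array_map(
            static function ($category) use ($value) {
                return (int) ($category === $value);
            },
            $categories
        );
    }

    return $encoded;
}

$cleaned = replaceWithPlaceholder(['gold', 'unknown', 'silver', 'n/a'], ['unknown', 'n/a']);
$encoded = encodeIgnoringMissing($cleaned);
```

The first step could be reused before other transformers, which is the "other useful purposes" part of the question.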

I think if we can rule out there being a better alternative to handling the "dropping" of categories in the OneHotEncoder, then this is a go.

Also, I'm just a tiny bit concerned about there being no discrimination between feature columns here, for instance if the same set of categories were used to describe different features. You wouldn't have control over which columns to operate on; it would always be all of them. This is not a deal-breaker for me though - just something we would want to make special note of in the documentation.
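For comparison, per-column control might look something like the sketch below. This is not what the PR implements; the function `oneHotPerColumn` and the column-offset-keyed `$drop` map are assumptions made for illustration.

```php
<?php

/**
 * One-hot encode every column of a dataset with a separate drop list per column.
 * Hypothetical sketch, not the behaviour of this PR.
 *
 * @param list<list<string>> $samples Rows of categorical values
 * @param array<int, list<string>> $drop Column offset => categories to drop
 * @return list<list<int>>
 */
function oneHotPerColumn(array $samples, array $drop = []): array
{
    $categories = [];

    // Collect the kept categories of each column in order of first appearance.
    foreach ($samples as $sample) {
        foreach ($sample as $column => $value) {
            $dropped = in_array($value, $drop[$column] ?? [], true);

            if (!$dropped && !in_array($value, $categories[$column] ?? [], true)) {
                $categories[$column][] = $value;
            }
        }
    }

    $encoded = [];

    foreach ($samples as $sample) {
        $vector = [];

        // Concatenate the one-hot vectors of each column, left to right.
        foreach ($sample as $column => $value) {
            foreach ($categories[$column] ?? [] as $category) {
                $vector[] = (int) ($category === $value);
            }
        }

        $encoded[] = $vector;
    }

    return $encoded;
}

$samples = [
    ['?', 'yes'],
    ['red', '?'],
    ['blue', 'no'],
];

// '?' is dropped in column 0 only; in column 1 it stays an ordinary category.
$encoded = oneHotPerColumn($samples, [0 => ['?']]);
```

Here column 0 yields two features and column 1 yields three, so the same '?' value is none-hot in one column and a regular feature in the other.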

27pchrisl · 2024-07-09T09:06:45Z

Hi @andrewdalpino, yep, I agree that if a feature has many categories that should not be hot, the author should transform it outside of the OneHotEncoder so the encoder can just do its own job, similar to preparing the data with the MissingDataImputer. Then the OHE only needs to be told which single category should be dropped, probably defaulting to '?'.

I took inspiration for the signature from scikit-learn, which probably isn't the best source since Python libraries tend to really overload their parameters ☺️

I'm using a very sparse dataset (CRM data), so I definitely need the capability for a none-hot category to prevent the model thinking the absence of a category is a category in itself. Absence represents poor quality data rather than a deliberate choice. My goal was to have the model ignore the feature in that case.

Create a 'drop' parameter

ae8deee

andrewdalpino reviewed Jun 30, 2024


andrewdalpino approved these changes Jul 16, 2024
