Request: selecting variables through a user-supplied function
#589
Comments
Hey @david-cortes, this sounds like a very specific case, and I am not sure how widespread its use would be. Could you provide an example? I can't really picture the scenario. Thank you.
A quick example for now: suppose I have a data frame with numeric features that have missing values, and I want to process it as follows:
In this case, the binary missing-indicator columns should not get squared, since squaring a 0/1 column returns the same values. One way to achieve that would be to have the first transformer name the indicator columns with a given suffix and then let the last transformer select the columns without that suffix. You might say that one can simply pass the column names directly to the last transformer, but suppose that I want to try two different models using different subsets of the features, or that I want to apply the same transformers to two datasets with similar contents (e.g. data from 1-30 days ago and data from 31-60 days ago, which might have similar but not entirely equal column names).
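The suffix-based selection described above can be sketched on plain lists of column names. The "_na" suffix, the helper names, and the column names are all hypothetical and only illustrate how one predicate could serve two datasets with slightly different columns:

```python
from typing import Callable, Iterable, List

# Hypothetical suffix that the first transformer would append
# to the missing-indicator columns it creates.
NA_SUFFIX = "_na"


def is_not_indicator(name: str) -> bool:
    """Return True for columns that are not missing indicators."""
    return not name.endswith(NA_SUFFIX)


def select_columns(columns: Iterable[str], keep: Callable[[str], bool]) -> List[str]:
    """Keep the column names for which the predicate returns True."""
    return [name for name in columns if keep(name)]


# Two datasets with similar but not identical column names (made-up examples).
recent_columns = ["age", "income", "age_na", "income_na"]
older_columns = ["age", "income", "tenure", "age_na", "tenure_na"]

# The same predicate picks the columns to square in both datasets.
recent_to_square = select_columns(recent_columns, is_not_indicator)
older_to_square = select_columns(older_columns, is_not_indicator)

print(recent_to_square)  # ['age', 'income']
print(older_to_square)   # ['age', 'income', 'tenure']
```

With a fixed list of names, the second dataset would need its own list; with a predicate, nothing has to be redefined.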
scikit-learn's ColumnTransformer accepts callables to select columns, and make_column_selector builds such callables.
But those scikit-learn transformers often force conversions between DataFrames and matrices, which is undesirable for the kind of transformations that feature_engine does.
If you want ColumnTransformer to return a DataFrame, you can do so with the set_output method. For example:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction import FeatureHasher
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
X = pd.DataFrame({
    "documents": ["First item", "second one here", "Is this the last?"],
    "width": [3, 4, 5],
})
# "documents" is a string, which configures ColumnTransformer to
# pass the documents column as a 1d array to the FeatureHasher.
# Caveat: FeatureHasher returns a sparse matrix, and pandas output may
# reject it; a dense-output transformer may be needed for this step.
ct = ColumnTransformer(
    [("text_preprocess", FeatureHasher(input_type="string"), "documents"),
     ("num_preprocess", MinMaxScaler(), ["width"])],
    # This parameter ensures that the original feature names are also kept in the output DataFrame
    verbose_feature_names_out=False
)
# Ensures that a DataFrame is returned by transform (transform is keyword-only)
ct.set_output(transform="pandas")
X_trans = ct.fit_transform(X)
Transformers in this library take a variables argument, which is expected to be a list of column names.
Often, one has variables that follow some natural grouping and would want to apply a given transformer to all variables that match some naming pattern. This is relatively easy when there is a single modeling pipeline, by creating Python variables that hold the column names. But one often wants to try, for example, the same transformer pipeline with different groups of features, or slight variations of earlier transformations, so the exact list of variables varies from one run to another and the transformers need to be redefined each time.
It would be helpful if the transformers could also accept variables as a function that is applied to the column names and returns True or False to indicate whether the transformer applies to each variable.
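A minimal sketch of what the requested behaviour could look like, assuming a hypothetical helper resolve_variables that a transformer might call at fit time. This is not feature_engine's actual API, and the column names are made up:

```python
from typing import Callable, List, Sequence, Union

# Either an explicit list of column names or a predicate on a column name.
VariablesArg = Union[Sequence[str], Callable[[str], bool]]


def resolve_variables(variables: VariablesArg, columns: Sequence[str]) -> List[str]:
    """Turn a list of names or a name predicate into a concrete list of columns.

    Hypothetical helper for illustration only.
    """
    if callable(variables):
        # Apply the user-supplied function to every column name.
        return [name for name in columns if variables(name)]
    # Explicit list: check that every requested column exists.
    missing = [name for name in variables if name not in columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    return list(variables)


columns = ["price", "price_na", "qty", "qty_na"]

# Current behaviour: an explicit list of names.
by_list = resolve_variables(["price", "qty"], columns)

# Requested behaviour: a function that returns True or False per column name.
by_function = resolve_variables(lambda name: not name.endswith("_na"), columns)

print(by_list)      # ['price', 'qty']
print(by_function)  # ['price', 'qty']
```

Resolving the function against the columns seen at fit time would let the same transformer definition be reused across runs whose column lists differ.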