Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specifying model without statsmodels.formulas.api seems to not work in both pandas and polars #140

Open
artiom-matvei opened this issue Nov 10, 2024 · 1 comment

Comments

@artiom-matvei
Copy link
Contributor

I think it would be useful to support specifying a model without statsmodels.formulas.api

Pandas example

import pandas as pd
import statsmodels.api as sm
import marginaleffects as me

# Load and prepare the data
penguins = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv")
penguins_clean = penguins[['island', 'bill_length_mm', 'flipper_length_mm']].dropna()
penguins_clean['island'] = pd.Categorical(penguins_clean['island'], categories=["Biscoe", "Dream", "Torgersen"])

# Prepare predictors and target
X = penguins_clean[[ 'bill_length_mm', 'flipper_length_mm']]
X = sm.add_constant(X)
y = penguins_clean['island'].cat.codes

# Fit the model
model_py = sm.MNLogit(y, X).fit()

# Extract and display coefficients
coef_py = model_py.params
print("Coefficients from Python model:")
print(coef_py)

me.predictions(model_py)

Currently it throws the error below, probably meaning that get_modeldata() could be generalized:

File c:\\...\\pymarginaleffects\\marginaleffects\\model_statsmodels.py:17, in ModelStatsmodels.get_modeldata(self)
     16 def get_modeldata(self):
---> 17     df = self.model.model.data.frame
     18     if not isinstance(df, pl.DataFrame):
     19         df = pl.from_pandas(df)

AttributeError: 'PandasData' object has no attribute 'frame'"

Polars example

Changing to polars does not fix it, it seems like get_modeldata() could be improved. See example below and error thrown:

import polars as pl
import statsmodels.api as sm
import marginaleffects as me

# Load and prepare the data
penguins = pl.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv")
penguins_clean = penguins.select(['island', 'bill_length_mm', 'flipper_length_mm']).drop_nulls()


# Step 3: Define island categories and create a mapping
island_categories = ["Biscoe", "Dream", "Torgersen"]
island_mapping = {island: code for code, island in enumerate(island_categories)}

# Step 4: Map 'island' to integer codes and create a new column 'island_code'
penguins_clean = penguins_clean.with_columns(
    pl.col('island').replace_strict(island_mapping).alias('island_code')
)

# Step 5: Prepare predictor variables
X = penguins_clean.select(['bill_length_mm', 'flipper_length_mm'])

# Step 6: Add a constant term for the intercept
X = X.with_columns(pl.lit(1.0).alias('const'))
X = X.select(['const', 'bill_length_mm', 'flipper_length_mm'])

# Step 7: Extract the target variable
y = penguins_clean['island']

# Step 8: Convert Polars DataFrames to NumPy arrays
X_np = X.to_numpy()
y_np = y.to_numpy()

# Step 9: Fit the multinomial logistic regression model
model_py = sm.MNLogit(y_np, X_np).fit()

# Step 10: Extract and display coefficients
coef_py = model_py.params
print("Coefficients from Python model:")
print(coef_py)

me.predictions(model_py)

Error thrown:

File c:\\...\\pymarginaleffects\\marginaleffects\\model_statsmodels.py:17, in ModelStatsmodels.get_modeldata(self)
     16 def get_modeldata(self):
---> 17     df = self.model.model.data.frame
     18     if not isinstance(df, pl.DataFrame):
     19         df = pl.from_pandas(df)

AttributeError: 'ModelData' object has no attribute 'frame'"
@vincentarelbundock
Copy link
Owner

This is not possible because all our contrast building functions work on data frames, not on arrays. For example, imagine you have a numeric predictor and a string predictor. In a data frame, that is two columns. In a numpy array, we must one hot encode the string variable, so the array has many more columns.

This is why in the scikit learn proposal, there must be a pipeline object to convert a single newdata into two separate X and y: #35

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants