Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

marginaleffects::marginal_means deprecation #246

Conversation

vincentarelbundock
Copy link
Contributor

I am preparing for release 1.0.0 of marginaleffects and deprecating functions that will not be part of the “final / stable” user interface.

The marginal_means function now raises a deprecation warning and will be removed eventually.

The recommended strategy is now to use: predictions() with the datagrid() function to create a balanced grid, and the by argument to specify the variables for marginalization.

IMHO, this is more explicit and in line with the rest of the marginaleffects function design:

library(emmeans)
library(marginaleffects)

pigs.lm <- lm(log(conc) ~ source + factor(percent), data = pigs)

emm.src <- emmeans(pigs.lm, specs = "source")
summary(emm.src, infer = TRUE, null = log(35))
#  source emmean     SE df lower.CL upper.CL null t.ratio p.value
#  fish     3.39 0.0367 23     3.32     3.47 3.56  -4.385  0.0002
#  soy      3.67 0.0374 23     3.59     3.74 3.56   2.988  0.0066
#  skim     3.80 0.0394 23     3.72     3.88 3.56   6.130  <.0001
# 
# Results are averaged over the levels of: percent 
# Results are given on the log (not the response) scale. 
# Confidence level used: 0.95


predictions(pigs.lm,                          # predictions using the `pigs.lm` model
  newdata = datagrid(grid_type = "balanced"), # on a balanced grid
  by = "source",                              # averaged across the `source` variable
  hypothesis = log(35),                       # null hypothesis
  df = 23)
# 
#  source Estimate Std. Error     t Pr(>|t|)    S 2.5 % 97.5 % Df
#    fish     3.39     0.0367 -4.39  < 0.001 12.2  3.32   3.47 23
#    soy      3.67     0.0374  2.99  0.00657  7.3  3.59   3.74 23
#    skim     3.80     0.0394  6.13  < 0.001 18.4  3.72   3.88 23
# 
# Columns: source, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, df 
# Type:  response

Copy link

codecov bot commented Feb 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (9e2bb40) 35.10% compared to head (3e666dc) 34.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #246      +/-   ##
==========================================
- Coverage   35.10%   34.00%   -1.11%     
==========================================
  Files          25       24       -1     
  Lines        1168     1147      -21     
==========================================
- Hits          410      390      -20     
+ Misses        758      757       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@DominiqueMakowski
Copy link
Member

Thanks Vincent, I've been finally experimenting a bit with the marginaleffects API, so I think I'm in a better position now to push do the overhaul that modelbased needs to depend on the latest marginaleffects

@vincentarelbundock
Copy link
Contributor Author

Sounds fantastic!

Don't hesitate to ping me if you need input, review, or code snippets.

@DominiqueMakowski
Copy link
Member

DominiqueMakowski commented Feb 11, 2024

Hi @vincentarelbundock I'm giving this a shot,

I'd like to make marginaleffects::predictions() work with insight::get_datagrid(), could you let me know if it's possible to, in the following example, somehow specify to predictions() that I want to marginalize over the continuous predictor Sepal.Width. I thought that providing a newdata grid with "missing" values would work, but no, so tried adding the by argument so specify, but it doesn't seem to do what I have in mind.

model <- lm(Petal.Length ~ Sepal.Width * Species, data = iris)
grid <- insight::get_datagrid(model, at="Species")
grid$Sepal.Width <- NULL
grid
#>      Species
#> 1     setosa
#> 2 versicolor
#> 3  virginica

marginaleffects::predictions(model,
                                        newdata=grid,
                                        by="Species")

Created on 2024-02-11 with reprex v2.0.2

tldr; what is the best way to obtain marginal means at species levels marginalized over Sepal.width?

@vincentarelbundock
Copy link
Contributor Author

What do you mean specifically by "marginalized over Sepal.Width"?

@vincentarelbundock
Copy link
Contributor Author

vincentarelbundock commented Feb 11, 2024

predictions() always requires a “full” grid, with all the variables used in the model. You should really think about it as enhanced version of the Base R function predict(). That’s why I used the same argument name newdata.

In addition, you get the by argument which specifies the categorical variables with respect to which we are averaging.

For example, you can marginalize over a balanced grid like this:

library(marginaleffects)

model <- lm(Petal.Length ~ Sepal.Width * Species, data = iris)

grid <- expand.grid(
  Sepal.Width = unique(iris$Sepal.Width),
  Species = unique(iris$Species)
)

predictions(model,
  newdata = grid,
  by = "Species")
# 
#     Species Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %
#  setosa         1.44     0.0638 22.6   <0.001 374.6  1.32   1.57
#  versicolor     4.62     0.0930 49.7   <0.001   Inf  4.44   4.80
#  virginica      5.71     0.0668 85.5   <0.001   Inf  5.58   5.84
# 
# Columns: Species, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
# Type:  response

Or you can marginalized over the empirical distribution of the actual data like the following (this is equivalent to the “cells” weights in emmeans):

predictions(model,
  newdata = iris,
  by = "Species")
# 
#     Species Estimate Std. Error     z Pr(>|z|)     S 2.5 % 97.5 %
#  setosa         1.46     0.0545  26.8   <0.001 524.4  1.36   1.57
#  versicolor     4.26     0.0545  78.2   <0.001   Inf  4.15   4.37
#  virginica      5.55     0.0545 101.9   <0.001   Inf  5.45   5.66
# 
# Columns: Species, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
# Type:  response

@DominiqueMakowski
Copy link
Member

Say I want to reproduce the emmeans results:

model <- lm(Petal.Length ~ Sepal.Width * Species, data = iris)
emmeans::emmeans(model, "Species")
#> NOTE: Results may be misleading due to involvement in interactions
#>  Species    emmean     SE  df lower.CL upper.CL
#>  setosa       1.43 0.0766 144     1.28     1.58
#>  versicolor   4.50 0.0742 144     4.35     4.65
#>  virginica    5.61 0.0563 144     5.50     5.72
#> 
#> Confidence level used: 0.95

Is it correct that the "optimal" way to do it in marginaleffects is the first option below:

marginaleffects::predictions(model,
                             newdata=marginaleffects::datagrid(grid_type = "balanced"),
                             variables="Species",
                             by="Species")
#> 
#>     Species Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %
#>  setosa         1.43     0.0766 18.7   <0.001 256.7  1.28   1.58
#>  versicolor     4.50     0.0742 60.6   <0.001   Inf  4.36   4.65
#>  virginica      5.61     0.0563 99.6   <0.001   Inf  5.50   5.72
#> 
#> Columns: Species, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
#> Type:  response

marginaleffects::predictions(model,
                             by="Species")
#> 
#>     Species Estimate Std. Error     z Pr(>|z|)     S 2.5 % 97.5 %
#>  setosa         1.46     0.0545  26.8   <0.001 524.4  1.36   1.57
#>  versicolor     4.26     0.0545  78.2   <0.001   Inf  4.15   4.37
#>  virginica      5.55     0.0545 101.9   <0.001   Inf  5.45   5.66
#> 
#> Columns: Species, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
#> Type:  response

Created on 2024-02-11 with reprex v2.0.2

I would intuitively expect the second to give the same results...

@DominiqueMakowski
Copy link
Member

Oh I see, after reading your comment I think that basically marginaleffects requires to explicitly state what to marginalize over.

If I understand, there are roughly 3 types of "means" you could get from the model:

model <- lm(Petal.Length ~ Sepal.Width * Species, data = iris)


# Marginalized over range
marginaleffects::predictions(model,
                             newdata = insight::get_datagrid(model),
                             by = "Species")
#> 
#>     Species Estimate Std. Error     z Pr(>|z|)     S 2.5 % 97.5 %
#>  setosa         1.47     0.0548  26.7   <0.001 521.2  1.36   1.57
#>  versicolor     4.17     0.0574  72.7   <0.001   Inf  4.06   4.29
#>  virginica      5.52     0.0549 100.6   <0.001   Inf  5.42   5.63
#> 
#> Columns: Species, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
#> Type:  response

# Marginalize over empirical distribution
marginaleffects::predictions(model,
                             newdata = insight::get_data(model),
                             by = "Species")
#> 
#>     Species Estimate Std. Error     z Pr(>|z|)     S 2.5 % 97.5 %
#>  setosa         1.46     0.0545  26.8   <0.001 524.4  1.36   1.57
#>  versicolor     4.26     0.0545  78.2   <0.001   Inf  4.15   4.37
#>  virginica      5.55     0.0545 101.9   <0.001   Inf  5.45   5.66
#> 
#> Columns: Species, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
#> Type:  response

# Predictions at fixed mean
marginaleffects::predictions(model,
                             newdata = insight::get_datagrid(model, at="Species"))
#> 
#>  Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %    Species Sepal.Width
#>      1.43     0.0766 18.7   <0.001 256.7  1.28   1.58 setosa            3.06
#>      4.50     0.0742 60.6   <0.001   Inf  4.36   4.65 versicolor        3.06
#>      5.61     0.0563 99.6   <0.001   Inf  5.50   5.72 virginica         3.06
#> 
#> Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, Species, Sepal.Width, Petal.Length 
#> Type:  response

Created on 2024-02-11 with reprex v2.0.2

Am I on the right track? Sorry for the dumb question I'm trying to wrap my head around the newest API

@vincentarelbundock
Copy link
Contributor Author

Yes, that's exactly right!

And if you're interested, here's a notebook where I show how to get equivalent results to emmeans with "re-gridding" and transformations:

https://arelbundock.com/emmeans_and_marginaleffects.html

@DominiqueMakowski
Copy link
Member

Thanks for your help @vincentarelbundock, I think I managed to find a fix for now to have estimate_means() working (I pushed directly on the main so I'll close this but thanks!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants