
move visualisation_matrix lower to make it more re-usable #44

Open
DominiqueMakowski opened this issue Dec 10, 2021 · 22 comments
@DominiqueMakowski
Member

A robust solution to generate the data / visualisation grid could be useful to other developers/packages as well (@vincentarelbundock), so we might want to see whether to move it to datawizard or insight. We could rename it something like data_grid() or get_datagrid() (if in insight), and reexport it under its current name (visualization_matrix) in modelbased to avoid any breaking changes.

What do you think?

@vincentarelbundock

Interesting idea.

I’m not super familiar with visualization_matrix, but I took a quick look at the documentation and it seems very cool.

FWIW, I already have a similar function called typical:
library(marginaleffects)

dat <- mtcars[, c("mpg", "gear", "hp", "am", "vs")]
dat$gear <- as.factor(dat$gear)
mod <- lm(mpg ~ hp + am, mtcars)

typical(model = mod, hp = seq(111, 120, by = 5))
##        am  hp
## 1 0.40625 111
## 2 0.40625 116
typical(newdata = head(dat, 3))
##    mpg gear       hp am        vs
## 1 21.6    4 104.3333  1 0.3333333

AFAICT, the main differences are:

  1. at values are passed either by parsing strings like "Petal.Width = seq(-3, 3)" (in visualization_matrix) or through ... (in typical).
  2. visualization_matrix allows users to specify which function to use to summarize variables not mentioned explicitly (e.g., mean or mode).

On the one hand, since I am already committed to the ... strategy, I would have to adopt a new dependency and write a wrapper to make things work as before in typical. On the other hand, allowing different summary functions for factors or numerics would just require a new argument and a couple of lines of code. So it seems like either way I'll have to do some work.

All this to say that, from a conceptual perspective, I agree it would make more sense to put visualization_matrix in datawizard. However, I’m not sure it makes sense to do work on my account, given that I’m not even 100% sure I would adopt…

Thanks for the ping!

@DominiqueMakowski
Member Author

Would it be worth it to allow passing variables via ... in vizmatrix?

adopt a new dependency

Maybe in insight then? Although this function really manipulates data, it is meant to be used with models most of the time, so it falls somewhere between datawizard and insight.

@vincentarelbundock

Looks like the ... are already used for something else:

… Arguments passed to or from other methods (for instance, length or range to control the spread of numeric variables).

I agree it’s often a good idea to make sure the functions are in the “right” package. But I’m also not sure if there’s enough immediate demand such that it would be worth much trouble to move things around.

FWIW, I just added support for visualisation_matrix to marginaleffects (GitHub devel-only), so you can now use it directly without having to specify the x argument manually (saves a few keystrokes):

library(modelbased)
library(marginaleffects)

mod <- lm(mpg ~ factor(cyl) + hp, mtcars)

predictions(mod, newdata = visualisation_matrix(at = "cyl"))
#>       type predicted std.error conf.low conf.high cyl      mpg       hp
#> 1 response  25.12392  1.368888 22.31988  27.92796   4 20.09062 146.6875
#> 2 response  19.15627  1.247190 16.60151  21.71102   6 20.09062 146.6875
#> 3 response  16.60307  1.278754 13.98366  19.22248   8 20.09062 146.6875

marginaleffects(mod, newdata = visualisation_matrix(at = "cyl"))
#>   rowid     type term contrast        dydx  std.error cyl      mpg       hp
#> 1     1 response  cyl    6 - 4 -5.96765508 1.63927762   4 20.09062 146.6875
#> 2     2 response  cyl    6 - 4 -5.96765508 1.63927762   6 20.09062 146.6875
#> 3     3 response  cyl    6 - 4 -5.96765508 1.63927762   8 20.09062 146.6875
#> 4     1 response  cyl    8 - 4 -8.52085075 2.32607494   4 20.09062 146.6875
#> 5     2 response  cyl    8 - 4 -8.52085075 2.32607494   6 20.09062 146.6875
#> 6     3 response  cyl    8 - 4 -8.52085075 2.32607494   8 20.09062 146.6875
#> 7     1 response   hp          -0.02403883 0.01540789   4 20.09062 146.6875
#> 8     2 response   hp          -0.02403883 0.01540790   6 20.09062 146.6875
#> 9     3 response   hp          -0.02403883 0.01540789   8 20.09062 146.6875

@strengejacke
Member

@vincentarelbundock when you look at str() of the visualization-matrix object, you see some special attributes. Could you also add attributes such as "adjusted_for", "at_specs", "at" or "preserve_range" to the object returned by predictions()?

You may choose different names, but the general idea is to have this information available for potential plotting methods.

@DominiqueMakowski
Member Author

Looks like the ... are already used for something else:

Well, we could query the ellipsis, and if the user passed something that corresponds to the name of a variable in the data frame, use it. The only edge case would be if someone has a variable called length and does visualisation_matrix(mod, length = 3), as this would be an ambiguous formulation. But it's fairly easy to catch that and throw a clear warning explaining it.

I guess it boils down to whether anyone (@strengejacke @mattansb?) can think of any other use case for data-gridding beyond modelbased. Wouldn't it be useful in insight? I vaguely remember that in the prediction CI computations we had to obtain the grid to get the covariance or something. Perhaps it's a false memory; I can blame that on long COVID.
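The ellipsis-matching idea above could be sketched roughly as follows. This is a hypothetical illustration, not the modelbased implementation: grid_from_dots and the set of assumed control arguments (length, range) are made up for the example.

```r
# Hypothetical sketch: match names passed via `...` against the variables
# of a data frame, and warn when a name such as "length" collides with a
# control argument (the ambiguous case described above).
grid_from_dots <- function(data, ...) {
  dots <- list(...)
  control_args <- c("length", "range")  # assumed control arguments
  ambiguous <- intersect(names(dots), intersect(control_args, names(data)))
  if (length(ambiguous) > 0) {
    warning("Ambiguous argument(s): ", paste(ambiguous, collapse = ", "),
            ". Interpreted as variable(s), not as control argument(s).")
  }
  # keep only the dots that name actual variables
  vars <- dots[names(dots) %in% names(data)]
  # all combinations of the user-specified values
  expand.grid(vars)
}

grid_from_dots(mtcars, cyl = c(4, 8), am = 0:1)
```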

@strengejacke
Member

btw, predictions() with visualization_matrix() as arg for newdata again fails for brms-models. Should be related to easystats/modelbased#161.

@strengejacke
Member

Looks like the ... are already used for something else:

Well, we could query the ellipsis, and if the user passed something that corresponds to the name of a variable in the data frame, use it. The only edge case would be if someone has a variable called length and does visualisation_matrix(mod, length = 3), as this would be an ambiguous formulation. But it's fairly easy to catch that and throw a clear warning explaining it.

I guess it boils down to whether anyone (@strengejacke @mattansb?) can think of any other use case for data-gridding beyond modelbased. Wouldn't it be useful in insight? I vaguely remember that in the prediction CI computations we had to obtain the grid to get the covariance or something. Perhaps it's a false memory; I can blame that on long COVID.

Can you write some example code or similar? I don't get the discussion about the ellipses, and I don't see why we should not move vis-mat to datawizard.

@vincentarelbundock

vincentarelbundock when you look at str() of the visualization-matrix object, you see some special attributes. Could you also add attributes such as "adjusted_for", "at_specs", "at" or "preserve_range" to the object returned by predictions()?

Absolutely. In the dev version I now attach all the attributes of the visualisation_matrix object except names, row.names, class, data, and reference (the original data is already included in the main predictions object). All attribute labels are the same, except they are prefixed by newdata_.

library(modelbased)
library(marginaleffects)
m <- lm(mpg ~ hp + factor(cyl), mtcars)
p <- predictions(m, newdata = visualisation_matrix(at = "cyl"))
attributes(p) |> names()
#>  [1] "names"                  "class"                  "row.names"             
#>  [4] "model"                  "type"                   "model_type"            
#>  [7] "variables"              "newdata_adjusted_for"   "newdata_at_specs"      
#> [10] "newdata_at"             "newdata_preserve_range" "newdata_table_title"   
#> [13] "newdata_table_footer"

@vincentarelbundock

btw, predictions() with visualization_matrix() as arg for newdata again fails for brms-models.

@strengejacke, would you have an easy example on hand? Here’s what I got when I tried:

library(brms)
library(insight)
library(modelbased)
library(marginaleffects)
mod <- insight::download_model("brms_1")

predictions(mod, newdata = visualisation_matrix(at = "cyl = c(2, 4, 6)"))
#>   rowid     type predicted cyl      mpg      wt conf.low conf.high
#> 1     1 response  26.32771   2 20.09062 3.21725 22.44373  29.81315
#> 2     2 response  23.35533   4 20.09062 3.21725 21.28254  25.37644
#> 3     3 response  20.35782   6 20.09062 3.21725 19.38795  21.27495

@strengejacke
Member

strengejacke commented Jan 23, 2022

library(modelbased)
library(marginaleffects)

mod <- brms::brm(mpg ~ factor(cyl) + hp, data = mtcars, refresh = 0, iter = 200)

predictions(mod, newdata = visualisation_matrix(at = "cyl"))
#> Error: Variable 'cyl' was originally numeric but is not in 'newdata'.

Created on 2022-01-23 by the reprex package (v2.0.1)

@vincentarelbundock

I don't get the discussion about the ellipses

I think Dominique brought up the ellipses in response to my code with marginaleffects::datagrid which uses a different user-input style, based on the ellipses. You can already do the same things with visualisation_matrix, so it's only a question of style and convenience:

visualisation_matrix(mod, at = "cyl = c(4, 8)")
datagrid(mod, cyl = c(4, 8))

In terms of functionality, they're the same, so I am not advocating a change here, just noting the difference.

why we should not move vis-mat to datawizard?

IMHO, it would make more sense in datawizard. The main downsides are time and effort, and making sure the function is re-exported by modelbased to ensure backward compatibility for those who don't load datawizard explicitly.

So the question is: is there enough demand to justify the effort? As explained above, I'm not sure I'll use it, since I already implemented something very similar, and people can now use visualisation_matrix in newdata if they like that better. But maybe someone else will use it eventually.

@strengejacke
Member

I think this is most likely due to this here: easystats/modelbased#161 (comment)

If cyl is numeric in the original data but coerced to factor within the formula, visualization_matrix() coerces it to factor to create a valid reference grid. However, this fails with brms. If we don't coerce to factor but keep cyl as numeric, it produces values from 4 to 8, including fractional parts, which are "invalid factor levels" when passed to predict() as newdata. We probably need to check whether cyl is numeric in the data but a factor in the formula; if so, coerce it to factor to generate the reference grid, but back-convert it to numeric before visualization_matrix() returns the grid...
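A minimal base-R sketch of the proposed workaround (the helper name factor_to_numeric is hypothetical; modelbased/insight would use their own internals): build the grid with cyl as a factor so the at-values are the observed levels, then back-convert to numeric before handing the grid to predict().

```r
# Convert a factor whose levels are numbers back to numeric values.
factor_to_numeric <- function(x) as.numeric(as.character(x))

mod <- lm(mpg ~ hp + factor(cyl), data = mtcars)

# Grid built with cyl as factor -> only valid levels 4, 6, 8 appear.
grid <- data.frame(cyl = factor(c(4, 6, 8)), hp = mean(mtcars$hp))

# Back-convert before returning, so backends expecting the original
# numeric type (e.g. brms) accept the newdata.
grid$cyl <- factor_to_numeric(grid$cyl)

predict(mod, newdata = grid)
```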

@strengejacke
Member

The main downsides are time and effort,

If the discussion is about moving it to either insight or datawizard, the effort is the same either way, so I'd say it should go into datawizard. Furthermore, I suggest either allowing named vectors as in your datagrid() function to define at-values (i.e., evaluating the ellipses), or a list (like at = list(cyl = c(4, 8), hp = 150)). Specifying "cyl = c(4, 8)" as a string seems strange to me.
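The list-based specification suggested above could look roughly like this. A hypothetical sketch, not any package's API: variables named in the list get the supplied values, and the remaining numeric variables are held at their mean.

```r
# Hypothetical sketch of `at = list(...)`: expand the listed values into
# a grid, holding all other numeric variables at a typical value (mean).
grid_from_list <- function(data, at) {
  others  <- setdiff(names(data), names(at))
  typical <- lapply(data[others],
                    function(x) if (is.numeric(x)) mean(x) else unique(x)[1])
  expand.grid(c(at, typical))
}

g <- grid_from_list(mtcars[, c("mpg", "cyl", "hp")],
                    at = list(cyl = c(4, 8), hp = 150))
g
```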

@vincentarelbundock

Yes, the lazy choice would be to leave it in modelbased, but I agree that datawizard would be a more intuitive landing spot.

visualization_matrix() coerces to factor to create a valid reference grid.

One approach would be to avoid coercing data types (which feels unsafe in general) by checking the new factor attribute assigned by insight and treating that variable accordingly.

Specifying "cyl = c(4, 8)" as string seems strange to me.

On aesthetic grounds, I'm not a big fan of this either, but backward compatibility should probably be maintained.

@vincentarelbundock

Scratch my comment about the insight factor attribute; it won't always be there in visualisation_matrix.

@strengejacke
Member

strengejacke commented Jan 23, 2022

Here I try to summarize the main problem in visualization_matrix() with the in-formula conversion of numeric variables into factors:

library(insight)
library(modelbased)

m1 <- lm(mpg ~ hp + factor(cyl), mtcars)
m2 <- lm(mpg ~ hp + cyl, mtcars)

# numeric, marked as factor
get_data(m1) |> attr("factors")
#> [1] "cyl"
class(get_data(m1)$cyl)
#> [1] "numeric"

# numeric, not marked as factor
get_data(m2) |> attr("factors")
#> NULL
class(get_data(m2)$cyl)
#> [1] "numeric"

# if cyl is numeric, but used as factor inside formula
visualisation_matrix(m1, at = "cyl")
#> Visualisation Grid
#> 
#> cyl |     hp
#> ------------
#> 4   | 146.69
#> 6   | 146.69
#> 8   | 146.69
#> 
#> Maintained constant: hp
str(visualisation_matrix(m1, at = "cyl")$cyl)
#>  Factor w/ 3 levels "4","6","8": 1 2 3

# if cyl is numeric
visualisation_matrix(m2, at = "cyl")
#> Visualisation Grid
#> 
#>  cyl |     hp
#> -------------
#> 4.00 | 146.69
#> 4.44 | 146.69
#> 4.89 | 146.69
#> 5.33 | 146.69
#> 5.78 | 146.69
#> 6.22 | 146.69
#> 6.67 | 146.69
#> 7.11 | 146.69
#> 7.56 | 146.69
#> 8.00 | 146.69
#> 
#> Maintained constant: hp
str(visualisation_matrix(m2, at = "cyl")$cyl)
#>  num [1:10] 4 4.44 4.89 5.33 5.78 ...

# if "cyl" is treated as numeric in "visualisation_matrix()"
# it fails due to invalid levels
predict(m1, newdata = visualisation_matrix(m2, at = "cyl"))
#> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor factor(cyl) has new levels 4.44444444444444, 4.88888888888889, 5.33333333333333, 5.77777777777778, 6.22222222222222, 6.66666666666667, 7.11111111111111, 7.55555555555556

m3 <- brms::brm(mpg ~ hp + factor(cyl), mtcars, refresh = 0, iter = 100, chains = 2)
# if "cyl" is treated as factor in "visualisation_matrix()"
# it fails for brms because original cyl was numeric
predict(m3, newdata = visualisation_matrix(m3, at = "cyl"))
#> Error: Variable 'cyl' was originally numeric but is not in 'newdata'.

# --- solution ----

# if we create data grid with factor (to have "proper" at-values for cyl)
# and then convert "cyl" back to original numeric, it works for all models
vm_factor <- visualisation_matrix(m1, at = "cyl")
vm_factor$cyl <- insight:::.factor_to_numeric(vm_factor$cyl)

predict(m1, newdata = vm_factor)
#>        1        2        3 
#> 25.12392 19.15627 16.60307
predict(m2, newdata = vm_factor)
#>        1        2        3 
#> 25.04464 20.51526 15.98587
predict(m3, newdata = vm_factor)
#>      Estimate Est.Error     Q2.5    Q97.5
#> [1,] 25.96428  3.613558 19.36254 33.29545
#> [2,] 19.10267  3.321300 13.97888 25.33048
#> [3,] 16.15320  2.849919 10.43424 20.63559

Created on 2022-01-23 by the reprex package (v2.0.1)

@DominiqueMakowski
Member Author

Regarding datawizard-insight-modelbased location, here are some of the arguments

  • In datawizard: it conceptually lies there as it mostly manipulates data
  • In insight:
    • It could be useful there as the prediction stuff lies there (and maybe for some other stuff)
    • It also conceptually lies there as it's primarily used in the context of models
    • If other packages were to re-use it or parts of it (ggeffects, marginaleffects; though that does not currently seem to be the case), having it there would avoid having another pseudo-dependency (a point brought up by Vincent)
  • In modelbased: no breaking change

I agree that evaluating a list is more intuitive; the reason for strings was to allow some elaborate specifications like ranges (e.g., "mpg = [3, 6]" is understood as a range rather than as a list of two values).
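The range-vs-values distinction described above could be sketched as follows. parse_at and the default sequence length are illustrative assumptions, not the modelbased parser: "[3, 6]" is filled with a sequence spanning the range, while "c(3, 6)" yields exactly those two values.

```r
# Hypothetical sketch: interpret "[a, b]" as a range to span with a
# sequence, and any other spec as literal R code giving exact values.
parse_at <- function(spec, length_out = 10) {
  if (grepl("^\\[.*\\]$", spec)) {
    bounds <- as.numeric(strsplit(gsub("\\[|\\]", "", spec), ",")[[1]])
    seq(bounds[1], bounds[2], length.out = length_out)
  } else {
    eval(parse(text = spec))
  }
}

parse_at("[3, 6]")   # 10 values spanning 3 to 6
parse_at("c(3, 6)")  # exactly the two values 3 and 6
```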

@strengejacke
Member

strengejacke commented Jan 24, 2022

Re: list argument: this can be implemented in addition to the current behaviour, so no breaking changes.
Re: location: maybe in insight as get_datagrid() or so, and re-exported in datawizard with the alias visualization_matrix()? After reading your comment, I agree it would fit into both insight and datawizard (or into neither): it is partly model information (insight) and partly data manipulation (datawizard). We could also just move it to insight as visualization_matrix().

@DominiqueMakowski
Member Author

maybe in insight as get_datagrid() or so, and in datawizard re-exported with alias visualization_matrix()

That's what I had in mind too. Maybe we can think of adding some features? Are there any cool and useful features in ggeffects & marginaleffects that we could try reproducing, so that it's on par feature-wise?

@strengejacke
Member

I think what could be useful are handy shortcuts to define the at-values, see https://strengejacke.github.io/ggeffects/articles/introduction_effectsatvalues.html. @mattansb mentioned that what he often uses is something like "x [meansd]", which calculates the mean and +1/-1 SD as at-values.
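The "[meansd]" shortcut mentioned above boils down to three values. A minimal sketch (the helper name meansd_values is hypothetical; see ggeffects for the real implementation):

```r
# Expand a numeric variable to mean - SD, mean, and mean + SD,
# the three at-values implied by the "[meansd]" shortcut.
meansd_values <- function(x) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  c(m - s, m, m + s)
}

meansd_values(mtcars$hp)
```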

@strengejacke
Member

strengejacke commented Jan 24, 2022

btw, this fancy code is protected by patent from me. :-D

https://github.com/strengejacke/ggeffects/blob/2f5cf0a99c169dd9ee2eb1c92ba26559a8715147/R/pretty_range.R#L50-L76

(it calculates the perfect range for prettifying values)
