
move visualisation_matrix lower to make it more re-usable #44

Open
DominiqueMakowski opened this issue Dec 10, 2021 · 22 comments
@DominiqueMakowski
Member

A robust solution to generate the data / visualisation grid could be useful to other developers/packages as well (@vincentarelbundock), so we might want to see whether to move it to datawizard or insight. We could rename it something like data_grid() or get_datagrid() (if in insight), and reexport it under its current name (visualization_matrix) in modelbased to avoid any breaking changes.

What do you think?

@vincentarelbundock

Interesting idea.

I’m not super familiar with visualization_matrix, but I took a quick look at the documentation and it seems very cool.

FWIW, I already have a similar function called typical:
library(marginaleffects)

dat <- mtcars[, c("mpg", "gear", "hp", "am", "vs")]
dat$gear <- as.factor(dat$gear)
mod <- lm(mpg ~ hp + am, mtcars)

typical(model = mod, hp = seq(111, 120, by = 5))
##        am  hp
## 1 0.40625 111
## 2 0.40625 116
typical(newdata = head(dat, 3))
##    mpg gear       hp am        vs
## 1 21.6    4 104.3333  1 0.3333333

AFAICT, the main differences are:

  1. at values are passed either by parsing strings like "Petal.Width = seq(-3, 3)" (in visualization_matrix) or through ... (in typical).
  2. visualization_matrix allows users to specify which function to use to summarize variables not mentioned explicitly (e.g., mean or mode).

On the one hand, since I am already committed to the ... strategy, I would have to adopt a new dependency and write a wrapper to make things work as before in typical. On the other hand, allowing different summary functions for factors or numerics would just require a new argument and a couple of lines of code. So it seems like either way I'll have to do some work.

All this to say that, from a conceptual perspective, I agree it would make more sense to put visualization_matrix in datawizard. However, I’m not sure it makes sense to do work on my account, given that I’m not even 100% sure I would adopt…

Thanks for the ping!

@DominiqueMakowski
Member Author

Would it be worth it to allow passing variables via ... in vizmatrix?

adopt a new dependency

Maybe in insight then? Although this function really manipulates data, it is meant to be used with models most of the time, so it falls somewhere between datawizard and insight.

@vincentarelbundock

Looks like the ... are already used for something else:

… Arguments passed to or from other methods (for instance, length or range to control the spread of numeric variables).

I agree it’s often a good idea to make sure the functions are in the “right” package. But I’m also not sure if there’s enough immediate demand such that it would be worth much trouble to move things around.

FWIW, I just added support for visualisation_matrix to marginaleffects (GitHub devel-only), so you can now use it directly without having to specify the x argument manually (saves a few keystrokes):

library(modelbased)
library(marginaleffects)

mod <- lm(mpg ~ factor(cyl) + hp, mtcars)

predictions(mod, newdata = visualisation_matrix(at = "cyl"))
#>       type predicted std.error conf.low conf.high cyl      mpg       hp
#> 1 response  25.12392  1.368888 22.31988  27.92796   4 20.09062 146.6875
#> 2 response  19.15627  1.247190 16.60151  21.71102   6 20.09062 146.6875
#> 3 response  16.60307  1.278754 13.98366  19.22248   8 20.09062 146.6875

marginaleffects(mod, newdata = visualisation_matrix(at = "cyl"))
#>   rowid     type term contrast        dydx  std.error cyl      mpg       hp
#> 1     1 response  cyl    6 - 4 -5.96765508 1.63927762   4 20.09062 146.6875
#> 2     2 response  cyl    6 - 4 -5.96765508 1.63927762   6 20.09062 146.6875
#> 3     3 response  cyl    6 - 4 -5.96765508 1.63927762   8 20.09062 146.6875
#> 4     1 response  cyl    8 - 4 -8.52085075 2.32607494   4 20.09062 146.6875
#> 5     2 response  cyl    8 - 4 -8.52085075 2.32607494   6 20.09062 146.6875
#> 6     3 response  cyl    8 - 4 -8.52085075 2.32607494   8 20.09062 146.6875
#> 7     1 response   hp          -0.02403883 0.01540789   4 20.09062 146.6875
#> 8     2 response   hp          -0.02403883 0.01540790   6 20.09062 146.6875
#> 9     3 response   hp          -0.02403883 0.01540789   8 20.09062 146.6875

@strengejacke
Member

@vincentarelbundock when you look at str() of the visualization-matrix object, you see some special attributes. Could you also add attributes such as "adjusted_for", "at_specs", "at" or "preserve_range" to the object returned by predictions()?

You may choose different names, but the general idea is to have this information available for potential plotting methods.

@DominiqueMakowski
Member Author

Looks like the ... are already used for something else:

Well, we could query the ellipsis, and if the user passed something that corresponds to the name of a variable in the data frame, use it. The only edge case would be if someone has a variable called length and does visualisation_matrix(mod, length = 3), as this would be an ambiguous formulation. But it's fairly easy to catch that and throw a clear warning explaining it.

I guess it boils down to whether anyone (@strengejacke @mattansb?) can think of any other use case for data-gridding beyond modelbased. Wouldn't it be useful in insight? I vaguely remember that in the prediction CI computations we had to obtain the grid to get the covariance or something. Perhaps it's a false memory; I can blame that on long COVID.
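The ellipsis-matching idea above could be sketched roughly as follows. This is a hypothetical illustration, not the modelbased implementation: grid_from_dots and the set of assumed control arguments (length, range) are made up for the example.

```r
# Hypothetical sketch: match names passed via `...` against the variables
# of a data frame, and warn when a name such as "length" collides with a
# control argument (the ambiguous case described above).
grid_from_dots <- function(data, ...) {
  dots <- list(...)
  control_args <- c("length", "range")  # assumed control arguments
  ambiguous <- intersect(names(dots), intersect(control_args, names(data)))
  if (length(ambiguous) > 0) {
    warning("Ambiguous argument(s): ", paste(ambiguous, collapse = ", "),
            ". Interpreted as variable(s), not as control argument(s).")
  }
  # keep only the dots that name actual variables
  vars <- dots[names(dots) %in% names(data)]
  # all combinations of the user-specified values
  expand.grid(vars)
}

grid_from_dots(mtcars, cyl = c(4, 8), am = 0:1)
```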

@strengejacke
Member

btw, predictions() with visualization_matrix() as arg for newdata again fails for brms-models. Should be related to easystats/modelbased#161.

@strengejacke
Member

Looks like the ... are already used for something else:

Well, we could query the ellipsis, and if the user passed something that corresponds to the name of a variable in the data frame, use it. The only edge case would be if someone has a variable called length and does visualisation_matrix(mod, length = 3), as this would be an ambiguous formulation. But it's fairly easy to catch that and throw a clear warning explaining it.

I guess it boils down to whether anyone (@strengejacke @mattansb?) can think of any other use case for data-gridding beyond modelbased. Wouldn't it be useful in insight? I vaguely remember that in the prediction CI computations we had to obtain the grid to get the covariance or something. Perhaps it's a false memory; I can blame that on long COVID.

Can you write some example code or similar? I don't get the discussion about the ellipses, and I don't see why we should not move vis-mat to datawizard.

@vincentarelbundock

vincentarelbundock when you look at str() of the visualization-matrix object, you see some special attributes. Could you also add attributes such as "adjusted_for", "at_specs", "at" or "preserve_range" to the object returned by predictions()?

Absolutely. In the dev version I now attach all the attributes of the visualisation_matrix object except names, row.names, class, data, and reference (the original data is already included in the main predictions object). All attribute labels are the same, except they are prefixed by newdata_.

library(modelbased)
library(marginaleffects)
m <- lm(mpg ~ hp + factor(cyl), mtcars)
p <- predictions(m, newdata = visualisation_matrix(at = "cyl"))
attributes(p) |> names()
#>  [1] "names"                  "class"                  "row.names"             
#>  [4] "model"                  "type"                   "model_type"            
#>  [7] "variables"              "newdata_adjusted_for"   "newdata_at_specs"      
#> [10] "newdata_at"             "newdata_preserve_range" "newdata_table_title"   
#> [13] "newdata_table_footer"

@vincentarelbundock

btw, predictions() with visualization_matrix() as arg for newdata again fails for brms-models.

@strengejacke, would you have an easy example on hand? Here’s what I got when I tried:

library(brms)
library(insight)
library(modelbased)
library(marginaleffects)
mod <- insight::download_model("brms_1")

predictions(mod, newdata = visualisation_matrix(at = "cyl = c(2, 4, 6)"))
#>   rowid     type predicted cyl      mpg      wt conf.low conf.high
#> 1     1 response  26.32771   2 20.09062 3.21725 22.44373  29.81315
#> 2     2 response  23.35533   4 20.09062 3.21725 21.28254  25.37644
#> 3     3 response  20.35782   6 20.09062 3.21725 19.38795  21.27495

@strengejacke
Member

strengejacke commented Jan 23, 2022

library(modelbased)
library(marginaleffects)

mod <- brms::brm(mpg ~ factor(cyl) + hp, data = mtcars, refresh = 0, iter = 200)

predictions(mod, newdata = visualisation_matrix(at = "cyl"))
#> Error: Variable 'cyl' was originally numeric but is not in 'newdata'.

Created on 2022-01-23 by the reprex package (v2.0.1)

@vincentarelbundock

I don't get the discussion about the ellipses

I think Dominique brought up the ellipses in response to my code with marginaleffects::datagrid which uses a different user-input style, based on the ellipses. You can already do the same things with visualisation_matrix, so it's only a question of style and convenience:

visualisation_matrix(mod, at = "cyl = c(4, 8)")
datagrid(mod, cyl = c(4, 8))

In terms of functionality, they're the same, so I am not advocating a change here, just noting the difference.

why we should not move vis-mat to datawizard?

IMHO, it would make more sense in datawizard. The main downsides are time and effort, and making sure the function is re-exported by modelbased to ensure backward compatibility for those who don't load datawizard explicitly.

So the question is: is there enough demand to justify the effort? As explained above, I'm not sure I'll use it, since I already implemented something very similar, and people can now use visualisation_matrix in newdata if they like that better. But maybe someone else will use it eventually.

@strengejacke
Member

I think this is most likely due to this here: easystats/modelbased#161 (comment)

If cyl is numeric in the original data but coerced to factor within the formula, visualization_matrix() coerces it to factor to create a valid reference grid. However, this fails with brms. If we don't coerce to factor but keep cyl as numeric, it produces values from 4 to 8, including fractional parts, which are "invalid factor levels" when passed to predict() as newdata. We probably need to check whether cyl is numeric in the data but a factor in the formula; if so, coerce it to factor to generate the reference grid, but back-convert it to numeric before visualization_matrix() returns the grid...
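A minimal base-R sketch of the proposed workaround (the helper name factor_to_numeric is hypothetical; modelbased/insight would use their own internals): build the grid with cyl as a factor so the at-values are the observed levels, then back-convert to numeric before handing the grid to predict().

```r
# Convert a factor whose levels are numbers back to numeric values.
factor_to_numeric <- function(x) as.numeric(as.character(x))

mod <- lm(mpg ~ hp + factor(cyl), data = mtcars)

# Grid built with cyl as factor -> only valid levels 4, 6, 8 appear.
grid <- data.frame(cyl = factor(c(4, 6, 8)), hp = mean(mtcars$hp))

# Back-convert before returning, so backends expecting the original
# numeric type (e.g. brms) accept the newdata.
grid$cyl <- factor_to_numeric(grid$cyl)

predict(mod, newdata = grid)
```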

@strengejacke
Member

The main downsides are time and effort,

If the discussion is about moving it to either insight or datawizard, the effort is the same either way, so I'd say it should go into datawizard. Furthermore, I suggest either allowing named vectors as in your datagrid() function to define at-values (i.e., evaluating the ellipses), or a list (like at = list(cyl = c(4, 8), hp = 150)). Specifying "cyl = c(4, 8)" as a string seems strange to me.
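The list-based specification suggested above could look roughly like this. A hypothetical sketch, not any package's API: variables named in the list get the supplied values, and the remaining numeric variables are held at their mean.

```r
# Hypothetical sketch of `at = list(...)`: expand the listed values into
# a grid, holding all other numeric variables at a typical value (mean).
grid_from_list <- function(data, at) {
  others  <- setdiff(names(data), names(at))
  typical <- lapply(data[others],
                    function(x) if (is.numeric(x)) mean(x) else unique(x)[1])
  expand.grid(c(at, typical))
}

g <- grid_from_list(mtcars[, c("mpg", "cyl", "hp")],
                    at = list(cyl = c(4, 8), hp = 150))
g
```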

@vincentarelbundock

Yes, the lazy choice would be to leave it in modelbased, but I agree that datawizard would be a more intuitive landing spot.

visualization_matrix() coerces to factor to create a valid reference grid.

One approach would be to avoid coercing data types (which feels unsafe in general) by checking the new factor attribute assigned by insight and treating that variable accordingly.

Specifying "cyl = c(4, 8)" as string seems strange to me.

On aesthetic grounds, I'm not a big fan of this either, but backward compatibility should probably be maintained.

@vincentarelbundock

Scratch my comment about the insight factor attribute; it won't always be there in visualisation_matrix.

@strengejacke
Member

strengejacke commented Jan 23, 2022

Here I try to summarize the main problem in visualization_matrix() with the in-formula conversion of numeric variables into factors:

library(insight)
library(modelbased)

m1 <- lm(mpg ~ hp + factor(cyl), mtcars)
m2 <- lm(mpg ~ hp + cyl, mtcars)

# numeric, marked as factor
get_data(m1) |> attr("factors")
#> [1] "cyl"
class(get_data(m1)$cyl)
#> [1] "numeric"

# numeric, not marked as factor
get_data(m2) |> attr("factors")
#> NULL
class(get_data(m2)$cyl)
#> [1] "numeric"

# if cyl is numeric, but used as factor inside formula
visualisation_matrix(m1, at = "cyl")
#> Visualisation Grid
#> 
#> cyl |     hp
#> ------------
#> 4   | 146.69
#> 6   | 146.69
#> 8   | 146.69
#> 
#> Maintained constant: hp
str(visualisation_matrix(m1, at = "cyl")$cyl)
#>  Factor w/ 3 levels "4","6","8": 1 2 3

# if cyl is numeric
visualisation_matrix(m2, at = "cyl")
#> Visualisation Grid
#> 
#>  cyl |     hp
#> -------------
#> 4.00 | 146.69
#> 4.44 | 146.69
#> 4.89 | 146.69
#> 5.33 | 146.69
#> 5.78 | 146.69
#> 6.22 | 146.69
#> 6.67 | 146.69
#> 7.11 | 146.69
#> 7.56 | 146.69
#> 8.00 | 146.69
#> 
#> Maintained constant: hp
str(visualisation_matrix(m2, at = "cyl")$cyl)
#>  num [1:10] 4 4.44 4.89 5.33 5.78 ...

# if "cyl" is treated as numeric in "visualisation_matrix()"
# it fails due to invalid levels
predict(m1, newdata = visualisation_matrix(m2, at = "cyl"))
#> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor factor(cyl) has new levels 4.44444444444444, 4.88888888888889, 5.33333333333333, 5.77777777777778, 6.22222222222222, 6.66666666666667, 7.11111111111111, 7.55555555555556

m3 <- brms::brm(mpg ~ hp + factor(cyl), mtcars, refresh = 0, iter = 100, chains = 2)
# if "cyl" is treated as factor in "visualisation_matrix()"
# it fails for brms because original cyl was numeric
predict(m3, newdata = visualisation_matrix(m3, at = "cyl"))
#> Error: Variable 'cyl' was originally numeric but is not in 'newdata'.

# --- solution ----

# if we create data grid with factor (to have "proper" at-values for cyl)
# and then convert "cyl" back to original numeric, it works for all models
vm_factor <- visualisation_matrix(m1, at = "cyl")
vm_factor$cyl <- insight:::.factor_to_numeric(vm_factor$cyl)

predict(m1, newdata = vm_factor)
#>        1        2        3 
#> 25.12392 19.15627 16.60307
predict(m2, newdata = vm_factor)
#>        1        2        3 
#> 25.04464 20.51526 15.98587
predict(m3, newdata = vm_factor)
#>      Estimate Est.Error     Q2.5    Q97.5
#> [1,] 25.96428  3.613558 19.36254 33.29545
#> [2,] 19.10267  3.321300 13.97888 25.33048
#> [3,] 16.15320  2.849919 10.43424 20.63559

Created on 2022-01-23 by the reprex package (v2.0.1)

@DominiqueMakowski
Member Author

Regarding datawizard-insight-modelbased location, here are some of the arguments

  • In datawizard: it conceptually lies there as it mostly manipulates data
  • In insight:
    • It could be useful there as the prediction stuff lies there (and maybe for some other stuff)
    • It also conceptually lies there as it's primarily used in the context of models
    • If other packages were to re-use it or parts of it (ggeffects, marginaleffects; though that does not currently seem to be the case), having it there would avoid having another pseudo-dependency (a point brought up by Vincent)
  • In modelbased: no breaking change

I agree that evaluating a list is more intuitive; the reason for strings was to allow some elaborate specifications like ranges (e.g., "mpg = [3, 6]" is understood as a range rather than as a list of two values).
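The range-vs-values distinction described above could be sketched as follows. parse_at and the default sequence length are illustrative assumptions, not the modelbased parser: "[3, 6]" is filled with a sequence spanning the range, while "c(3, 6)" yields exactly those two values.

```r
# Hypothetical sketch: interpret "[a, b]" as a range to span with a
# sequence, and any other spec as literal R code giving exact values.
parse_at <- function(spec, length_out = 10) {
  if (grepl("^\\[.*\\]$", spec)) {
    bounds <- as.numeric(strsplit(gsub("\\[|\\]", "", spec), ",")[[1]])
    seq(bounds[1], bounds[2], length.out = length_out)
  } else {
    eval(parse(text = spec))
  }
}

parse_at("[3, 6]")   # 10 values spanning 3 to 6
parse_at("c(3, 6)")  # exactly the two values 3 and 6
```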

@strengejacke
Member

strengejacke commented Jan 24, 2022

Re: list argument: this can be implemented in addition to the current behaviour, so no breaking changes.
Re: location: maybe in insight as get_datagrid() or so, and re-exported in datawizard with the alias visualization_matrix()? After reading your comment, I agree it would fit into both insight and datawizard (or into neither): it is partly model information (insight) and partly data manipulation (datawizard). We could also just move it to insight as visualization_matrix().

@DominiqueMakowski
Member Author

maybe in insight as get_datagrid() or so, and in datawizard re-exported with alias visualization_matrix()

That's what I had in mind too. Maybe we can think of adding some features? Are there any cool and useful features in ggeffects & marginaleffects that we could try reproducing, so that it's on par feature-wise?

@strengejacke
Member

I think what could be useful are handy shortcuts to define the at-values, see https://strengejacke.github.io/ggeffects/articles/introduction_effectsatvalues.html. @mattansb mentioned that what he often uses is something like "x [meansd]", which calculates the mean and +1/-1 SD as at-values.
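The "[meansd]" shortcut mentioned above boils down to three values. A minimal sketch (the helper name meansd_values is hypothetical; see ggeffects for the real implementation):

```r
# Expand a numeric variable to mean - SD, mean, and mean + SD,
# the three at-values implied by the "[meansd]" shortcut.
meansd_values <- function(x) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  c(m - s, m, m + s)
}

meansd_values(mtcars$hp)
```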

@strengejacke
Member

strengejacke commented Jan 24, 2022

btw, this fancy code is protected by patent from me. :-D

https://github.com/strengejacke/ggeffects/blob/2f5cf0a99c169dd9ee2eb1c92ba26559a8715147/R/pretty_range.R#L50-L76

(it calculates the perfect range for prettifying values)
