Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check_collinearity() does not work with orthogonal polynomials #733

Closed
jmgirard opened this issue Jun 12, 2024 · 10 comments · Fixed by #786
Closed

check_collinearity() does not work with orthogonal polynomials #733

jmgirard opened this issue Jun 12, 2024 · 10 comments · Fixed by #786
Assignees
Labels
Docs 📚 Something to be adressed in docs and/or vignettes

Comments

@jmgirard
Copy link
Contributor

jmgirard commented Jun 12, 2024

Related to easystats/see#346

I know that orthogonal polynomials won't be collinear (that's their point). But I want to be able to show that while teaching.

library(easystats)
#> # Attaching packages: easystats 0.7.2 (red = needs update)
#> ✔ bayestestR  0.13.2   ✔ correlation 0.8.4 
#> ✔ datawizard  0.11.0   ✔ effectsize  0.8.8 
#> ✖ insight     0.20.0   ✖ modelbased  0.8.7 
#> ✔ performance 0.12.0   ✔ parameters  0.21.7
#> ✔ report      0.5.8    ✔ see         0.8.4 
#> 
#> Restart the R-Session and update packages with `easystats::easystats_update()`.
x <- runif(n = 100, min = 100, max = 110)
y <- 0.2 + 1.2 * x + rnorm(100, 0, 1)
df <- data.frame(x, y)

fit <- lm(y ~ poly(x, degree = 2), data = df)
check_collinearity(fit)
#> Not enough model terms in the conditional part of the model to check for
#>   multicollinearity.
#> NULL

Created on 2024-06-12 with reprex v2.1.0

When other variables are included, there is no error but the orthogonal polynomials are considered a single term. Is that correct?

library(easystats)

x <- runif(n = 100, min = 100, max = 110)
g <- sample(c(0, 1), size = 100, replace = TRUE)
y <- 0.2 + 1.2 * x + 0.4 * g + rnorm(100, 0, 1)

df <- data.frame(x, y, g)

fit <- lm(y ~ poly(x, degree = 2) + g, data = df)
check_collinearity(fit)
#> # Check for Multicollinearity
#> 
#> Low Correlation
#> 
#>                 Term  VIF   VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#>  poly(x, degree = 2) 1.04 [1.00, 4.96]         1.02      0.96     [0.20, 1.00]
#>                    g 1.04 [1.00, 4.96]         1.02      0.96     [0.20, 1.00]

Created on 2024-06-12 with reprex v2.1.0

@strengejacke
Copy link
Member

Have you checked with car::VIF()?

@jmgirard
Copy link
Contributor Author

jmgirard commented Jun 12, 2024

car returns an error:

x <- runif(n = 100, min = 100, max = 110)
y <- 0.2 + 1.2 * x + rnorm(100, 0, 1)
df <- data.frame(x, y)
fit <- lm(y ~ poly(x, degree = 2), data = df)
performance::check_collinearity(fit)
#> Not enough model terms in the conditional part of the model to check for
#>   multicollinearity.
#> NULL
car::vif(fit)
#> Error in vif.default(fit): model contains fewer than 2 terms

Created on 2024-06-12 with reprex v2.1.0

@mattansb
Copy link
Member

When other variables are included, there is no error but the orthogonal polynomials are considered a single term. Is that correct?

I think that makes sense - there is only 1 underlying variable, and adding orthogonal functions of the variable won't have any collinearity between them.

@jmgirard
Copy link
Contributor Author

jmgirard commented Jun 12, 2024

When other variables are included, there is no error but the orthogonal polynomials are considered a single term. Is that correct?

I think that makes sense - there is only 1 underlying variable, and adding orthogonal functions of the variable won't have any collinearity between them.

They won't have collinearity and they are powered by a single variable, but technically won't there be one term per degree? And in some weird world, couldn't it be possible for the different terms to have different VIFs in a model with additional variables?

@jmgirard
Copy link
Contributor Author

Also, note that this same behavior occurs even when using raw polynomials, which definitely could be collinear:

x <- runif(n = 100, min = 100, max = 110)
y <- 0.2 + 1.2 * x + rnorm(100, 0, 1)
df <- data.frame(x, y)
fit <- lm(y ~ poly(x, degree = 2, raw = TRUE), data = df)
performance::check_collinearity(fit)
#> Not enough model terms in the conditional part of the model to check for
#>   multicollinearity.
#> NULL
car::vif(fit)
#> Error in vif.default(fit): model contains fewer than 2 terms

Created on 2024-06-12 with reprex v2.1.0

@mattansb
Copy link
Member

poly is considered one term - like factors or matrices:

data("mtcars")

mod <- lm(mpg ~ cbind(disp, hp) + poly(qsec, 2) + factor(cyl),
          data = mtcars)

insight::find_terms(mod)
#> $response
#> [1] "mpg"
#> 
#> $conditional
#> [1] "cbind(disp, hp)" "poly(qsec, 2)"   "factor(cyl)"
anova(mod)
#> Analysis of Variance Table
#> 
#> Response: mpg
#>                 Df Sum Sq Mean Sq F value  Pr(>F)    
#> cbind(disp, hp)  2 842.55  421.28 51.7367 1.3e-09 ***
#> poly(qsec, 2)    2  11.95    5.97  0.7337 0.49020    
#> factor(cyl)      2  67.98   33.99  4.1742 0.02728 *  
#> Residuals       25 203.57    8.14                    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
performance::check_collinearity(mod)
#> # Check for Multicollinearity
#> 
#> Low Correlation
#> 
#>           Term  VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#>  poly(qsec, 2) 4.39 [ 2.91,  6.99]         2.09      0.23     [0.14, 0.34]
#> 
#> Moderate Correlation
#> 
#>         Term  VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#>  factor(cyl) 9.45 [ 5.99, 15.31]         3.07      0.11     [0.07, 0.17]
#> 
#> High Correlation
#> 
#>             Term   VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#>  cbind(disp, hp) 23.03 [14.25, 37.63]         4.80      0.04     [0.03, 0.07]

Created on 2024-06-13 with reprex v2.1.0

If you want each feature to be a term, you will need to hand code them.

mod <- lm(mpg ~ disp + hp + qsec + I(qsec^2) + I(cyl==6) + I(cyl==8),
          data = mtcars)

insight::find_terms(mod)
#> $response
#> [1] "mpg"
#> 
#> $conditional
#> [1] "disp"        "hp"          "qsec"        "I(qsec^2)"   "I(cyl == 6)"
#> [6] "I(cyl == 8)"
anova(mod)
#> Analysis of Variance Table
#> 
#> Response: mpg
#>             Df Sum Sq Mean Sq F value    Pr(>F)    
#> disp         1 808.89  808.89 99.3390 3.428e-10 ***
#> hp           1  33.67   33.67  4.1344   0.05277 .  
#> qsec         1   6.71    6.71  0.8235   0.37282    
#> I(qsec^2)    1   5.24    5.24  0.6438   0.42990    
#> I(cyl == 6)  1  59.13   59.13  7.2616   0.01241 *  
#> I(cyl == 8)  1   8.85    8.85  1.0867   0.30718    
#> Residuals   25 203.57    8.14                      
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
performance::check_collinearity(mod)
#> # Check for Multicollinearity
#> 
#> Low Correlation
#> 
#>         Term  VIF       VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#>  I(cyl == 6) 2.09 [  1.53,   3.25]         1.45      0.48     [0.31, 0.65]
#> 
#> Moderate Correlation
#> 
#>  Term  VIF       VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#>  disp 7.88 [  5.03,  12.72]         2.81      0.13     [0.08, 0.20]
#>    hp 9.03 [  5.74,  14.62]         3.01      0.11     [0.07, 0.17]
#> 
#> High Correlation
#> 
#>         Term    VIF       VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#>         qsec 348.67 [212.31, 573.00]        18.67  2.87e-03     [0.00, 0.00]
#>    I(qsec^2) 320.53 [195.20, 526.75]        17.90  3.12e-03     [0.00, 0.01]
#>  I(cyl == 8)  11.38 [  7.17,  18.48]         3.37      0.09     [0.05, 0.14]

Created on 2024-06-13 with reprex v2.1.0

@jmgirard
Copy link
Contributor Author

If I modified the function to do the following, would you accept the PR?

  • Check if there is only one term
  • If so, check if it is a poly term
  • If so, check if raw is true or false
  • If true, refit with I() and output collinearity
  • If false, output a note saying there is no collinearity

@bwiernik
Copy link
Contributor

I don't think it makes sense to treat polynomials as being more than one variable/term. The collinearity of polynomial terms is non-essential, so VIF diagnostics don't make much sense for them. Collinearity isn't diagnostic of model issues for polynomial terms (raw or not)--the use of orthogonal polynomials is for computational reasons, not model validity reasons.

@jmgirard
Copy link
Contributor Author

jmgirard commented Jun 15, 2024

That makes sense to me. What about just a message when giving it only a poly term that explains this?

For example:

check_collinearity(lm(y ~ poly(x, 2), data = df))
#> Message: Polynomial transformations are considered a single term and thus VIFs are not calculated between them.

@DominiqueMakowski
Copy link
Member

I think a note in the details section of the function documentation should suffice to avoid a very verbose messaging

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs 📚 Something to be adressed in docs and/or vignettes
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants