Wrong DFs in model_parameters.fixest
#892

parameters::model_parameters appears to use the wrong degrees of freedom to compute p values in this model, which results in an overly conservative test.
I haven't dug into this, but I would assume the correct df here is 9, given that it's a fixed-effects estimate, so get_dof() is what I would expect to be wrong.
I don't think so. The reprex below shows that the two models are equivalent. @lrberge, what is the best way for us to extract the relevant degrees of freedom to compute the p value with fixest models?

library(parameters)
library(fixest)
library(carData)
d <- Greene
d$dv <- ifelse(Greene$decision == 'yes', 1, 0)
# These two models are equivalent
coef(feglm(dv ~ language | judge,
data = d,
family = 'logit'))["languageFrench"]
# languageFrench
# -0.6837183
coef(glm(dv ~ language + factor(judge), d, family = binomial))["languageFrench"]
# languageFrench
# -0.6837183
That's fair--like I said, I hadn't dug into the example--what's the p value reported by the package itself?
Not 100% sure, but the package reports a much smaller p, so it's either using different degrees of freedom or a different test statistic.
I can investigate this weekend
Degrees of freedom are tricky because they depend on the type of standard error computed. In particular, when SEs are clustered, the variables nested in the clusters are discarded from the DoF calculation, so they differ from the regular DoF we have in mind. I think that's what happens in your example. To get the DoF as used in fixest, there is the function degrees_freedom. To get the one we have in mind (N - all vars), there is degrees_freedom_iid. I hope it helps!
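For what it's worth, a minimal sketch of the two accessors mentioned above, reusing the model from the reprexes in this thread (output omitted; the exact values depend on the standard errors requested and the fixest version):

library(fixest)
library(carData)

d <- Greene
d$dv <- ifelse(Greene$decision == "yes", 1, 0)

mod <- feglm(dv ~ language | judge,
             data = d,
             cluster = c("judge"), family = "logit")

# DoF as used by fixest with clustered SEs: variables nested in the
# cluster (here the judge fixed effects) are dropped from the count
degrees_freedom(mod, type = "t")

# "Regular" DoF we usually have in mind: N minus all estimated parameters
degrees_freedom_iid(mod, type = "t")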
There are three types of df in play here:

library(parameters)
library(fixest)
#>
#> Attaching package: 'fixest'
#> The following objects are masked from 'package:parameters':
#>
#> demean, dof
library(carData)
d <- Greene
d$dv <- ifelse(Greene$decision == 'yes', 1, 0)
mod <- feglm(dv ~ language | judge,
data = d,
cluster = c('judge'), family = 'logit')
summary(mod)
#> GLM estimation, family = binomial(link = "logit"), Dep. Var.: dv
#> Observations: 384
#> Fixed-effects: judge: 10
#> Standard-errors: Clustered (judge)
#> Estimate Std. Error t value Pr(>|t|)
#> languageFrench -0.683718 0.275539 -2.48139 0.013087 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Log-Likelihood: -211.1 Adj. Pseudo R2: 0.053406
#> BIC: 487.6 Squared Cor.: 0.114888
ci(mod, method = "wald")
#> Parameter CI CI_low CI_high
#> 1 languageFrench 0.95 -1.30703 -0.06040618
ci(mod, method = "residual")
#> Parameter CI CI_low CI_high
#> 1 languageFrench 0.95 -1.225481 -0.1419557
p_value(mod, method = "wald")
#> Parameter p
#> 1 languageFrench 0.03491193
p_value(mod, method = "residual")
#> Parameter p
#> 1 languageFrench 0.01351663
model_parameters(mod, ci_type = "wald")
#> # Fixed Effects
#>
#> Parameter | Log-Odds | SE | 95% CI | z | df | p
#> -------------------------------------------------------------------------
#> language [French] | -0.68 | 0.28 | [-1.31, -0.06] | -2.48 | 9 | 0.035
#>
#> Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
#> using a Wald z-distribution approximation.
#>
#> The model has a log- or logit-link. Consider using `exponentiate =
#> TRUE` to interpret coefficients as ratios.
model_parameters(mod, ci_type = "residual")
#> # Fixed Effects
#>
#> Parameter | Log-Odds | SE | 95% CI | z | df | p
#> -------------------------------------------------------------------------
#> language [French] | -0.68 | 0.28 | [-1.31, -0.06] | -2.48 | 9 | 0.035
#>
#> Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
#> using a Wald z-distribution approximation.

Created on 2023-08-15 with reprex v2.0.2
What would be good to know is a) whether [...]; I guess, fixing [...] (cf. Line 25 in 81ebf40).
I'm on holidays without a computer, so I can't be very thorough, and sorry my first reply was off, but:
There's a mismatch between the test statistic stored in the model object and the one reported by summary():

library(fixest)
library(carData)
d <- Greene
d$dv <- ifelse(Greene$decision == 'yes', 1, 0)
# These two models are equivalent
m <- feglm(dv ~ language | judge,
data = d,
family = 'logit')
# z-value?
m$coeftable
#> Estimate Std. Error z value Pr(>|z|)
#> languageFrench -0.6837183 0.3241979 -2.108953 0.03494862
# t-value?
summary(m)
#> GLM estimation, family = binomial(link = "logit"), Dep. Var.: dv
#> Observations: 384
#> Fixed-effects: judge: 10
#> Standard-errors: Clustered (judge)
#> Estimate Std. Error t value Pr(>|t|)
#> languageFrench -0.683718 0.275539 -2.48139 0.013087 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Log-Likelihood: -211.1 Adj. Pseudo R2: 0.053406
#> BIC: 487.6 Squared Cor.: 0.114888

Created on 2023-08-23 with reprex v2.0.2
Indeed, it's normal. Since computing VCOVs is costly, it is something that is delayed in fixest. Once you pass summary(), you get the "right" standard errors. I advise in the doc (I think, but maybe not!) to use user-level methods to access the objects, like the function coeftable, but it's not super clear and, importantly, it is not the standard in R (where we're used to accessing stuff directly in the object). Sorry for the implementation that creates confusion.
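As a small illustration (a sketch reusing m from the reprex just above): the table stored in the object keeps the estimation-time standard errors, while the user-level coeftable() accessor applied to the summary reflects the VCOV that was actually computed:

# raw table stored at estimation time (no clustering applied yet)
m$coeftable

# user-level accessor: summary() triggers the VCOV computation
# (clustered by judge by default here), so the SEs and statistic differ
coeftable(summary(m))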
library(parameters)
library(fixest)
#>
#> Attaching package: 'fixest'
#> The following objects are masked from 'package:parameters':
#>
#> demean, dof
library(carData)
d <- Greene
d$dv <- ifelse(Greene$decision == 'yes', 1, 0)
mod <- feglm(dv ~ language | judge,
data = d,
cluster = c('judge'), family = 'logit')
summary(mod)
#> GLM estimation, family = binomial(link = "logit"), Dep. Var.: dv
#> Observations: 384
#> Fixed-effects: judge: 10
#> Standard-errors: Clustered (judge)
#> Estimate Std. Error t value Pr(>|t|)
#> languageFrench -0.683718 0.275539 -2.48139 0.013087 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Log-Likelihood: -211.1 Adj. Pseudo R2: 0.053406
#> BIC: 487.6 Squared Cor.: 0.114888
model_parameters(mod)
#> # Fixed Effects
#>
#> Parameter | Log-Odds | SE | 95% CI | t(382) | p
#> ---------------------------------------------------------------------
#> language [French] | -0.68 | 0.28 | [-1.23, -0.14] | -2.48 | 0.014
#>
#> Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
#> using a Wald t-distribution approximation.
#>
#> The model has a log- or logit-link. Consider using `exponentiate =
#> TRUE` to interpret coefficients as ratios.

Created on 2023-09-09 with reprex v2.0.2

Do we have any examples for z-statistic models?
(requires easystats/insight#802 to be installed to work correctly)
Ok, this is confusing. When I have a t-statistic, how do I know whether we need residual-df or t-df to compute p-values?

library(parameters)
library(fixest)
#>
#> Attaching package: 'fixest'
#> The following objects are masked from 'package:parameters':
#>
#> demean, dof
library(carData)
data("qol_cancer")
d <- Greene
d$dv <- ifelse(Greene$decision == "yes", 1, 0)
qol_cancer <- cbind(
qol_cancer,
datawizard::demean(qol_cancer, select = c("phq4", "QoL"), group = "ID")
)
mod1 <- feglm(dv ~ language | judge,
data = d,
cluster = c("judge"), family = "logit"
)
mod2 <- fixest::feols(
QoL ~ time + phq4 | ID,
data = qol_cancer
)
summary(mod1)
#> GLM estimation, family = binomial(link = "logit"), Dep. Var.: dv
#> Observations: 384
#> Fixed-effects: judge: 10
#> Standard-errors: Clustered (judge)
#> Estimate Std. Error t value Pr(>|t|)
#> languageFrench -0.683718 0.275539 -2.48139 0.013087 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Log-Likelihood: -211.1 Adj. Pseudo R2: 0.053406
#> BIC: 487.6 Squared Cor.: 0.114888
statistic <- insight::get_statistic(mod1)$Statistic
statistic
#> [1] -2.481386
# no "t" df here, mismatch between p and p from summary
degrees_freedom(mod1, type = "t")
#> [1] 9
dof <- 9
2 * stats::pt(abs(statistic), df = dof, lower.tail = FALSE)
#> [1] 0.03491193
# resid-df are close and *almost* match "p" from summary
degrees_freedom(mod1, type = "resid")
#> [1] 382
dof <- 382
2 * stats::pt(abs(statistic), df = dof, lower.tail = FALSE)
#> [1] 0.01351663
# seems like Inf is correct here
dof <- Inf
2 * stats::pt(abs(statistic), df = dof, lower.tail = FALSE)
#> [1] 0.01308724
summary(mod2)
#> OLS estimation, Dep. Var.: QoL
#> Observations: 564
#> Fixed-effects: ID: 188
#> Standard-errors: Clustered (ID)
#> Estimate Std. Error t value Pr(>|t|)
#> time 1.08785 0.669197 1.62561 1.0572e-01
#> phq4 -3.66052 0.671116 -5.45438 1.5425e-07 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 10.1 Adj. R2: 0.613325
#> Within R2: 0.179856
statistic <- insight::get_statistic(mod2)$Statistic
statistic
#> [1] 1.625609 -5.454377
# now t-df are correct and match "p" from summary
degrees_freedom(mod2, type = "t")
#> [1] 187
dof <- 187
2 * stats::pt(abs(statistic), df = dof, lower.tail = FALSE)
#> [1] 1.057170e-01 1.542519e-07
# resid-df are *not* correct and do not match "p" from summary
degrees_freedom(mod2, type = "resid")
#> [1] 561
dof <- 561
2 * stats::pt(abs(statistic), df = dof, lower.tail = FALSE)
#> [1] 1.045945e-01 7.379301e-08
# Inf not correct
dof <- Inf
2 * stats::pt(abs(statistic), df = dof, lower.tail = FALSE)
#> [1] 1.040329e-01 4.914483e-08

Created on 2023-09-09 with reprex v2.0.2
To summarize: for the feglm model (mod1), neither the t-df (9) nor the residual df (382) reproduce the p-value from summary(); df = Inf (i.e. a normal approximation) does. For the feols model (mod2), the t-df (187) match summary(), while the residual df and Inf do not.
So we need to check mod$method or mod$method_type?

> mod1$method
[1] "feglm"
> mod1$method_type
[1] "feglm"
> mod2$method
[1] "feols"
> mod2$method_type
[1] "feols" ? |
The z value is used for binomial and Poisson models (quasi families use t). This is identical to what glm does. To be clear:

library(fixest)
# tiny data set
base = setNames(head(iris, 10), c("y", "x1", "x2", "x3", "species"))
set.seed(1)
base$y01 = sample(0:1, nrow(base), TRUE)
gather_pvalues = function(family){
est_fixest = feglm(y01 ~ x1 + x2, base, family = family, vcov = iid ~ ssc(adj = FALSE))
est_glm = glm(y01 ~ x1 + x2, base, family = family)
tstat = tstat(est_fixest, keep = "x1")
res = c(fixest = pvalue(est_fixest, keep = "x1"),
glm = pvalue(est_glm, keep = "x1"),
normal = pnorm(-abs(tstat)) * 2,
"t" = pt(-abs(tstat), degrees_freedom(est_fixest, "resid")) * 2)
round(res, 4)
}
# N = normal; t = Student
cbind("logit (N)" = gather_pvalues(binomial()),
"probit (N)" = gather_pvalues(binomial("probit")),
"poisson (N)" = gather_pvalues(poisson()),
"quasipoisson (t)" = gather_pvalues(quasipoisson()),
"gaussian (t)" = gather_pvalues(gaussian()))
#> logit (N) probit (N) poisson (N) quasipoisson (t) gaussian (t)
#> fixest.x1 0.2139 0.2099 0.3168 0.3275 0.2707
#> glm.x1 0.2139 0.2099 0.3168 0.3275 0.2707
#> normal.x1 0.2139 0.2099 0.3168 0.2925 0.2318
#> t.x1 0.2539 0.2502 0.3501 0.3275 0.2707

BTW, there was a century-old display bug in fixest ...
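To make the rule concrete, here is a rough, hypothetical sketch of the kind of check this suggests for a downstream package (not the actual parameters implementation; it assumes feglm objects store a standard family object in $family, as the summaries above suggest):

# Hypothetical helper following the rule above: binomial and Poisson feglm
# models use the normal (z) distribution; quasi families, gaussian, and
# feols use the t distribution.
fixest_uses_z <- function(model) {
  if (!identical(model$method, "feglm")) {
    return(FALSE)  # e.g. feols reports t statistics
  }
  fam <- model$family$family  # assumes a standard family object is stored
  isTRUE(fam %in% c("binomial", "poisson"))
}

With the two models from the earlier reprexes, this should return TRUE for mod1 (feglm, binomial) and FALSE for mod2 (feols), matching the z/t split in their summaries.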
Thanks, that helped me understand how to fix this in parameters!
Thanks all for your work and insights into this. I really appreciate your time!