forked from moderndive/ModernDive_book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
10-inference-for-regression.Rmd
891 lines (622 loc) · 59.4 KB
/
10-inference-for-regression.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
# Inference for Regression {#inference-for-regression}
```{r setup_inference_regression, include=FALSE, purl=FALSE}
# Used to define Learning Check numbers:
chap <- 10
lc <- 0
# Set R code chunk defaults:
opts_chunk$set(
echo = TRUE,
eval = TRUE,
warning = FALSE,
message = TRUE,
tidy = FALSE,
purl = TRUE,
out.width = "\\textwidth",
fig.height = 4,
fig.align = "center"
)
# Set output digit precision
options(scipen = 99, digits = 3)
# Set random number generator see value for replicable pseudorandomness.
set.seed(76)
```
In our penultimate chapter, we'll revisit the regression models we first studied in Chapters \@ref(regression) and \@ref(multiple-regression). Armed with our knowledge of confidence intervals and hypothesis tests from Chapters \@ref(confidence-intervals) and \@ref(hypothesis-testing), we'll be able to apply statistical inference to further our understanding of relationships between outcome and explanatory variables.
### Needed packages {-#inf-packages}
Let's load all the packages needed for this chapter (this assumes you've already installed them). Recall from our discussion in Section \@ref(tidyverse-package) that loading the `tidyverse` package by running `library(tidyverse)` loads the following commonly used data science packages all at once:
* `ggplot2` for data visualization
* `dplyr` for data wrangling
* `tidyr` for converting data to "tidy" format
* `readr` for importing spreadsheet data into R
* As well as the more advanced `purrr`, `tibble`, `stringr`, and `forcats` packages
If needed, read Section \@ref(packages) for information on how to install and load R packages.
```{r message=FALSE}
library(tidyverse)
library(moderndive)
library(infer)
```
```{r message=FALSE, echo=FALSE, purl=FALSE}
# Packages needed internally, but not in text.
library(tidyr)
library(kableExtra)
library(patchwork)
```
## Regression refresher
Before jumping into inference for regression, let's remind ourselves of the University of Texas Austin teaching evaluations analysis in Section \@ref(model1).
### Teaching evaluations analysis
Recall using simple linear regression \index{regression!simple linear} we modeled the relationship between
1. A numerical outcome variable $y$ (the instructor's teaching score) and
1. A single numerical explanatory variable $x$ (the instructor's "beauty" score).
We first created an `evals_ch5` data frame that selected a subset of variables from the `evals` data frame included in the `moderndive` package. This `evals_ch5` data frame contains only the variables of interest for our analysis, in particular the instructor's teaching `score` and the "beauty" rating `bty_avg`:
```{r}
evals_ch5 <- evals %>%
select(ID, score, bty_avg, age)
glimpse(evals_ch5)
```
```{r, echo=FALSE, purl=FALSE}
cor_ch6 <- evals_ch5 %>%
summarize(correlation = cor(score, bty_avg)) %>%
pull(correlation) %>%
round(3)
```
In Subsection \@ref(model1EDA), we performed an exploratory data analysis of the relationship between these two variables of `score` and `bty_avg`. We saw there that a weakly positive correlation of `r cor_ch6` existed between the two variables.
This was evidenced in Figure \@ref(fig:regline) of the scatterplot along with the "best-fitting" regression line that summarizes the linear relationship between the two variables of `score` and `bty_avg`. Recall in Subsection \@ref(leastsquares) that we defined a "best-fitting" line as the line that minimizes the *sum of squared residuals*.
```{r regline, fig.cap="Relationship with regression line.", fig.height=3.2, message=FALSE}
ggplot(evals_ch5,
aes(x = bty_avg, y = score)) +
geom_point() +
labs(x = "Beauty Score",
y = "Teaching Score",
title = "Relationship between teaching and beauty scores") +
geom_smooth(method = "lm", se = FALSE)
```
Looking at this plot again, you might be asking, "Does that line really have all that positive of a slope?". It does increase from left to right as the `bty_avg` variable increases, but by how much? To get to this information, recall that we followed a two-step procedure:
1. We first "fit" the linear regression model using the `lm()` function with the formula `score ~ bty_avg`. We saved this model in `score_model`.
1. We get the regression table by applying the `get_regression_table()`\index{moderndive!get\_regression\_table()} function from the `moderndive` package to `score_model`.
```{r, eval=FALSE}
# Fit regression model:
score_model <- lm(score ~ bty_avg, data = evals_ch5)
# Get regression table:
get_regression_table(score_model)
```
```{r regtable-11, echo=FALSE, purl=FALSE}
# Fit regression model:
score_model <- lm(score ~ bty_avg, data = evals_ch5)
get_regression_table(score_model) %>%
kable(
digits = 3,
caption = "Previously seen linear regression table",
booktabs = TRUE,
linesep = ""
) %>%
kable_styling(
font_size = ifelse(is_latex_output(), 10, 16),
latex_options = c("hold_position")
)
# slope:
slope_row <- get_regression_table(score_model) %>%
filter(term == "bty_avg")
b1 <- slope_row %>% pull(estimate)
se1 <- slope_row %>% pull(std_error)
t1 <- slope_row %>% pull(statistic)
lower1 <- slope_row %>% pull(lower_ci)
upper1 <- slope_row %>% pull(upper_ci)
# intercept:
intercept_row <- get_regression_table(score_model) %>%
filter(term == "intercept")
b0 <- intercept_row %>% pull(estimate)
se0 <- intercept_row %>% pull(std_error)
t0 <- intercept_row %>% pull(statistic)
lower0 <- intercept_row %>% pull(lower_ci)
upper0 <- intercept_row %>% pull(upper_ci)
# keep trailing zero for b0 for consistency with print edition
b0_trailing <- intercept_row %>%
pull(estimate) %>%
round(3) %>%
format(nsmall = 3)
```
Using the values in the `estimate` column of the resulting regression table in Table \@ref(tab:regtable-11), we could then obtain the equation of the "best-fitting" regression line in Figure \@ref(fig:regline):
$$
\begin{aligned}
\widehat{y} &= b_0 + b_1 \cdot x\\
\widehat{\text{score}} &= b_0 + b_{\text{bty}\_\text{avg}} \cdot\text{bty}\_\text{avg}\\
&= `r b0_trailing` + `r b1`\cdot\text{bty}\_\text{avg}
\end{aligned}
$$
where $b_0$ is the fitted intercept and $b_1$ is the fitted slope for `bty_avg`. Recall the interpretation of the $b_1$ = `r b1` value of the fitted slope:
> For every increase of one unit in "beauty" rating, there is an associated increase, on average, of `r b1` units of evaluation score.
Thus, the slope value quantifies the relationship between the $y$ variable `score` and the $x$ variable `bty_avg`. We also discussed the intercept value of $b_0$ = `r b0` and its lack of practical interpretation, since the range of possible "beauty" scores does not include 0.
### Sampling scenario
Let's now revisit this study in terms of the terminology and notation related to sampling we studied in Subsection \@ref(terminology-and-notation).
```{r echo=FALSE, purl=FALSE}
# This code is used for dynamic non-static in-line text output purposes
n_evals_ch5 <- evals_ch5 %>% nrow()
```
First, let's view the instructors for these `r n_evals_ch5` courses as a *representative sample* from a greater *study population*. In our case, let's assume that the study population is *all* instructors at UT Austin and that the sample of instructors who taught these `r n_evals_ch5` courses is a representative sample. Unfortunately, we can only *assume* these two facts without more knowledge of the *sampling methodology*\index{sampling methodology} used by the researchers.
Since we are viewing these $n$ = `r n_evals_ch5` courses as a sample, we can view our fitted slope $b_1$ = `r b1` as a *point estimate* of the *population slope* $\beta_1$. In other words, $\beta_1$ quantifies the relationship between teaching `score` and "beauty" average `bty_avg` for *all* instructors at UT Austin. Similarly, we can view our fitted intercept $b_0$ = `r b0` as a *point estimate* of the *population intercept* $\beta_0$ for *all* instructors at UT Austin.
Putting these two ideas together, we can view the equation of the fitted line $\widehat{y}$ = $b_0 + b_1 \cdot x$ = $`r b0_trailing` + `r b1` \cdot \text{bty}\_\text{avg}$ as an estimate of some true and unknown *population line* $y = \beta_0 + \beta_1 \cdot x$. Thus we can draw parallels between our teaching evaluations analysis and all the sampling scenarios we've seen previously. In this chapter, we'll focus on the final scenario of regression slopes as shown in Table \@ref(tab:summarytable-ch11).
```{r summarytable-ch11, echo=FALSE, message=FALSE, purl=FALSE}
# The following Google Doc is published to CSV and loaded using read_csv():
# https://docs.google.com/spreadsheets/d/1QkOpnBGqOXGyJjwqx1T2O5G5D72wWGfWlPyufOgtkk4/edit#gid=0
if (!file.exists("rds/sampling_scenarios.rds")) {
sampling_scenarios <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vRd6bBgNwM3z-AJ7o4gZOiPAdPfbTp_V15HVHRmOH5Fc9w62yaG-fEKtjNUD2wOSa5IJkrDMaEBjRnA/pub?gid=0&single=true&output=csv" %>%
read_csv(na = "") %>%
slice(1:5)
write_rds(sampling_scenarios, "rds/sampling_scenarios.rds")
} else {
sampling_scenarios <- read_rds("rds/sampling_scenarios.rds")
}
sampling_scenarios %>%
# filter(Scenario %in% 1:5) %>%
kable(
caption = "Scenarios of sampling for inference",
booktabs = TRUE,
escape = FALSE,
linesep = ""
) %>%
kable_styling(
font_size = ifelse(is_latex_output(), 10, 16),
latex_options = c("hold_position")
) %>%
column_spec(1, width = "0.5in") %>%
column_spec(2, width = "0.7in") %>%
column_spec(3, width = "1in") %>%
column_spec(4, width = "1.1in") %>%
column_spec(5, width = "1in")
```
Since we are now viewing our fitted slope $b_1$ and fitted intercept $b_0$ as *point estimates* based on a *sample*, these estimates will again be subject to *sampling variability*. In other words, if we collected a new sample of data on a different set of $n$ = `r n_evals_ch5` courses and their instructors, the new fitted slope $b_1$ will likely differ from `r b1`. The same goes for the new fitted intercept $b_0$. But by how much will these estimates *vary*? This information is in the remaining columns of the regression table in Table \@ref(tab:regtable-11). Our knowledge of sampling from Chapter \@ref(sampling), confidence intervals from Chapter \@ref(confidence-intervals), and hypothesis tests from Chapter \@ref(hypothesis-testing) will help us interpret these remaining columns.
## Interpreting regression tables {#regression-interp}
We've so far focused only on the two leftmost columns of the regression table in Table \@ref(tab:regtable-11): `term` and `estimate`. Let's now shift our attention to the remaining columns: `std_error`, `statistic`, `p_value`, `lower_ci` and `upper_ci` in Table \@ref(tab:score-model-part-deux).
```{r score-model-part-deux, echo=FALSE, purl=FALSE}
get_regression_table(score_model) %>%
kable(
caption = "Previously seen regression table",
digits = 3,
booktabs = TRUE,
linesep = ""
) %>%
kable_styling(
font_size = ifelse(is_latex_output(), 10, 16),
latex_options = c("hold_position")
)
```
Given the lack of practical interpretation for the fitted intercept $b_0$, in this section we'll focus only on the second row of the table corresponding to the fitted slope $b_1$. We'll first interpret the `std_error`, `statistic`, `p_value`, `lower_ci` and `upper_ci` columns. Afterwards in the upcoming Subsection \@ref(regression-table-computation), we'll discuss how R computes these values.
### Standard error {#regression-se}
The third column of the regression table in Table \@ref(tab:regtable-11) `std_error` corresponds to the *standard error* of our estimates. Recall the definition of **standard error** \index{standard error} we saw in Subsection \@ref(sampling-definitions):
> The *standard error* is the standard deviation of any point estimate computed from a sample.
So what does this mean in terms of the fitted slope $b_1$ = `r b1`? This value is just one possible value of the fitted slope resulting from *this particular sample* of $n$ = `r n_evals_ch5` pairs of teaching and beauty scores. However, if we collected a different sample of $n$ = `r n_evals_ch5` pairs of teaching and beauty scores, we will almost certainly obtain a different fitted slope $b_1$. This is due to *sampling variability*.
```{r echo=FALSE, purl=FALSE}
# This code is used for dynamic non-static in-line text output purposes
n_reps <- 1000L
```
Say we hypothetically collected `r n_reps` such samples of pairs of teaching and beauty scores, computed the `r n_reps` resulting values of the fitted slope $b_1$, and visualized them in a histogram. This would be a visualization of the *sampling distribution* of $b_1$, which we defined in Subsection \@ref(sampling-definitions). Further recall that the standard deviation of the *sampling distribution* of $b_1$ has a special name: the *standard error*.
Recall that we constructed three sampling distributions for the sample proportion $\widehat{p}$ using shovels of size 25, 50, and 100 in Figure \@ref(fig:comparing-sampling-distributions). We observed that as the sample size increased, the standard error decreased as evidenced by the narrowing sampling distribution.
The *standard error* of $b_1$ similarly quantifies how much variation in the fitted slope $b_1$ one would expect between different samples. So in our case, we can expect about `r se1` units of variation in the `bty_avg` slope variable. Recall that the `estimate` and `std_error` values play a key role in *inferring* the value of the unknown population slope $\beta_1$ relating to *all* instructors.
In Section \@ref(infer-regression), we'll perform a simulation using the `infer` package to construct the bootstrap distribution for $b_1$ in this case. Recall from Subsection \@ref(bootstrap-vs-sampling) that the bootstrap distribution is an *approximation* to the sampling distribution in that they have a similar shape. Since they have a similar shape, they have similar *standard errors*. However, unlike the sampling distribution, the bootstrap distribution is constructed from a *single* sample, which is a practice more aligned with what's done in real life.
### Test statistic {#regression-test-statistic}
The fourth column of the regression table in Table \@ref(tab:regtable-11) `statistic` corresponds to a *test statistic* relating to the following *hypothesis test*:
$$
\begin{aligned}
H_0 &: \beta_1 = 0\\
\text{vs } H_A&: \beta_1 \neq 0.
\end{aligned}
$$
Recall our terminology, notation, and definitions related to hypothesis tests we introduced in Section \@ref(understanding-ht).
> A *hypothesis test* consists of a test between two competing hypotheses: (1) a *null hypothesis* $H_0$ versus (2) an *alternative hypothesis* $H_A$.
>
> A *test statistic* is a point estimate/sample statistic formula used for hypothesis testing.
Here, our *null hypothesis* $H_0$ assumes that the population slope $\beta_1$ is 0. If the population slope $\beta_1$ is truly 0, then this is saying that there is *no true relationship* between teaching and "beauty" scores for *all* the instructors in our population. In other words, $x$ = "beauty" score would have no associated effect on $y$ = teaching score.
The *alternative hypothesis* $H_A$, on the other hand, assumes that the population slope $\beta_1$ is not 0, meaning it could be either positive or negative. This suggests either a positive or negative relationship between teaching and "beauty" scores. Recall we called such alternative hypotheses *two-sided*. By convention, all hypothesis testing for regression assumes two-sided alternatives.
Recall our "hypothesized universe" of no gender discrimination we *assumed* in our `promotions` activity in Section \@ref(ht-activity). Similarly here when conducting this hypothesis test, we'll assume a "hypothesized universe" where there is no relationship between teaching and "beauty" scores. In other words, we'll assume the null hypothesis $H_0: \beta_1 = 0$ is true.
The `statistic` column in the regression table is a tricky one, however. It corresponds to a standardized *t-test statistic*, much like the *two-sample $t$ statistic* we saw in Subsection \@ref(theory-hypo) where we used a theory-based method for conducting hypothesis tests. In both these cases, the *null distribution* can be mathematically proven to be a *$t$-distribution*. Since such test statistics are tricky for individuals new to statistical inference to study, we'll skip this and jump into interpreting the $p$-value. If you're curious, we have included a discussion of this standardized *t-test statistic* in Subsection \@ref(theory-regression).
### p-value
The fifth column of the regression table in Table \@ref(tab:regtable-11) `p_value` corresponds to the *p-value* of the hypothesis test $H_0: \beta_1 = 0$ versus $H_A: \beta_1 \neq 0$.
Again recalling our terminology, notation, and definitions related to hypothesis tests we introduced in Section \@ref(understanding-ht), let's focus on the definition of the $p$-value:
> A *p-value* is the probability of obtaining a test statistic just as extreme or more extreme than the observed test statistic *assuming the null hypothesis $H_0$ is true*.
Recall that you can intuitively think of the $p$-value as quantifying how "extreme" the observed fitted slope of $b_1$ = `r b1` is in a "hypothesized universe" where there is no relationship between teaching and "beauty" scores.
Following the hypothesis testing procedure we outlined in Section \@ref(ht-interpretation), since the $p$-value in this case is 0, for any choice of significance level $\alpha$ we would reject $H_0$ in favor of $H_A$. Using non-statistical language, this is saying: we reject the hypothesis that there is no relationship between teaching and "beauty" scores in favor of the hypothesis that there is. That is to say, the evidence suggests there is a significant relationship, one that is positive.
More precisely, however, the $p$-value corresponds to how extreme the observed test statistic of `r t1` is when compared to the appropriate *null distribution*. In Section \@ref(infer-regression), we'll perform a simulation using the `infer` package to construct the null distribution in this case.
An extra caveat here is that the results of this hypothesis test are only valid if certain "conditions for inference for regression" are met, which we'll introduce shortly in Section \@ref(regression-conditions).
### Confidence interval
The two rightmost columns of the regression table in Table \@ref(tab:regtable-11) (`lower_ci` and `upper_ci`) correspond to the endpoints of the 95% *confidence interval* for the population slope $\beta_1$. Recall our analogy of "nets are to fish" what "confidence intervals are to population parameters" from Section \@ref(ci-build-up). The resulting 95% confidence interval for $\beta_1$ of (`r lower1`, `r upper1`) can be thought of as a range of plausible values for the population slope $\beta_1$ of the linear relationship between teaching and "beauty" scores.
As we introduced in Subsection \@ref(shorthand) on the precise and shorthand interpretation of confidence intervals, the statistically precise interpretation of this confidence interval is: "if we repeated this sampling procedure a large number of times, we expect about 95% of the resulting confidence intervals to capture the value of the population slope $\beta_1$." However, we'll summarize this using our shorthand interpretation that "we're 95% 'confident' that the true population slope $\beta_1$ lies between `r lower1` and `r upper1`."
Notice in this case that the resulting 95% confidence interval for $\beta_1$ of $(`r lower1`, \, `r upper1`)$ does not contain a very particular value: $\beta_1$ equals 0. Recall we mentioned that if the population regression slope $\beta_1$ is 0, this is equivalent to saying there is *no* relationship between teaching and "beauty" scores. Since $\beta_1$ = 0 is not in our plausible range of values for $\beta_1$, we are inclined to believe that there, in fact, *is* a relationship between teaching and "beauty" scores and a positive one at that. So in this case, the conclusion about the population slope $\beta_1$ from the 95% confidence interval matches the conclusion from the hypothesis test: evidence suggests that there is a meaningful relationship between teaching and "beauty" scores.
Recall from Subsection \@ref(ci-width), however, that the *confidence level* is one of many factors that determine confidence interval widths. So for example, say we used a higher confidence level of 99% instead of 95%. The resulting confidence interval for $\beta_1$ would be wider and thus might now include 0. The lesson to remember here is that any confidence-interval-based conclusion depends highly on the confidence level used.
What are the calculations that went into computing the two endpoints of the 95% confidence interval for $\beta_1$?
Recall our sampling bowl example from Subsection \@ref(theory-ci) discussing `lower_ci` and `upper_ci`. Since the sampling and bootstrap distributions of the sample proportion $\widehat{p}$ were roughly normal, we could use the rule of thumb for bell-shaped distributions from Appendix \@ref(appendix-normal-curve) to create a 95% confidence interval for $p$ with the following equation:
```{r echo=FALSE, purl=FALSE}
# This code is used for dynamic non-static in-line text output purposes
z_star <- qnorm(0.975) %>% round(2)
```
$$\widehat{p} \pm \text{MoE}_{\widehat{p}} = \widehat{p} \pm `r z_star` \cdot \text{SE}_{\widehat{p}} = \widehat{p} \pm `r z_star` \cdot \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}$$
We can generalize this to other point estimates that have roughly normally shaped sampling and/or bootstrap distributions:
$$\text{point estimate} \pm \text{MoE} = \text{point estimate} \pm `r z_star` \cdot \text{SE}.$$
We'll show in Section \@ref(infer-regression) that the sampling/bootstrap distribution for the fitted slope $b_1$ is in fact bell-shaped as well. Thus we can construct a 95% confidence interval for $\beta_1$ with the following equation:
$$b_1 \pm \text{MoE}_{b_1} = b_1 \pm `r z_star` \cdot \text{SE}_{b_1}.$$
What is the value of the standard error $\text{SE}_{b_1}$? It is in fact in the third column of the regression table in Table \@ref(tab:regtable-11): `r se1`. Thus
$$
\begin{aligned}
b_1 \pm `r z_star` \cdot \text{SE}_{b_1} &= `r b1` \pm `r z_star` \cdot `r se1` = `r b1` \pm `r (z_star * se1)`\\
&= (`r (b1 - z_star*se1) %>% round(3)`, `r (b1 + z_star*se1) %>% round(3)`)
\end{aligned}
$$
This closely matches the $(`r lower1`, `r upper1`)$ confidence interval in the last two columns of Table \@ref(tab:regtable-11).
Much like hypothesis tests, however, the results of this confidence interval also are only valid if the "conditions for inference for regression" to be discussed in Section \@ref(regression-conditions) are met.
### How does R compute the table? {#regression-table-computation}
Since we didn't perform the simulation to get the values of the standard error, test statistic, $p$-value, and endpoints of the 95% confidence interval in Table \@ref(tab:regtable-11), you might be wondering how were these values computed. What did R do behind the scenes? Does R run simulations like we did using the `infer` package in Chapters \@ref(confidence-intervals) and \@ref(hypothesis-testing) on confidence intervals and hypothesis testing?
The answer is no! Much like the theory-based method for constructing confidence intervals you saw in Subsection \@ref(theory-ci) and the theory-based hypothesis test you saw in Subsection \@ref(theory-hypo), there exist mathematical formulas that allow you to construct confidence intervals and conduct hypothesis tests for inference for regression. These formulas were derived in a time when computers didn't exist, so it would've been impossible to run the extensive computer simulations we have in this book. We present these formulas in Subsection \@ref(theory-regression) on "theory-based inference for regression."
In Section \@ref(infer-regression), we'll go over a simulation-based approach to constructing confidence intervals and conducting hypothesis tests using the `infer` package. In particular, we'll convince you that the bootstrap distribution of the fitted slope $b_1$ is indeed bell-shaped.
## Conditions for inference for regression {#regression-conditions}
Recall in Subsection \@ref(se-method) we stated that we could only use the standard-error-based method for constructing confidence intervals if the bootstrap distribution was bell shaped. Similarly, there are certain conditions that need to be met in order for the results of our hypothesis tests and confidence intervals we described in Section \@ref(regression-interp) to have valid meaning. These conditions must be met for the assumed underlying mathematical and probability theory to hold true.
For inference for regression, there are four conditions that need to be met. Note the first four letters of these conditions are highlighted in bold in what follows: **LINE**. This can serve as a nice reminder of what to check for whenever you perform linear regression. \index{regression!conditions for inference (LINE)}
1. **L**inearity of relationship between variables
1. **I**ndependence of the residuals
1. **N**ormality of the residuals
1. **E**quality of variance of the residuals
Conditions **L**, **N**, and **E** can be verified through what is known as a *residual analysis*.\index{residual analysis} Condition **I** can only be verified through an understanding of how the data was collected.
In this section, we'll go over a refresher on residuals, verify whether each of the four **LINE** conditions hold true, and then discuss the implications.
### Residuals refresher
Recall our definition of a residual from Subsection \@ref(model1points): it is the *observed value* minus the *fitted value* denoted by $y - \widehat{y}$. Recall that residuals can be thought of as the error or the "lack-of-fit" between the observed value $y$ and the fitted value $\widehat{y}$ on the regression line in Figure \@ref(fig:regline). In Figure \@ref(fig:residual-example), we illustrate one particular residual out of `r n_evals_ch5` using an arrow, as well as its corresponding observed and fitted values using a circle and a square, respectively.
```{r residual-example, echo=FALSE, fig.cap="Example of observed value, fitted value, and residual.", purl=FALSE, message=FALSE}
# Pick out one particular point to drill down on
index <- which(evals_ch5$bty_avg == 7.333 & evals_ch5$score == 4.9)
target_point <- score_model %>%
get_regression_points() %>%
slice(index)
x <- target_point$bty_avg
y <- target_point$score
y_hat <- target_point$score_hat
resid <- target_point$residual
# Plot residual
best_fit_plot <- ggplot(evals_ch5, aes(x = bty_avg, y = score)) +
geom_point() +
labs(
x = "Beauty Score", y = "Teaching Score",
title = "Relationship of teaching and beauty scores"
) +
geom_smooth(method = "lm", se = FALSE) +
annotate("point", x = x, y = y, col = "red", size = 4) +
annotate("point", x = x, y = y_hat, col = "red", shape = 15, size = 4) +
annotate("segment",
x = x, xend = x, y = y, yend = y_hat, color = "blue",
arrow = arrow(type = "closed", length = unit(0.02, "npc"))
)
best_fit_plot
```
Furthermore, we can automate the calculation of all $n$ = `r n_evals_ch5` residuals by applying the `get_regression_points()` function to our saved regression model in `score_model`. Observe how the resulting values of `residual` are roughly equal to `score - score_hat` (there is potentially a slight difference due to rounding error).
```{r}
# Fit regression model:
score_model <- lm(score ~ bty_avg, data = evals_ch5)
# Get regression points:
regression_points <- get_regression_points(score_model)
regression_points
```
A *residual analysis* is used to verify conditions **L**, **N**, and **E** and can be performed using appropriate data visualizations. While there are more sophisticated statistical approaches that can also be done, we'll focus on the much simpler approach of looking at plots.
### Linearity of relationship
The first condition is that the relationship between the outcome variable $y$ and the explanatory variable $x$ must be **L**inear. Recall the scatterplot in Figure \@ref(fig:regline) where we had the explanatory variable $x$ as "beauty" score and the outcome variable $y$ as teaching score. Would you say that the relationship between $x$ and $y$ is linear? It's hard to say because of the scatter of the points about the line. In the authors' opinions, we feel this relationship is "linear enough."
Let's present an example where the relationship between $x$ and $y$ is clearly not linear in Figure \@ref(fig:non-linear). In this case, the points clearly do not form a line, but rather a U-shaped polynomial curve. In this case, any results from an inference for regression would not be valid.
```{r non-linear, echo=FALSE, fig.cap="Example of a clearly non-linear relationship.", fig.height=3.3, message=FALSE, purl=FALSE}
set.seed(76)
evals_ch5 %>%
mutate(
x = bty_avg,
y = (x - 3) * (x - 6) + rnorm(n(), 0, 0.75)
) %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
labs(x = "Beauty Score", y = "Teaching Score") +
geom_smooth(method = "lm", se = FALSE) +
expand_limits(y = 10)
```
### Independence of residuals
The second condition is that the residuals must be **I**ndependent. In other words, the different observations in our data must be independent of one another.
For our UT Austin data, while there is data on `r n_evals_ch5` courses, these `r n_evals_ch5` courses were actually taught by `r evals %>% select(prof_ID) %>% n_distinct()` unique instructors. In other words, the same professor is often included more than once in our data. The original `evals` data frame that we used to construct the `evals_ch5` data frame has a variable `prof_ID`, which is an anonymized identification variable for the professor:
```{r}
evals %>%
select(ID, prof_ID, score, bty_avg)
```
For example, the professor with `prof_ID` equal to 1 taught the first 4 courses in the data, the professor with `prof_ID` equal to 2 taught the next 3, and so on. Given that the same professor taught these first four courses, it is reasonable to expect that these four teaching scores are related to each other. If a professor gets a high `score` in one class, chances are fairly good they'll get a high `score` in another. This dataset thus provides different information than if we had `r n_evals_ch5` unique instructors teaching the `r n_evals_ch5` courses.
In this case, we say there exists *dependence* between observations. The first four courses taught by professor 1 are dependent, the next 3 courses taught by professor 2 are related, and so on. Any proper analysis of this data needs to take into account that we have *repeated measures* for the same profs.
So in this case, the independence condition is not met. What does this mean for our analysis? We'll address this in Subsection \@ref(what-is-the-conclusion) coming up, after we check the remaining two conditions.
### Normality of residuals
The third condition is that the residuals should follow a **N**ormal distribution. Furthermore, the center of this distribution should be 0. In other words, sometimes the regression model will make positive errors: $y - \widehat{y} > 0$. Other times, the regression model will make equally negative errors: $y - \widehat{y} < 0$. However, *on average* the errors should equal 0 and their shape should be similar to that of a bell.
The simplest way to check the normality of the residuals is to look at a histogram, which we visualize in Figure \@ref(fig:model1residualshist).
```{r, eval=FALSE}
ggplot(regression_points, aes(x = residual)) +
geom_histogram(binwidth = 0.25, color = "white") +
labs(x = "Residual")
```
```{r model1residualshist, echo=FALSE, fig.cap="Histogram of residuals.", purl=FALSE}
ggplot(regression_points, aes(x = residual)) +
geom_histogram(binwidth = 0.25, color = "white") +
labs(x = "Residual")
```
This histogram shows that we have more positive residuals than negative. Since the residual $y-\widehat{y}$ is positive when $y > \widehat{y}$, it seems our regression model's fitted teaching scores $\widehat{y}$ tend to *underestimate* the true teaching scores $y$. Furthermore, this histogram has a slight *left-skew*\index{skew} in that there is a tail on the left. This is another way to say the residuals exhibit a *negative skew*.
Is this a problem? Again, there is a certain amount of subjectivity in the response. In the authors' opinion, while there is a slight skew to the residuals, we feel it isn't drastic. On the other hand, others might disagree with our assessment.
Let's present examples where the residuals clearly do and don't follow a normal distribution in Figure \@ref(fig:normal-residuals). In this case of the model yielding the clearly non-normal residuals on the right, any results from an inference for regression would not be valid.
```{r normal-residuals, echo=FALSE, fig.cap="Example of clearly normal and clearly not normal residuals.", purl=FALSE}
sigma <- sd(regression_points$residual)
normal_and_not_examples <- evals_ch5 %>%
mutate(
`Clearly normal` = rnorm(n = n(), 0, sd = sigma),
`Clearly not normal` = rnorm(n = n(), mean = 0, sd = sigma)^2,
`Clearly not normal` = `Clearly not normal` - mean(`Clearly not normal`)
) %>%
select(bty_avg, `Clearly normal`, `Clearly not normal`) %>%
gather(type, eps, -bty_avg) %>%
ggplot(aes(x = eps)) +
geom_histogram(binwidth = 0.25, color = "white") +
labs(x = "Residual") +
facet_wrap(~type, scales = "free")
if (is_latex_output()) {
normal_and_not_examples +
theme(
strip.text = element_text(colour = "black"),
strip.background = element_rect(fill = "grey93")
)
} else {
normal_and_not_examples
}
```
### Equality of variance
The fourth and final condition is that the residuals should exhibit **E**qual variance across all values of the explanatory variable $x$. In other words, the value and spread of the residuals should not depend on the value of the explanatory variable $x$.
Recall the scatterplot in Figure \@ref(fig:regline): we had the explanatory variable $x$ of "beauty" score on the x-axis and the outcome variable $y$ of teaching score on the y-axis. Instead, let's create a scatterplot that has the same values on the x-axis, but now with the residual $y-\widehat{y}$ on the y-axis as seen in Figure \@ref(fig:numxplot6).
```{r, eval=FALSE}
ggplot(regression_points, aes(x = bty_avg, y = residual)) +
geom_point() +
labs(x = "Beauty Score", y = "Residual") +
geom_hline(yintercept = 0, col = "blue", size = 1)
```
```{r numxplot6, echo=FALSE, fig.cap="Plot of residuals over beauty score.", purl=FALSE}
ggplot(regression_points, aes(x = bty_avg, y = residual)) +
geom_point() +
labs(x = "Beauty Score", y = "Residual") +
geom_hline(yintercept = 0, col = "blue", size = 1)
```
You can think of Figure \@ref(fig:numxplot6) as a modified version of the plot with the regression line in Figure \@ref(fig:regline), but with the regression line flattened out to $y=0$. Looking at this plot, would you say that the spread of the residuals around the line at $y=0$ is constant across all values of the explanatory variable $x$ of "beauty" score? This question is rather qualitative and subjective in nature, thus different people may respond with different answers. For example, some people might say that there is slightly more variation in the residuals for smaller values of $x$ than for higher ones. However, it can be argued that there isn't a *drastic* non-constancy.
In Figure \@ref(fig:equal-variance-residuals) let's present an example where the residuals clearly do not have equal variance across all values of the explanatory variable $x$.
```{r equal-variance-residuals, echo=FALSE, fig.cap="Example of clearly non-equal variance.", purl=FALSE}
evals_ch5 %>%
mutate(eps = (rnorm(n(), 0, 0.075 * bty_avg^2)) * 0.4) %>%
ggplot(aes(x = bty_avg, y = eps)) +
geom_point() +
labs(x = "Beauty Score", y = "Residual") +
geom_hline(yintercept = 0, col = "blue", size = 1)
```
Observe how the spread of the residuals increases as the value of $x$ increases. This is a situation known as \index{heteroskedasticity} *heteroskedasticity*. Any inference for regression based on a model yielding such a pattern in the residuals would not be valid.
### What's the conclusion? {#what-is-the-conclusion}
Let's list our four conditions for inference for regression again and indicate whether or not they were satisfied in our analysis:
1. **L**inearity of relationship between variables: Yes
1. **I**ndependence of residuals: No
1. **N**ormality of residuals: Somewhat
1. **E**quality of variance: Yes
So what does this mean for the results of our confidence intervals and hypothesis tests in Section \@ref(regression-interp)?
First, the **I**ndependence condition. The fact that there exist dependencies between different rows in `evals_ch5` must be addressed. In more advanced statistics courses, you'll learn how to incorporate such dependencies into your regression models. One such technique is called *hierarchical/multilevel modeling*.
Second, when conditions **L**, **N**, **E** are not met, it often means there is a shortcoming in our model. For example, it may be the case that using only a single explanatory variable is insufficient, as we did with "beauty" score. We may need to incorporate more explanatory variables in a multiple regression model as we did in Chapter \@ref(multiple-regression).
In our case, the best we can do is view the results suggested by our confidence intervals and hypothesis tests as preliminary. While a preliminary analysis suggests that there is a significant relationship between teaching and "beauty" scores, further investigation is warranted; in particular, by improving the preliminary `score ~ bty_avg` model so that the four conditions are met. When the four conditions are roughly met, then we can put more faith into our confidence intervals and $p$-values.
The conditions for inference in regression problems are a key part of regression analysis that are of vital importance to the processes of constructing confidence intervals and conducting hypothesis tests. However, it is often the case with regression analysis in the real world that not all the conditions are completely met. Furthermore, as you saw, there is a level of subjectivity in the residual analyses to verify the **L**, **N**, and **E** conditions. So what can you do? We as authors advocate for transparency in communicating all results. This lets the stakeholders of any analysis know about a model's shortcomings or whether the model is "good enough." So while this checking of assumptions has lead to some fuzzy "it depends" results, we decided as authors to show you these scenarios to help prepare you for difficult statistical decisions you may need to make down the road.
```{block, type="learncheck", purl=FALSE}
\vspace{-0.15in}
**_Learning check_**
\vspace{-0.1in}
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Continuing with our regression using `age` as the explanatory variable and teaching `score` as the outcome variable.
- Use the `get_regression_points()` function to get the observed values, fitted values, and residuals for all `r n_evals_ch5` instructors.
- Perform a residual analysis and look for any systematic patterns in the residuals. Ideally, there should be little to no pattern but comment on what you find here.
```{block, type="learncheck", purl=FALSE}
\vspace{-0.25in}
\vspace{-0.25in}
```
## Simulation-based inference for regression {#infer-regression}
Recall in Subsection \@ref(regression-table-computation) when we interpreted the third through seventh columns of a regression table, we stated that R doesn't do simulations to compute these values. Rather R uses theory-based methods that involve mathematical formulas.
In this section, we'll use the simulation-based methods you previously learned in Chapters \@ref(confidence-intervals) and \@ref(hypothesis-testing) to recreate the values in the regression table in Table \@ref(tab:regtable-11). In particular, we'll use the `infer` package workflow to
* Construct a 95% confidence interval for the population slope $\beta_1$ using bootstrap resampling with replacement. We did this previously in Sections \@ref(bootstrap-process) with the `pennies` data and \@ref(case-study-two-prop-ci) with the `mythbusters_yawn` data.
* Conduct a hypothesis test of $H_0: \beta_1 = 0$ versus $H_A: \beta_1 \neq 0$ using a permutation test. We did this previously in Sections \@ref(ht-infer) with the `promotions` data and \@ref(ht-case-study) with the `movies_sample` IMDb data.
### Confidence interval for slope
We'll construct a 95% confidence interval for $\beta_1$ using the `infer` workflow outlined in Subsection \@ref(infer-workflow). Specifically, we'll first construct the bootstrap distribution for the fitted slope $b_1$ using our single sample of `r n_evals_ch5` courses:
1. `specify()` the variables of interest in `evals_ch5` with the formula: `score ~ bty_avg`.
1. `generate()` replicates by using `bootstrap` resampling with replacement from the original sample of `r n_evals_ch5` courses. We generate ``reps = `r n_reps` `` replicates using `type = "bootstrap"`.
1. `calculate()` the summary statistic of interest: the fitted `slope` $b_1$.
Using this bootstrap distribution, we'll construct the 95% confidence interval using the percentile method and (if appropriate) the standard error method as well. It is important to note in this case that the bootstrapping with replacement is done *row-by-row*. Thus, the original pairs of `score` and `bty_avg` values are always kept together, but different pairs of `score` and `bty_avg` values may be resampled multiple times. The resulting confidence interval will denote a range of plausible values for the unknown population slope $\beta_1$ quantifying the relationship between teaching and "beauty" scores for *all* professors at UT Austin.
Let's first construct the bootstrap distribution for the fitted slope $b_1$:
```{r eval=FALSE}
bootstrap_distn_slope <- evals_ch5 %>%
specify(formula = score ~ bty_avg) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "slope")
bootstrap_distn_slope
```
```{r echo=FALSE, purl=FALSE}
if (!file.exists("rds/bootstrap_distn_slope.rds")) {
set.seed(76)
bootstrap_distn_slope <- evals %>%
specify(score ~ bty_avg) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "slope")
saveRDS(
object = bootstrap_distn_slope,
"rds/bootstrap_distn_slope.rds"
)
} else {
bootstrap_distn_slope <- readRDS("rds/bootstrap_distn_slope.rds")
}
bootstrap_distn_slope
```
Observe how we have `r n_reps` values of the bootstrapped slope $b_1$ in the `stat` column. Let's visualize the `r n_reps` bootstrapped values in Figure \@ref(fig:bootstrap-distribution-slope).
```{r bootstrap-distribution-slope, fig.show="hold", fig.cap="Bootstrap distribution of slope.", fig.height=2.2}
visualize(bootstrap_distn_slope)
```
Observe how the bootstrap distribution is roughly bell-shaped. Recall from Subsection \@ref(bootstrap-vs-sampling) that the shape of the bootstrap distribution of $b_1$ closely approximates the shape of the sampling distribution of $b_1$.
#### Percentile-method {-}
First, let's compute the 95% confidence interval for $\beta_1$ using the percentile method. We'll do so by identifying the 2.5th and 97.5th percentiles which include the middle 95% of values. Recall that this method does not require the bootstrap distribution to be normally shaped.
```{r}
percentile_ci <- bootstrap_distn_slope %>%
get_confidence_interval(type = "percentile", level = 0.95)
percentile_ci
```
The resulting percentile-based 95% confidence interval for $\beta_1$ of (`r percentile_ci[[1]]`, `r percentile_ci[[2]]`) is similar to the confidence interval in the regression Table \@ref(tab:regtable-11) of (`r lower1`, `r upper1`).
#### Standard error method {-}
Since the bootstrap distribution in Figure \@ref(fig:bootstrap-distribution-slope) appears to be roughly bell-shaped, we can also construct a 95% confidence interval for $\beta_1$ using the standard error method.
In order to do this, we need to first compute the fitted slope $b_1$, which will act as the center of our standard error-based confidence interval. While we saw in the regression table in Table \@ref(tab:regtable-11) that this was $b_1$ = `r b1`, we can also use the `infer` pipeline with the `generate()` step removed to calculate it:
```{r}
observed_slope <- evals %>%
specify(score ~ bty_avg) %>%
calculate(stat = "slope")
observed_slope
```
We then use the `get_ci()` function with `level = 0.95` to compute the 95% confidence interval for $\beta_1$. Note that setting the `point_estimate` argument to the `observed_slope` of `r b1` sets the center of the confidence interval.
```{r}
se_ci <- bootstrap_distn_slope %>%
get_ci(level = 0.95, type = "se", point_estimate = observed_slope)
se_ci
```
The resulting standard error-based 95% confidence interval for $\beta_1$ of $(`r se_ci[["lower_ci"]]`, `r se_ci[["upper_ci"]]`)$ is slightly different than the confidence interval in the regression Table \@ref(tab:regtable-11) of $(`r lower1`, `r upper1`)$.
#### Comparing all three {-}
Let's compare all three confidence intervals in Figure \@ref(fig:bootstrap-distribution-slope-CI), where the percentile-based confidence interval is marked with solid lines, the standard error based confidence interval is marked with dashed lines, and the theory-based confidence interval (`r lower1`, `r upper1`) from the regression table in Table \@ref(tab:regtable-11) is marked with dotted lines.
```{r, eval=FALSE}
visualize(bootstrap_distn_slope) +
shade_confidence_interval(endpoints = percentile_ci, fill = NULL,
linetype = "solid", color = "grey90") +
shade_confidence_interval(endpoints = se_ci, fill = NULL,
linetype = "dashed", color = "grey60") +
shade_confidence_interval(endpoints = c(0.035, 0.099), fill = NULL,
linetype = "dotted", color = "black")
```
```{r bootstrap-distribution-slope-CI, echo=FALSE, fig.show="hold", fig.cap="Comparing three confidence intervals for the slope.", purl=FALSE, fig.height=2.5}
# Will need to make a tweak to the {infer} package so that it doesn't always display "Null" here
# (added to `develop` branch on 2019-10-26)
visualize(bootstrap_distn_slope) +
# ggtitle("") +
shade_confidence_interval(endpoints = percentile_ci, fill = NULL, linetype = "solid", color = "grey90") +
shade_confidence_interval(endpoints = se_ci, fill = NULL, linetype = "dashed", color = "grey60") +
shade_confidence_interval(endpoints = c(0.035, 0.099), fill = NULL, linetype = "dotted", color = "black")
```
Observe that all three are quite similar! Furthermore, none of the three confidence intervals for $\beta_1$ contain 0 and are entirely located above 0. This is suggesting that there is in fact a meaningful positive relationship between teaching and "beauty" scores.
### Hypothesis test for slope
Let's now conduct a hypothesis test of $H_0: \beta_1 = 0$ vs. $H_A: \beta_1 \neq 0$. We will use the `infer` package, which follows the hypothesis testing paradigm in the "There is only one test" diagram in Figure \@ref(fig:htdowney).
Let's first think about what it means for $\beta_1$ to be zero as assumed in the null hypothesis $H_0$. Recall we said if $\beta_1 = 0$, then this is saying there is no relationship between the teaching and "beauty" scores. Thus assuming this particular null hypothesis $H_0$ means that in our "hypothesized universe" there is no relationship between `score` and `bty_avg`. We can therefore shuffle/permute the `bty_avg` variable to no consequence.
We construct the null distribution of the fitted slope $b_1$ by performing the steps that follow. Recall from Section \@ref(understanding-ht) on terminology, notation, and definitions related to hypothesis testing where we defined the *null distribution*: the sampling distribution of our test statistic $b_1$ assuming the null hypothesis $H_0$ is true.
1. `specify()` the variables of interest in `evals_ch5` with the formula: `score ~ bty_avg`.
1. `hypothesize()` the null hypothesis of `independence`. Recall from Section \@ref(ht-infer) that this is an additional step that needs to be added for hypothesis testing.
1. `generate()` replicates by permuting/shuffling values from the original sample of `r n_evals_ch5` courses. We generate ``reps = `r n_reps` `` replicates using `type = "permute"` here.
1. `calculate()` the test statistic of interest: the fitted `slope` $b_1$.
In this case, we `permute` the values of `bty_avg` across the values of `score` `r n_reps` times. We can do this shuffling/permuting since we assumed a "hypothesized universe" of no relationship between these two variables. Then we `calculate` the `"slope"` coefficient for each of these `r n_reps` `generate`d samples.
```{r eval=FALSE}
null_distn_slope <- evals %>%
specify(score ~ bty_avg) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "slope")
```
```{r echo=FALSE, purl=FALSE}
if (!file.exists("rds/null_distn_slope.rds")) {
set.seed(76)
null_distn_slope <- evals %>%
specify(score ~ bty_avg) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "slope")
saveRDS(
object = null_distn_slope,
"rds/null_distn_slope.rds"
)
} else {
null_distn_slope <- readRDS("rds/null_distn_slope.rds")
}
```
Observe the resulting null distribution for the fitted slope $b_1$ in Figure \@ref(fig:null-distribution-slope).
```{r null-distribution-slope, echo=FALSE, fig.show="hold", fig.cap="Null distribution of slopes.", fig.height=2.5, purl=FALSE}
visualize(null_distn_slope)
```
Notice how it is centered at $b_1$ = 0. This is because in our hypothesized universe, there is no relationship between `score` and `bty_avg` and so $\beta_1 = 0$. Thus, the most typical fitted slope $b_1$ we observe across our simulations is 0. Observe, furthermore, how there is variation around this central value of 0.
Let's visualize the $p$-value in the null distribution by comparing it to the observed test statistic of $b_1$ = `r b1` in Figure \@ref(fig:p-value-slope). We'll do this by adding a `shade_p_value()` layer to the previous `visualize()` code.
```{r p-value-slope, echo=FALSE, fig.show="hold", fig.cap="Null distribution and $p$-value.", fig.height=3, purl=FALSE}
visualize(null_distn_slope) +
shade_p_value(obs_stat = observed_slope, direction = "both")
```
Since the observed fitted slope `r b1` falls far to the right of this null distribution and thus the shaded region doesn't overlap it, we'll have a $p$-value of 0. For completeness, however, let's compute the numerical value of the $p$-value anyways using the `get_p_value()` function. Recall that it takes the same inputs as the `shade_p_value()` function:
```{r}
null_distn_slope %>%
get_p_value(obs_stat = observed_slope, direction = "both")
```
This matches the $p$-value of 0 in the regression table in Table \@ref(tab:regtable-11). We therefore reject the null hypothesis $H_0: \beta_1 = 0$ in favor of the alternative hypothesis $H_A: \beta_1 \neq 0$. We thus have evidence that suggests there is a significant relationship between teaching and "beauty" scores for *all* instructors at UT Austin.
When the conditions for inference for regression are met and the null distribution has a bell shape, we are likely to see similar results between the simulation-based results we just demonstrated and the theory-based results shown in the regression table in Table \@ref(tab:regtable-11).
```{block lc9-5, type="learncheck", purl=FALSE}
\vspace{-0.15in}
**_Learning check_**
\vspace{-0.1in}
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Repeat the inference but this time for the correlation coefficient instead of the slope. Note the implementation of `stat = "correlation"` in the `calculate()` function of the `infer` package.
```{block, type="learncheck", purl=FALSE}
\vspace{-0.25in}
\vspace{-0.25in}
```
## Conclusion {#inference-conclusion}
<!--
v2 TODO: Consider adding
### Relating regression to other methods
To conclude this chapter, we'll be investigating how regression relates to two different statistical techniques. One of them was covered already in this book, the difference in sample means, and the other is new to the text but is related, ANOVA. We'll see how both can be represented in the regression framework. The hope is that this closing section helps you to tie together many of the concepts you've seen in the Data Modeling and Statistical Inference parts of this book.
#### Two sample difference in means
#### ANOVA
-->
### Theory-based inference for regression {#theory-regression}
Recall in Subsection \@ref(regression-table-computation) when we interpreted the regression table in Table \@ref(tab:regtable-11), we mentioned that R does not compute its values using simulation-based methods for constructing confidence intervals and conducting hypothesis tests as we did in Chapters \@ref(confidence-intervals) and \@ref(hypothesis-testing) using the `infer` package. Rather, R uses a theory-based approach using mathematical formulas, much like the theory-based confidence intervals you saw in Subsection \@ref(theory-ci) and the theory-based hypothesis tests you saw in Subsection \@ref(theory-hypo). These formulas were derived in a time when computers didn't exist, so it would've been incredibly labor intensive to run extensive simulations.
In particular, there is a formula for the *standard error* of the fitted slope $b_1$:
$$\text{SE}_{b_1} = \dfrac{\dfrac{s_y}{s_x} \cdot \sqrt{1-r^2}}{\sqrt{n-2}}$$
<!-- Source: https://stats.stackexchange.com/questions/342632/how-to-understand-se-of-regression-slope-equation -->
As with many formulas in statistics, there's a lot going on here, so let's first break down what each symbol represents. First $s_x$ and $s_y$ are the *sample standard deviations* of the explanatory variable `bty_avg` and the response variable `score`, respectively. Second, $r$ is the sample *correlation coefficient* between `score` and `bty_avg`. This was computed as `r cor(evals_ch5$score, evals_ch5$bty_avg) %>% round(3)` in Chapter \@ref(regression). Lastly, $n$ is the number of pairs of points in the `evals_ch5` data frame, here `r n_evals_ch5`.
To put this formula into words, the standard error of $b_1$ depends on the relationship between the variability of the response variable and the variability of the explanatory variable as measured in the $s_y / s_x$ term. Next, it looks into how the two variables relate to each other in the $\sqrt{1-r^2}$ term.
However, the most important observation to make in the previous formula is that there is an $n - 2$ in the denominator. In other words, as the sample size $n$ increases, the standard error $\text{SE}_{b_1}$ decreases. Just as we demonstrated in Subsection \@ref(moral-of-the-story) when we used shovels with $n$ = 25, 50, and 100 slots, the amount of sampling variation of the fitted slope $b_1$ will depend on the sample size $n$. In particular, as the sample size increases, both the sampling and bootstrap distributions narrow and the standard error $\text{SE}_{b_1}$ decreases. Hence, our estimates of $b_1$ for the true population slope $\beta_1$ get more and more *precise*.
R then uses this formula for the standard error of $b_1$ in the third column of the regression table and subsequently to construct 95% confidence intervals. But what about the hypothesis test? Much like with our theory-based hypothesis test in Subsection \@ref(theory-hypo), R uses the following *$t$-statistic* as the test statistic for hypothesis testing:
$$
t = \dfrac{ b_1 - \beta_1}{ \text{SE}_{b_1}}
$$
And since the null hypothesis $H_0: \beta_1 = 0$ is assumed during the hypothesis test, the $t$-statistic becomes
$$
t = \dfrac{ b_1 - 0}{ \text{SE}_{b_1}} = \dfrac{ b_1 }{ \text{SE}_{b_1}}
$$
What are the values of $b_1$ and $\text{SE}_{b_1}$? They are in the `estimate` and `std_error` column of the regression table in Table \@ref(tab:regtable-11). Thus the value of `r t1` in the table is computed as `r b1`/`r se1` = `r (b1/se1) %>% round(3)`. Note there is a difference due to some rounding error here.
Lastly, to compute the $p$-value, we need to compare the observed test statistic of `r t1` to the appropriate null distribution. Recall from Section \@ref(understanding-ht), that a null distribution is the sampling distribution of the test statistic *assuming the null hypothesis $H_0$ is true*. Much like in our theory-based hypothesis test in Subsection \@ref(theory-hypo), it can be mathematically proven that this distribution is a $t$-distribution with degrees of freedom equal to $df = n - 2 = `r n_evals_ch5` - 2 = `r n_evals_ch5 - 2`$.
Don't worry if you're feeling a little overwhelmed at this point. There is a lot of background theory to understand before you can fully make sense of the equations for theory-based methods. That being said, theory-based methods and simulation-based methods for constructing confidence intervals and conducting hypothesis tests often yield consistent results. As mentioned before, in our opinion, two large benefits of simulation-based methods over theory-based are that (1) they are easier for people new to statistical inference to understand, and (2) they also work in situations where theory-based methods and mathematical formulas don't exist.
### Summary of statistical inference
We've finished the last two scenarios from the "Scenarios of sampling for inference" table in Subsection \@ref(sampling-conclusion-table), which we re-display in Table \@ref(tab:table-ch11).
```{r table-ch11, echo=FALSE, message=FALSE, purl=FALSE}
# The following Google Doc is published to CSV and loaded using read_csv():
# https://docs.google.com/spreadsheets/d/1QkOpnBGqOXGyJjwqx1T2O5G5D72wWGfWlPyufOgtkk4/edit#gid=0
if (!file.exists("rds/sampling_scenarios.rds")) {
sampling_scenarios <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vRd6bBgNwM3z-AJ7o4gZOiPAdPfbTp_V15HVHRmOH5Fc9w62yaG-fEKtjNUD2wOSa5IJkrDMaEBjRnA/pub?gid=0&single=true&output=csv" %>%
read_csv(na = "") %>%
slice(1:5)
write_rds(sampling_scenarios, "rds/sampling_scenarios.rds")
} else {
sampling_scenarios <- read_rds("rds/sampling_scenarios.rds")
}
sampling_scenarios %>%
filter(Scenario %in% 1:6) %>%
kable(
caption = "\\label{tab:summarytable-ch9}Scenarios of sampling for inference",
booktabs = TRUE,
escape = FALSE,
linesep = ""
) %>%
kable_styling(
font_size = ifelse(is_latex_output(), 10, 16),
latex_options = c("hold_position")
) %>%
column_spec(1, width = "0.5in") %>%
column_spec(2, width = "1.5in") %>%
column_spec(3, width = "0.65in") %>%
column_spec(4, width = "1.6in") %>%
column_spec(5, width = "0.65in")
```
Armed with the regression modeling techniques you learned in Chapters \@ref(regression) and \@ref(multiple-regression), your understanding of sampling for inference in Chapter \@ref(sampling), and the tools for statistical inference like confidence intervals and hypothesis tests in Chapters \@ref(confidence-intervals) and \@ref(hypothesis-testing), you're now equipped to study the significance of relationships between variables in a wide array of data! Many of the ideas presented here can be extended into multiple regression and other more advanced modeling techniques.
### Additional resources
```{r echo=FALSE, results="asis", purl=FALSE}
if (is_latex_output()) {
cat("Solutions to all *Learning checks* can be found online in [Appendix D](https://moderndive.com/D-appendixD.html).")
}
```
```{r echo=FALSE, purl=FALSE, results="asis"}
generate_r_file_link("10-inference-for-regression.R")
```
### What's to come
You've now concluded the last major part of the book on "Statistical Inference with `infer`." The closing Chapter \@ref(thinking-with-data) concludes this book with various short case studies involving real data, such as house prices in the city of Seattle, Washington in the US. You'll see how the principles in this book can help you become a great storyteller with data!