```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(digits = 7)
```
# Introduction to Statistical Inference
This chapter will provide you with a basic intuition for statistical inference. As marketing researchers, we are usually faced with "imperfect" data in the sense that we cannot collect **all** the data we would like. Imagine you are interested in the average amount of time WU students spend listening to music every month. Ideally, we could force all WU students to fill out our survey. Realistically, we will only be able to observe a small fraction of students (maybe 500 out of the $25,000+$). With the data from this small fraction at hand, we want to make an inference about the true average listening time of all WU students. We are going to start with the assumption that we know everything. That is, we first assume that we know all WU students' listening times and analyze the distribution of the listening time in the entire population. Subsequently, we are going to look at the uncertainty that is introduced by only knowing some of the students' listening times (i.e., a sample from the population) and how that influences our analysis.
[You can download the corresponding R code here](./Code/06-statistical_inference%20%282%29.R)
## If we knew it all
Assume there are $25,000$ students at WU and every single one has kindly provided us with the hours they listened to music in the past month. In this case, we know the true mean ($49.93$ hours) and the true standard deviation (SD = $10.02$) and thus we can easily summarize the entire distribution. Since the data follows a normal distribution, roughly 95% of the values lie within 2 standard deviations from the mean, as the following plot shows:
```{r, message = FALSE, warning=FALSE}
library(tidyverse)
library(ggplot2)
library(latex2exp)
set.seed(321)
hours <- rnorm(25000, 50, 10) # simulated population: 25,000 monthly listening times (mean 50, SD 10)
ggplot(data.frame(hours)) +
geom_histogram(aes(hours), bins = 50, fill = 'white', color = 'black') +
labs(title = "Histogram of listening times",
subtitle = TeX(sprintf("Population mean ($\\mu$) = %.2f; population standard deviation ($\\sigma$) = %.2f",round(mean(hours),2),round(sd(hours),2))),
y = 'Number of students',
x = 'Hours') +
theme_bw() +
geom_vline(xintercept = mean(hours), size = 1) +
geom_vline(xintercept = mean(hours)+2*sd(hours), colour = "red", size = 1) +
geom_vline(xintercept = mean(hours)-2*sd(hours), colour = "red", size = 1) +
geom_segment(aes(x = mean(hours), y = 1100, yend = 1100, xend = (mean(hours) - 2*sd(hours))), lineend = "butt", linejoin = "round",
size = 0.5, arrow = arrow(length = unit(0.2, "inches"))) +
geom_segment(aes(x = mean(hours), y = 1100, yend = 1100, xend = (mean(hours) + 2*sd(hours))), lineend = "butt", linejoin = "round",
size = 0.5, arrow = arrow(length = unit(0.2, "inches"))) +
annotate("text", x = mean(hours) + 28, y = 1100, label = "Mean + 2 * SD" )+
annotate("text", x = mean(hours) -28, y = 1100, label = "Mean - 2 * SD" )
```
In this case, we refer to all WU students as **the population**. In general, the population is the entire group we are interested in. This group does not necessarily have to consist of people; it could also be companies, stores, animals, etc. The parameters of the distribution of population values (in our case: "hours") are called population parameters. As already mentioned, we do not usually know population parameters but use inferential statistics to infer them based on our sample from the population, i.e., we compute statistics from a sample (e.g., the sample mean $\bar x$) to estimate population parameters (e.g., the population mean $\mu$). Here, we will use the following notation to refer to either the population parameters or the sample statistics:
Variable | Sample statistic | Population parameter
------------------------- | ------------------------- | -------------------------
Size | $n$ | $N$
Mean | $\bar{x} = {1 \over n}\sum_{i=1}^n x_i$ | $\mu = {1 \over N}\sum_{i=1}^N x_i$
Variance | $s^2 = {1 \over n-1}\sum_{i=1}^n (x_i-\bar{x})^2$ | $\sigma^2 = {1 \over N}\sum_{i=1}^N (x_i-\mu)^2$
Standard deviation | $s = \sqrt{s^2}$ | $\sigma = \sqrt{\sigma^2}$
Standard error | $SE_{\bar x} = {s \over \sqrt{n}}$ | $\sigma_{\bar x} = {\sigma \over \sqrt{n}}$
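To make the notation above concrete, here is a minimal sketch that computes the population parameters and the corresponding sample statistics for a simulated population (the names `population` and `x` are just illustrative):
```{r}
set.seed(123)
population <- rnorm(25000, 50, 10)          # simulated population of listening times
N <- length(population)
n <- 100
x <- sample(population, size = n, replace = FALSE)  # one random sample
mu <- mean(population)                      # population mean
sigma <- sqrt(sum((population - mu)^2) / N) # population SD (divides by N)
x_bar <- mean(x)                            # sample mean
s <- sd(x)                                  # sample SD (divides by n - 1)
se <- s / sqrt(n)                           # standard error of the mean
round(c(mu = mu, sigma = sigma, x_bar = x_bar, s = s, se = se), 2)
```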
### Sampling from a known population
In the first step towards a realistic research setting, let us take one sample from this population and calculate the mean listening time. We can simply sample the row numbers of students and then subset the ```hours``` vector with the sampled row numbers. The ```sample()``` function is used to draw a sample of size 100 from the population of 25,000 students, where each student can only be drawn once (i.e., ```replace = FALSE```). The following plot shows the distribution of listening times for our sample.
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.height = 4, fig.width = 6}
student_sample <- sample(1:25000, size = 100, replace = FALSE)
sample_1 <- hours[student_sample]
ggplot(data.frame(sample_1)) +
geom_histogram(aes(x = sample_1), bins = 30, fill='white', color='black') +
theme_bw() + xlab("Hours") +
geom_vline(aes(xintercept = mean(sample_1)), size=1) +
ggtitle(TeX(sprintf("Distribution of listening times ($\\bar{x}$ = %.2f)",round(mean(sample_1),2))))
```
Observe that in this first draw the mean ($\bar x =$ `r round(mean(sample_1),2)`) is quite close to the actual mean ($\mu =$ `r round(mean(hours),2)`). It seems like the sample mean is a decent estimate of the population mean. However, we could just be lucky this time and the next sample could turn out to have a different mean. Let us continue by looking at four additional random samples, consisting of 100 students each. The following plot shows the distribution of listening times for the four different samples from the population.
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.height = 6, fig.width = 8}
library(cowplot)
set.seed(8830)
student_sample <- sample(1:25000, size = 100, replace = FALSE)
means1 <- hours[student_sample]
plot1 <- ggplot(data.frame(means1)) +
geom_histogram(aes(x = means1), bins = 30, fill='white', color='black') +
theme_bw() + xlab("Hours") +
geom_vline(aes(xintercept = mean(means1)), size=1) +
ggtitle(TeX(sprintf("$\\bar{x}_1$ = %.2f",round(mean(means1),2))))
set.seed(6789)
student_sample <- sample(1:25000, size = 100, replace = FALSE)
means1 <- hours[student_sample]
plot2 <- ggplot(data.frame(means1)) +
geom_histogram(aes(x = means1), bins = 30, fill='white', color='black') +
theme_bw() + xlab("Hours") +
geom_vline(aes(xintercept = mean(means1)), size=1) +
ggtitle(TeX(sprintf("$\\bar{x}_2$ = %.2f",round(mean(means1),2))))
set.seed(3904)
student_sample <- sample(1:25000, size = 100, replace = FALSE)
means1 <- hours[student_sample]
plot3 <- ggplot(data.frame(means1)) +
geom_histogram(aes(x = means1), bins = 30, fill='white', color='black') +
theme_bw() + xlab("Hours") +
geom_vline(aes(xintercept = mean(means1)), size=1) +
ggtitle(TeX(sprintf("$\\bar{x}_3$ = %.2f",round(mean(means1),2))))
set.seed(3333)
student_sample <- sample(1:25000, size = 100, replace = FALSE)
means1 <- hours[student_sample]
plot4 <- ggplot(data.frame(means1)) +
geom_histogram(aes(x = means1), bins = 30, fill='white', color='black') +
theme_bw() + xlab("Hours") +
geom_vline(aes(xintercept = mean(means1)), size=1) +
ggtitle(TeX(sprintf("$\\bar{x}_4$ = %.2f",round(mean(means1),2))))
p <- plot_grid(plot1, plot2, plot3, plot4, ncol = 2,
labels = c("A", "B","C","D"))
title <- ggdraw() + draw_label('Distribution of listening times in four different samples', fontface='bold')
p <- plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1)) # rel_heights values control title margins
print(p)
```
It becomes clear that the mean is slightly different for each sample. This is referred to as **sampling variation** and it is completely fine to get a slightly different mean every time we take a sample. We just need to find a way of expressing the uncertainty associated with the fact that we only have data from one sample, because in a realistic setting you are most likely only going to have access to a single sample.
So in order to make sure that the first draw is not just pure luck and the sample mean is in fact a good estimate for the population mean, let us take **many** (e.g., $20,000$) different samples from the population. That is, we repeatedly draw 100 students randomly from the population without replacement (that is, once a student has been drawn she or he is removed from the pool and cannot be drawn again) and calculate the mean of each sample. This will show us a range within which the sample mean of any sample we take is likely going to be. We are going to store the means of all the samples in a matrix and then plot a histogram of the means to observe the likely values.
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.height = 4, fig.width = 6}
set.seed(12345)
samples <- 20000 # number of (hypothetical) samples to draw
means <- matrix(NA, nrow = samples) # container for the sample means
for (i in 1:samples){
student_sample <- sample(1:25000, size = 100, replace = FALSE) # draw 100 students at random
means[i,] <- mean(hours[student_sample]) # store the mean listening time of this sample
}
meansdf <- data.frame('true' = mean(hours), 'sample' = mean(means))
meansdf <- gather(meansdf)
ggplot(data.frame(means)) +
geom_histogram(aes(x = means), bins = 30, fill='white', color='black') +
theme_bw() +
geom_vline(data = meansdf, aes(xintercept = value, color = key, linetype = key), size=1) +
scale_color_discrete(labels = c("Mean of sample means", "Population mean")) +
scale_linetype_discrete(labels = c("Mean of sample means", "Population mean")) +
theme(legend.title = element_blank(),
legend.position = "bottom") +
labs(title = "Histogram of listening times",
subtitle = TeX(sprintf("Population mean ($\\mu$) = %.2f; population standard deviation ($\\sigma$) = %.2f",round(mean(hours),2),round(sd(hours),2))),
y = 'Number of students',
x = 'Hours')
```
As you can see, on average the sample mean ("mean of sample means") is extremely close to the population mean, despite only sampling $100$ people at a time. This distribution of sample means is also referred to as **sampling distribution** of the sample mean. However, there is some uncertainty, and the means are slightly different for the different samples and range from `r round(min(means),2)` to `r round(max(means),2)`.
### Standard error of the mean
Due to the variation in the sample means shown in our simulation, it is never possible to say exactly what the population mean is based on a single sample. However, even with a single sample we can infer a range of values within which the population mean is likely contained. In order to do so, notice that the sample means are approximately normally distributed. Another interesting fact is that the mean of sample means (i.e., `r round(mean(means),2)`) is roughly equal to the population mean (i.e., `r round(mean(hours),2)`). This tells us already that generally the sample mean is a good approximation of the population mean. However, in order to make statements about the expected range of possible values, we would need to know the standard deviation of the sampling distribution. The formal representation of the standard deviation of the sample means is
$$
\sigma_{\bar x} = {\sigma \over \sqrt{n}}
$$
where $\sigma$ is the population SD and $n$ is the sample size. $\sigma_{\bar{x}}$ is referred to as the **Standard Error** of the mean and it expresses the variation in sample means we should expect given the number of observations in our sample and the population SD. That is, it provides a measure of how precisely we can estimate the population mean from the sample mean.
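As a quick sanity check of this formula, we can compare the analytical standard error with the standard deviation of many simulated sample means (a sketch using a population analogous to the one in this chapter):
```{r}
set.seed(123)
population <- rnorm(25000, 50, 10)
n <- 100
sd(population) / sqrt(n)  # analytical standard error
# empirical SD of the means of many hypothetical samples
sample_means <- replicate(10000, mean(sample(population, n)))
sd(sample_means)
```
Both values are very close, as the formula predicts.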
#### Sample size
The first thing to notice here is that an increase in the **number of observations per sample** $n$ decreases the range of possible sample means (i.e., the standard error). This makes intuitive sense. Think of the two extremes: sample size $1$ and sample size $25,000$. With a single person in the sample we do not gain a lot of information and our estimate is very uncertain, which is expressed through a larger standard error. Looking at the histogram of listening times at the beginning of this chapter, some single-person samples would clearly yield values below $25$ or above $75$, much farther from the population mean than the minimum and maximum of our $100$-person samples. On the other hand, if we sample every student we get the population mean every time and thus we do not have any uncertainty (assuming the population does not change). Even if we only sample, say, $24,000$ people every time, we gain a lot of information about the population and the sample means would not be very different from each other, since at most $1,000$ people can differ between any two samples. Thus, with larger (smaller) samples, there is less (more) uncertainty that the sample is a good approximation of the entire population. The following plot shows the relationship between the sample size and the standard error. Samples of increasing size are randomly drawn from the population of WU students. You can see that the standard error decreases with the number of observations.
```{r message=FALSE, warning=FALSE, eval=TRUE, echo=FALSE, fig.align="center", fig.cap="Relationship between the sample size and the standard error"}
set.seed(321)
hours <- rnorm(25000, 50, 10)
R <- 1000
sems <- numeric()
replication <- numeric()
for (r in 10:R) {
y_sample <- sample(hours, r)
sem <- sd(hours)/sqrt(length(y_sample))
sems <- rbind(sems, sem)
replication <- rbind(replication, r)
}
df <- as.data.frame(cbind(replication, sems))
ggplot(data=df, aes(y = sems, x = replication)) +
geom_line() +
ylab("Standard error of the mean") +
xlab("Sample size") +
ggtitle('Relationship between sample size and standard error') +
theme_bw()
```
The following plots show the relationship between the sample size and the standard error in a slightly different way. The plots show the range of sample means resulting from the repeated sampling process for different sample sizes. Notice that the more students are contained in the individual samples, the less uncertainty there is when estimating the population mean from a sample (i.e., the possible values are more closely centered around the mean). So when the sample size is small, the sample mean can be expected to be very different the next time we take a sample. When the sample size is large, we can expect the sample means to be more similar every time we take a sample.
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.height = 6, fig.width = 8}
library(cowplot)
library(gridExtra)
library(grid)
library(latex2exp)
set.seed(12345)
sample_size = 10
samples <- 20000
means <- matrix(NA, nrow = samples)
for (i in 1:samples){
student_sample <- sample(1:25000, size = sample_size, replace = FALSE)
means[i,] <- mean(hours[student_sample])
}
meansdf <- data.frame('true' = mean(hours), 'sample' = mean(means))
meansdf <- gather(meansdf)
plot1 <- ggplot(data.frame(means)) +
geom_histogram(aes(x = means), bins = 30, fill='white', color='black') +
theme_bw() +
geom_vline(data = meansdf, aes(xintercept = value, color = key, linetype = key), size=1) +
scale_color_discrete(labels = c("Mean of sample means", "Population mean")) +
scale_linetype_discrete(labels = c("Mean of sample means", "Population mean")) +
theme(legend.position = "none") + ggtitle(TeX(sprintf("n = 10; $\\sigma_{\\bar x}$ = %.2f",round(sd(hours)/sqrt(sample_size),2))))
sample_size = 100
samples <- 20000
means <- matrix(NA, nrow = samples)
for (i in 1:samples){
student_sample <- sample(1:25000, size = sample_size, replace = FALSE)
means[i,] <- mean(hours[student_sample])
}
meansdf <- data.frame('true' = mean(hours), 'sample' = mean(means))
meansdf <- gather(meansdf)
plot2 <- ggplot(data.frame(means)) +
geom_histogram(aes(x = means), bins = 30, fill='white', color='black') +
theme_bw() +
geom_vline(data = meansdf, aes(xintercept = value, color = key, linetype = key), size=1) +
scale_color_discrete(labels = c("Mean of sample means", "Population mean")) +
scale_linetype_discrete(labels = c("Mean of sample means", "Population mean")) +
theme(legend.position = "none") + ggtitle(TeX(sprintf("n = 100; $\\sigma_{\\bar x}$ = %.2f",round(sd(hours)/sqrt(sample_size),2))))
sample_size = 1000
samples <- 20000
means <- matrix(NA, nrow = samples)
for (i in 1:samples){
student_sample <- sample(1:25000, size = sample_size, replace = FALSE)
means[i,] <- mean(hours[student_sample])
}
meansdf <- data.frame('true' = mean(hours), 'sample' = mean(means))
meansdf <- gather(meansdf)
plot3 <- ggplot(data.frame(means)) +
geom_histogram(aes(x = means), bins = 30, fill='white', color='black') +
theme_bw() +
geom_vline(data = meansdf, aes(xintercept = value, color = key, linetype = key), size=1) +
scale_color_discrete(labels = c("Mean of sample means", "Population mean")) +
scale_linetype_discrete(labels = c("Mean of sample means", "Population mean")) +
theme(legend.position = "none") + ggtitle(TeX(sprintf("n = 1,000; $\\sigma_{\\bar x}$ = %.2f",round(sd(hours)/sqrt(sample_size),2))))
sample_size = 10000
samples <- 20000
means <- matrix(NA, nrow = samples)
for (i in 1:samples){
student_sample <- sample(1:25000, size = sample_size, replace = FALSE)
means[i,] <- mean(hours[student_sample])
}
meansdf <- data.frame('true' = mean(hours), 'sample' = mean(means))
meansdf <- gather(meansdf)
plot4 <- ggplot(data.frame(means)) +
geom_histogram(aes(x = means), bins = 30, fill='white', color='black') +
theme_bw() +
geom_vline(data = meansdf, aes(xintercept = value, color = key, linetype = key), size=1) +
scale_color_discrete(labels = c("Mean of sample means", "Population mean")) +
scale_linetype_discrete(labels = c("Mean of sample means", "Population mean")) +
theme(legend.position = "none") + ggtitle(TeX(sprintf("n = 10,000; $\\sigma_{\\bar x}$ = %.2f",round(sd(hours)/sqrt(sample_size),2))))
p <- plot_grid(plot1, plot2, plot3, plot4, ncol = 2,
labels = c("A", "B","C","D"))
title <- ggdraw() + draw_label('Relationship between sample size and standard error', fontface='bold')
p <- plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1)) # rel_heights values control title margins
print(p)
```
As you can see, the standard deviation of the sample means ($\sigma_{\bar x}$) decreases as the sample size increases as a consequence of the reduced uncertainty about the true sample mean when we take larger samples.
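The $1/\sqrt{n}$ relationship also implies diminishing returns: to cut the standard error in half, the sample size has to quadruple. A rough numeric illustration, assuming the population SD of $10$ used in this chapter:
```{r}
sigma <- 10                # population SD assumed above
n <- c(10, 100, 1000, 10000)
round(sigma / sqrt(n), 2)  # standard error shrinks with the square root of n
```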
#### Population standard deviation
A second factor determining the standard deviation of the distribution of sample means ($\sigma_{\bar x}$) is the standard deviation associated with the population parameter ($\sigma$). Again, looking at the extremes illustrates this well. If all WU students listened to music for approximately the same amount of time, the samples would not differ much from each other. In other words, if the standard deviation in the population is lower, we expect the standard deviation of the sample means to be lower as well. This is illustrated by the following plots.
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.height = 3, fig.width = 8}
library(cowplot)
library(gridExtra)
library(grid)
set.seed(12345)
hours1 <- rnorm(25000, 50, 1)
sample_size = 100
samples <- 20000
means <- matrix(NA, nrow = samples)
for (i in 1:samples){
student_sample <- sample(1:25000, size = sample_size, replace = FALSE)
means[i,] <- mean(hours1[student_sample])
}
meansdf <- data.frame('true' = mean(hours1), 'sample' = mean(means))
meansdf <- gather(meansdf)
plot1 <- ggplot(data.frame(means)) +
geom_histogram(aes(x = means), bins = 30, fill='white', color='black') +
theme_bw() +
geom_vline(data = meansdf, aes(xintercept = value, color = key, linetype = key), size=1) +
scale_color_discrete(labels = c("Mean of sample means", "Population mean")) +
scale_linetype_discrete(labels = c("Mean of sample means", "Population mean")) +
theme(legend.position = "none") + ggtitle(TeX(sprintf("n = 100; $\\sigma = 1$; $\\sigma_{\\bar x}$ = %.2f",round(sd(hours1)/sqrt(sample_size),2))))
hours2 <- rnorm(25000, 50, 10)
sample_size = 100
samples <- 20000
means <- matrix(NA, nrow = samples)
for (i in 1:samples){
student_sample <- sample(1:25000, size = sample_size, replace = FALSE)
means[i,] <- mean(hours2[student_sample])
}
meansdf <- data.frame('true' = mean(hours2), 'sample' = mean(means))
meansdf <- gather(meansdf)
plot2 <- ggplot(data.frame(means)) +
geom_histogram(aes(x = means), bins = 30, fill='white', color='black') +
theme_bw() +
geom_vline(data = meansdf, aes(xintercept = value, color = key, linetype = key), size=1) +
scale_color_discrete(labels = c("Mean of sample means", "Population mean")) +
scale_linetype_discrete(labels = c("Mean of sample means", "Population mean")) +
theme(legend.position = "none") + ggtitle(TeX(sprintf("n = 100; $\\sigma = 10$; $\\sigma_{\\bar x}$ = %.2f",round(sd(hours2)/sqrt(sample_size),2))))
p <- plot_grid(plot1, plot2, ncol = 2,
labels = c("A", "B"))
title <- ggdraw() + draw_label('Relationship between population SD and standard error', fontface='bold')
p <- plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1)) # rel_heights values control title margins
print(p)
```
In the first plot (panel A), we assume a much smaller population standard deviation (e.g., $\sigma$ = 1 instead of $\sigma$ = 10). Notice how the smaller (larger) the population standard deviation, the less (more) uncertainty there is when estimating the population mean from a sample (i.e., the possible values are more closely centered around the mean). So when the population SD is large, the sample mean can be expected to be very different the next time we take a sample. When the population SD is small, we can expect the sample means to be more similar.
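As a quick numeric check, the standard errors in panels A and B follow directly from the formula, holding $n = 100$ fixed:
```{r}
n <- 100
sigma <- c(1, 10)  # the two population SDs assumed in the plots
sigma / sqrt(n)    # resulting standard errors
```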
## The Central Limit Theorem
The attentive reader might have noticed that the population above was generated from a normal distribution. It would be very restrictive if we could only analyze populations whose values are normally distributed. Furthermore, in reality we are unable to check whether the population values are normally distributed, since we do not observe the entire population. However, it turns out that the results generalize to many other distributions. This is described by the **Central Limit Theorem**.
The central limit theorem states that if **(1)** the population distribution has a mean (there are examples of distributions that don't have a mean, but we will ignore these here), and **(2)** we take a large enough sample, then the sampling distribution of the sample mean is approximately normally distributed. What exactly "large enough" means depends on the setting, but the interactive element at the end of this chapter illustrates how the sample size influences how accurately we can estimate the population parameters from the sample statistics.
To illustrate this, let's repeat the analysis above with a population drawn from a gamma distribution. In the previous example, we assumed a normal distribution, so a given student was most likely to spend around 50 hours per month listening to music. The following example depicts a case in which most students spend a similar amount of time listening to music, but a few students very rarely listen to music, and some music enthusiasts have a very high level of listening time. Here is a histogram of the listening times in the population:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.height = 4, fig.width = 6}
set.seed(321)
hours <- rgamma(25000, shape = 2, scale = 10)
ggplot(data.frame(hours)) +
geom_histogram(aes(x = hours), bins = 30, fill='white', color='black') +
geom_vline(xintercept = mean(hours), size = 1) + theme_bw() +
labs(title = "Histogram of listening times",
subtitle = TeX(sprintf("Population mean ($\\mu$) = %.2f; population standard deviation ($\\sigma$) = %.2f",round(mean(hours),2),round(sd(hours),2))),
y = 'Number of students',
x = 'Hours')
```
The vertical black line represents the population mean ($\mu$), which is `r round(mean(hours),2)` hours. The following plot depicts the histogram of listening times of four random samples from the population, each consisting of 100 students:
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.height = 6, fig.width = 8}
set.seed(8830)
student_sample <- sample(1:25000, size = 100, replace = FALSE)
means1 <- hours[student_sample]
plot1 <- ggplot(data.frame(means1)) +
geom_histogram(aes(x = means1), bins = 30, fill='white', color='black') +
theme_bw() + xlab("Hours") +
geom_vline(aes(xintercept = mean(means1)), size=1) +
ggtitle(TeX(sprintf("$\\bar{x}$ = %.2f",round(mean(means1),2))))
set.seed(6789)
student_sample <- sample(1:25000, size = 100, replace = FALSE)
means1 <- hours[student_sample]
plot2 <- ggplot(data.frame(means1)) +
geom_histogram(aes(x = means1), bins = 30, fill='white', color='black') +
theme_bw() + xlab("Hours") +
geom_vline(aes(xintercept = mean(means1)), size=1) +
ggtitle(TeX(sprintf("$\\bar{x}$ = %.2f",round(mean(means1),2))))
set.seed(3904)
student_sample <- sample(1:25000, size = 100, replace = FALSE)
means1 <- hours[student_sample]
plot3 <- ggplot(data.frame(means1)) +
geom_histogram(aes(x = means1), bins = 30, fill='white', color='black') +
theme_bw() + xlab("Hours") +
geom_vline(aes(xintercept = mean(means1)), size=1) +
ggtitle(TeX(sprintf("$\\bar{x}$ = %.2f",round(mean(means1),2))))
set.seed(3333)
student_sample <- sample(1:25000, size = 100, replace = FALSE)
means1 <- hours[student_sample]
plot4 <- ggplot(data.frame(means1)) +
geom_histogram(aes(x = means1), bins = 30, fill='white', color='black') +
theme_bw() + xlab("Hours") +
geom_vline(aes(xintercept = mean(means1)), size=1) +
ggtitle(TeX(sprintf("$\\bar{x}$ = %.2f",round(mean(means1),2))))
p <- plot_grid(plot1, plot2, plot3, plot4, ncol = 2,
labels = c("A", "B","C","D"))
title <- ggdraw() + draw_label('Distribution of listening times in four different samples', fontface='bold')
p <- plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1)) # rel_heights values control title margins
print(p)
```
As in the previous example, the mean is slightly different every time we take a sample due to sampling variation. Also note that the distribution of listening times no longer follows a normal distribution, because we now assume a gamma-distributed population with a positive skew (i.e., lower values are more likely and higher values less likely).
Let's see what happens to the distribution of sample means if we take an increasing number of samples, each drawn from the same gamma population:
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.height = 6, fig.width = 8}
library(cowplot)
library(gridExtra)
library(grid)
set.seed(321)
hours <- rgamma(25000, shape = 2, scale = 10)
samples <- 10
means <- matrix(NA, nrow = samples)
for (i in 1:samples){
student_sample <- sample(1:25000, size = 100, replace = FALSE)
means[i,] <- mean(hours[student_sample])
}
plot1 <- ggplot(data.frame(means)) +
geom_histogram(aes(x = means), bins = 30, fill='white', color='black') +
theme_bw() + xlim(14, 26) +
theme(legend.title = element_blank()) +
geom_vline(aes(xintercept = mean(means)), size=1) +
ggtitle(TeX(sprintf("%d samples; $\\mu_{\\bar x}$ = %.2f",samples, round(mean(means),2))))
samples <- 100
means <- matrix(NA, nrow = samples)
for (i in 1:samples){
student_sample <- sample(1:25000, size = 100, replace = FALSE)
means[i,] <- mean(hours[student_sample])
}
plot2 <- ggplot(data.frame(means)) +
geom_histogram(aes(x = means), bins = 30, fill='white', color='black') +
theme_bw() + xlim(14, 26) +
theme(legend.title = element_blank()) +
geom_vline(aes(xintercept = mean(means)), size=1) +
ggtitle(TeX(sprintf("%d samples; $\\mu_{\\bar x}$ = %.2f",samples, round(mean(means),2))))
samples <- 1000
means <- matrix(NA, nrow = samples)
for (i in 1:samples){
student_sample <- sample(1:25000, size = 100, replace = FALSE)
means[i,] <- mean(hours[student_sample])
}
plot3 <- ggplot(data.frame(means)) +
geom_histogram(aes(x = means), bins = 30, fill='white', color='black') +
theme_bw() + xlim(14, 26) +
theme(legend.title = element_blank()) +
geom_vline(aes(xintercept = mean(means)), size=1) +
ggtitle(TeX(sprintf("%d samples; $\\mu_{\\bar x}$ = %.2f",samples, round(mean(means),2))))
samples <- 10000
means <- matrix(NA, nrow = samples)
for (i in 1:samples){
student_sample <- sample(1:25000, size = 100, replace = FALSE)
means[i,] <- mean(hours[student_sample])
}
plot4 <- ggplot(data.frame(means)) +
geom_histogram(aes(x = means), bins = 30, fill='white', color='black') +
theme_bw() + xlim(14, 26) +
theme(legend.title = element_blank()) +
geom_vline(aes(xintercept = mean(means)), size=1) +
ggtitle(TeX(sprintf("%d samples; $\\mu_{\\bar x}$ = %.2f",samples, round(mean(means),2))))
p <- plot_grid(plot1, plot2, plot3, plot4, ncol = 2,
labels = c("A", "B","C","D"))
title <- ggdraw() + draw_label('Distribution of sample means from gamma population', fontface='bold')
p <- plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1)) # rel_heights values control title margins
print(p)
```
Two things are worth noting: **(1)** The more (hypothetical) samples we take, the more clearly the histogram approximates the shape of the sampling distribution, which is close to a normal distribution even though the population is skewed. **(2)** The mean of the sampling distribution of the sample mean ($\mu_{\bar x}$) is very similar to the population mean ($\mu$). From this we can see that the mean of a sample is a good estimate of the population mean.
In summary, it is important to distinguish two types of variation: **(1)** For each individual sample that we may take in real life, the standard deviation ($s$) is used to describe the **natural variation** in the data, and the data may follow a non-normal distribution. **(2)** If we were to (hypothetically!) repeat the study many times, the sampling distribution of the sample mean follows a normal distribution for large sample sizes (even if data from each individual study are non-normal), and the standard error ($\sigma_{\bar x}$) is used to describe the **variation between study results**. This is an important feature, since many statistical tests assume that the sampling distribution is normally distributed. As we have seen, this does **not** mean that the data from one particular sample needs to follow a normal distribution.
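The distinction can be made concrete with a single sample: the sample SD describes the spread of individual listening times, while the standard error describes how precisely the sample mean estimates the population mean. A minimal sketch using a non-normal sample:
```{r}
set.seed(123)
x <- rgamma(100, shape = 2, scale = 10)  # one (skewed) sample of 100 students
sd(x)              # natural variation of individual values
sd(x) / sqrt(100)  # standard error: variation of the mean across hypothetical studies
```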
## Using what we actually know
So far we have assumed that we know the population standard deviation ($\sigma$). This is an unrealistic assumption, since we do not know the entire population. The best guess for the population standard deviation we have is the sample standard deviation, denoted $s$. Thus, the standard error of the mean is usually estimated from the sample standard deviation:
$$
\sigma_{\bar x} \approx SE_{\bar x}={s \over \sqrt{n}}
$$
Note that $s$ itself is a sample estimate of the population parameter $\sigma$. This additional estimation introduces further uncertainty. You can see in the interactive element below that the sample SD, on average, provides a good estimate of the population SD. That is, the distribution of sample SDs that we get by drawing many samples is centered around the population value. Again, the larger the sample, the closer any given sample SD is going to be to the population parameter and we introduce less uncertainty. One conclusion is that your sample needs to be large enough to provide a reliable estimate of the population parameters. What exactly "large enough" means depends on the setting, but the interactive element illustrates how the remaining values change as a function of the sample size.
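The claim that the sample SD is, on average, a good estimate of the population SD can also be checked with a short simulation (a sketch mirroring the sampling exercise above):
```{r}
set.seed(123)
population <- rnorm(25000, 50, 10)
sample_sds <- replicate(10000, sd(sample(population, 100)))
mean(sample_sds)  # close to the population SD of 10
```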
We will not go into detail about the importance of random samples here, but the correctness of your estimate depends crucially on having a sample that actually represents the population. Unfortunately, we will usually not notice if the sample is non-random. Our statistics are still a good approximation of "a" population parameter, namely the one for the population that we actually sampled, but not the one we are interested in. To illustrate this, uncheck the "Random Sample" box below. The new sample will be drawn only from the top $50\%$ of music listeners (but this generalizes to other types of non-random samples).
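The following sketch illustrates the problem (analogous to unchecking the box in the interactive element below): we sample only from the top $50\%$ of listeners and compare the resulting sample mean to the mean of the population we actually care about:
```{r}
set.seed(123)
population <- rnorm(25000, 50, 10)
top_half <- population[population > median(population)]  # non-random: heavy listeners only
mean(sample(top_half, 100))  # estimates the mean of the sampled subpopulation...
mean(population)             # ...not the mean of the population of interest
```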
<iframe src="https://learn.wu.ac.at/shiny/imsm/clt/" style="border: none; width: 800px; height: 1265px"></iframe>
### Confidence Intervals for the Sample Mean
When we try to estimate parameters of a population (e.g., the population mean $\mu$) from a sample, the average value from a sample (e.g., the sample mean $\bar x$) only provides an estimate of the real population parameter. The next time you collect a sample of the same size, you could get a different average. This is sampling variation, and, as we have seen above, it is completely fine to get a slightly different sample mean every time we take a sample. However, this inherent uncertainty about the true population parameter means that coming up with an exact estimate (i.e., a **point estimate**) for a particular population parameter is really difficult. That is why it is often informative to construct a range around that statistic (i.e., an **interval estimate**) that likely contains the population parameter with a certain level of confidence. That is, we construct an interval such that for a large share (say 95%) of the sample means we could potentially get, the population mean is within that interval.
Let us consider one random sample of 100 students from our population above.
```{r, fig.height = 4, fig.width=6}
set.seed(321)
hours <- rgamma(25000, shape = 2, scale = 10)
set.seed(6789)
sample_size <- 100
student_sample <- sample(1:25000, size = sample_size, replace = FALSE)
hours_s <- hours[student_sample]
plot2 <- ggplot(data.frame(hours_s)) +
geom_histogram(aes(x = hours_s), bins = 30, fill='white', color='black') +
theme_bw() + xlab("Hours") +
geom_vline(aes(xintercept = mean(hours_s)), size=1) +
ggtitle(TeX(sprintf("Random sample; $n$ = %d; $\\bar{x}$ = %.2f; $s$ = %.2f",sample_size,round(mean(hours_s),2),round(sd(hours_s),2))))
plot2
```
From the central limit theorem we know that the sampling distribution of the sample mean is approximately normal and we know that for the normal distribution, 95% of the values lie within about 2 standard deviations from the mean. Actually, it is not exactly 2 standard deviations from the mean. To get the exact number, we can use the quantile function for the normal distribution ```qnorm()```:
```{r, fig.height = 4, fig.width=6}
qnorm(0.975)
```
We use ```0.975``` (and not ```0.95```) to account for the probability at each end of the distribution (i.e., 2.5% at the lower end and 2.5% at the upper end). We can see that 95% of the values are roughly within `r round(qnorm(0.975),2)` standard deviations from the mean. Since we know the sample mean ($\bar x$) and we can estimate the standard deviation of the sampling distribution ($\sigma_{\bar x} \approx {s \over \sqrt{n}}$), we can now easily calculate the lower and upper boundaries of our confidence interval as:
$$
CI_{lower} = {\bar x} - z_{1-{\alpha \over 2}} * \sigma_{\bar x} \\
CI_{upper} = {\bar x} + z_{1-{\alpha \over 2}} * \sigma_{\bar x}
$$
Here, $\alpha$ refers to the significance level. You can find a detailed discussion of this point at the end of the next chapter. For now, we will adopt the widely accepted significance level of 5% and set $\alpha$ to 0.05. Thus, $\pm z_{1-{\alpha \over 2}}$ gives us the z-scores (i.e., number of standard deviations from the mean) within which range 95% of the probability density lies.
Plugging in the values from our sample, we get:
```{r, fig.height = 4, fig.width=6}
sample_mean <- mean(hours_s)
se <- sd(hours_s)/sqrt(sample_size)
ci_lower <- sample_mean - qnorm(0.975)*se
ci_upper <- sample_mean + qnorm(0.975)*se
ci_lower
ci_upper
```
For our sample, the interval ranges from `r round(ci_lower,2)` to `r round(ci_upper,2)` hours. The correct interpretation is: if we collected many samples of the same size and computed the mean and confidence interval for each of them, the true population mean would be contained in $95\%$ of these intervals. This is illustrated in the plot below, which shows the means and confidence intervals of the first 100 samples:
```{r, fig.height = 15, fig.width=10}
set.seed(12)
samples <- 100
hours <- rgamma(25000, shape = 2, scale = 10)
means <- matrix(NA, nrow = samples)
for (i in 1:samples){
student_sample <- sample(1:25000, size = 100, replace = FALSE)
means[i,] <- mean(hours[student_sample])
}
means_sd <- data.frame(means, lower = means - qnorm(0.975) * (sd(hours)/sqrt(100)), upper = means + qnorm(0.975) * (sd(hours)/sqrt(100)), y = 1:100)
means_sd$diff <- factor(ifelse(means_sd$lower > mean(hours) | means_sd$upper < mean(hours), 'No', 'Yes'))
ggplot2::ggplot(means_sd, aes(y = y)) +
scale_y_continuous(breaks = seq(1, 100, by = 1), expand = c(0.005,0.005)) +
geom_point(aes(x = means, color = diff)) +
geom_errorbarh(aes(xmin = lower, xmax = upper, color = diff)) +
geom_vline(xintercept = mean(hours)) +
scale_color_manual(values = c("red", "black")) +
guides(color=guide_legend(title="True mean in CI")) +
theme_bw()
```
Note that this does **not** mean that for a specific sample there is a $95\%$ chance that the population mean lies within its confidence interval. The $95\%$ refers to the long-run behavior across many samples, which we do not actually draw in a real setting. You can think of the set of all possible confidence intervals like the sides of a coin: if we throw a coin many times, we will observe heads roughly half of the time, but this does not exclude the possibility of observing tails on each of the first 10 throws. Similarly, any specific confidence interval may or may not include the population mean, but if we take many samples, on average $95\%$ of the confidence intervals will include the population mean.
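We can verify this long-run interpretation with a simulation: construct a confidence interval for each of many samples (using the sample SD each time, as we would in practice) and count how often the population mean is covered. A sketch:
```{r}
set.seed(123)
population <- rgamma(25000, shape = 2, scale = 10)
mu <- mean(population)
covered <- replicate(10000, {
  x <- sample(population, 100)
  se <- sd(x) / sqrt(100)
  (mean(x) - qnorm(0.975) * se < mu) & (mu < mean(x) + qnorm(0.975) * se)
})
mean(covered)  # share of intervals containing the population mean; close to 0.95
```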
## Summary
When conducting research, we are (almost) always faced with the problem of not having access to the entire group we are interested in. Therefore, we use the sample properties derived in this chapter to conduct statistical inference based on a single sample. We use the statistics of the sample as well as the sample size to calculate the confidence interval of our choice (e.g., $95\%$). In practice, most of this is done for us, so you won't need to do it by hand. With a sample at hand, we need to choose an appropriate test and a null hypothesis. How this is done will be discussed in the next chapter.