Developing Data Products Course Project
========================================================
author: Xing Su
date: February 19, 2015
Estimating Variance
========================================================
transition: rotate
**Variance** is a statistical measure of the spread of a distribution.

For a discrete variable $X$, the population variance is

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^N (X_i - \mu)^2$$

where $X_i$ represents the observations, $\mu$ the population mean, and $N$ the number of observations in the population.

Since we rarely know the population statistics and are usually given only a sample, we have to estimate the variance from **sample statistics**.
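As a quick illustration, the formula above translates directly to R; the values below are arbitrary example data, not output from the app:

```{r}
# Population variance: mean squared deviation from the population mean
x <- c(4, 8, 15, 16, 23, 42)               # arbitrary example "population"
popVar <- sum((x - mean(x))^2) / length(x)
popVar
```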
Why does dividing by n-1 make the estimator unbiased?
========================================================
There are **two** ways of estimating the population variance using a sample:
$$S^2_{unbiased} = \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n-1} ~~~\mbox{and}~~~ S^2_{biased} = \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n}$$
The **unbiased estimator** is more commonly used and gives a *better* estimate. The only difference between the two calculations is the denominator, so ***why does dividing by $n-1$ make the estimator unbiased and better?***
To show this empirically, we will leverage a [Shiny Application](https://sxing.shinyapps.io/courseProject/) to simulate and analyze the variance estimates.
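For intuition, here is a minimal simulation sketch of the same idea (the population, sample size, and number of samples are arbitrary choices, not the app's defaults): on average, dividing by $n$ underestimates $\sigma^2$, while dividing by $n-1$ does not.

```{r}
set.seed(42)                                          # arbitrary seed for reproducibility
pop    <- sample(1:20, 500, replace = TRUE)           # toy population
sigma2 <- sum((pop - mean(pop))^2) / length(pop)      # true population variance
draws  <- replicate(10000, sample(pop, 10, replace = TRUE))   # 10,000 samples of size 10
unbiased <- apply(draws, 2, var)                              # divides by n - 1
biased   <- apply(draws, 2, function(s) sum((s - mean(s))^2) / length(s))  # divides by n
c(truth = sigma2, meanUnbiased = mean(unbiased), meanBiased = mean(biased))
```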
Shiny App
========================================================
The **Simulation Experiment** performs the following steps:

<small>1. create a population distribution by drawing a number of observations from the values 1 to 20</small>
<small>2. draw a number of samples of a specified size from that population</small>
<small>3. compare the individual sample variances with the true population variance</small>
<small>4. show the effect of sample size on the accuracy of the estimated variance</small>

The user can control the **number of observations**, **number of samples**, and **sample size**, and the app generates the corresponding plots with `ggplot2` and Google Visualization charts.
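A minimal sketch of how those three controls could be exposed as Shiny inputs; the input IDs, labels, ranges, and defaults below are illustrative placeholders, not taken from the actual app:

```{r eval=FALSE}
library(shiny)
# Hypothetical UI controls (IDs and ranges are placeholders)
sliderInput("numObs",     "Number of observations", min = 100, max = 1000, value = 500)
sliderInput("numSamples", "Number of samples",      min = 100, max = 5000, value = 1000)
sliderInput("sampleSize", "Sample size",            min = 2,   max = 50,   value = 10)
```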
Graphing Example
========================================================
**Google Visualization Plot Example from Shiny App:**
```{r results='asis', echo=FALSE}
library(googleVis)
# Toy population: 500 draws from the integers 1 to 20
pop <- sample(1:20, 500, replace = TRUE)
# 1,000 samples of size 10 drawn from the population, one sample per row
samples <- as.data.frame(matrix(sample(pop, 10000, replace = TRUE), nrow = 1000, ncol = 10))
# Biased variance estimate for each sample: divide by n (= ncol) rather than n - 1
estVar <- data.frame(estVar = rowSums((samples - rowMeans(samples))^2) / ncol(samples))
popHist <- gvisHistogram(estVar, options = list(
    height = "200px", legend = "{position: 'none'}",
    title = "Distribution of Biased Sample Variances",
    histogram = "{ hideBucketItems: true, bucketSize: 2 }",
    hAxis = "{ title: 'Values', showTextEvery: 3}", vAxis = "{ title: 'Frequency'}"))
print(popHist, "chart")
```
**`ggplot2` Plot Example from Shiny App:**
```{r results='asis', echo=FALSE, fig.width = 12, fig.height = 4}
library(ggplot2)
# True population variance (divide by N, the population size)
popVar <- sum((pop - mean(pop))^2) / length(pop)
# Difference between each biased sample variance and the population variance
difference <- estVar - popVar
difference <- cbind(index = 1:nrow(difference), difference)
varPlot <- ggplot(data = difference, aes(x = index, y = estVar)) +
    geom_point() +
    geom_hline(yintercept = 0, col = "orange", size = 2) +
    geom_smooth() +
    ggtitle("Difference Between Population and Biased Sample Variances") +
    labs(x = "Sample", y = "Biased Variance - Population Variance")
print(varPlot)
```