-
Notifications
You must be signed in to change notification settings - Fork 700
/
visualizing_distributions_II.Rmd
240 lines (185 loc) · 19.4 KB
/
visualizing_distributions_II.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
```{r echo = FALSE, message = FALSE}
# run setup script
source("_common.R")
```
# Visualizing distributions: Empirical cumulative distribution functions and q-q plots {#ecdf-qq}
In Chapter \@ref(histograms-density-plots), I described how we can visualize distributions with histograms or density plots. Both of these approaches are highly intuitive and visually appealing. However, as discussed in that chapter, they both share the limitation that the resulting figure depends to a substantial degree on parameters the user has to choose, such as the bin width for histograms and the bandwidth for density plots. As a result, both have to be considered as an interpretation of the data rather than a direct visualization of the data itself.
As an alternative to using histograms or density plots, we could simply show all the data points individually, as a point cloud. However, this approach becomes unwieldy for very large datasets, and in any case there is value in aggregate methods that highlight properties of the distribution rather than the individual data points. To solve this problem, statisticians have invented empirical cumulative distribution functions (ecdfs) and quantile--quantile (q-q) plots. These types of visualizations require no arbitrary parameter choices, and they show all of the data at once. Unfortunately, they are a little less intuitive than a histogram or a density plot is, and I don't see them used frequently outside of highly technical publications. They are quite popular among statisticians, though, and I think anybody interested in data visualization should be familiar with these techniques.
## Empirical cumulative distribution functions
To illustrate cumulative empirical distribution functions, I will begin with a hypothetical example that is closely modeled after something I deal with a lot as a professor in the classroom: a dataset of student grades. Assume our hypothetical class has 50 students, and the students just completed an exam on which they could score between 0 and 100 points. How can we best visualize the class performance, for example to determine appropriate grade boundaries?
We can plot the total number of students that have received at most a certain number of points versus all possible point scores. This plot will be an ascending function, starting at 0 for 0 points and ending at 50 for 100 points. A different way of thinking about this visualization is the following: We can rank all students by the number of points they obtained, in ascending order (so the student with the fewest points receives the lowest rank and the student with the most points the highest), and then plot the rank versus the actual points obtained. The result is an *empirical cumulative distribution function* (ecdf) or simply *cumulative distribution.* Each dot represents one student, and the lines visualize the highest student rank observed for any possible point value (Figure \@ref(fig:student-grades)).
(ref:student-grades) Empirical cumulative distribution function of student grades for a hypothetical class of 50 students.
```{r student-grades, fig.cap='(ref:student-grades)'}
set.seed(4211)
points = round(c(rnorm(47, mean = 82, sd = 10), 45, 51, 67))
points[points > 100] <- 100
student_data <- data.frame(points, rank = rank(points, ties.method = "random"))
ggplot(student_data, aes(x = points, y = 50*..y..)) +
stat_ecdf(geom = "step", color = "#0072B2") +
geom_point(aes(y = rank), color = "#0072B2") +
scale_x_continuous(limits = c(40, 102), expand = c(0, 0), breaks = 10*(4:10)) +
scale_y_continuous(limits = c(-.5, 55), expand = c(0, 0), name = "student rank (ascending)") +
coord_cartesian(clip = "off") +
theme_dviz_grid() +
theme(axis.line.x = element_blank())
```
You may wonder what happens if we rank the students the other way round, in descending order. This ranking simply flips the function on its head. The result is still an empirical cumulative distribution function, but the lines now represent the lowest student rank observed for any possible point value (Figure \@ref(fig:student-grades-desc)).
(ref:student-grades-desc) Distribution of student grades plotted as a descending ecdf.
```{r student-grades-desc, fig.cap='(ref:student-grades-desc)'}
ggplot(student_data, aes(x = points, y = 51-50*..y..)) +
stat_ecdf(geom = "step", color = "#0072B2") +
geom_point(aes(y = 51-rank), color = "#0072B2") +
scale_x_continuous(limits = c(40, 102), expand = c(0, 0), breaks = 10*(4:10)) +
scale_y_continuous(limits = c(-.5, 55), expand = c(0, 0), name = "student rank (descending)") +
coord_cartesian(clip = "off") +
theme_dviz_grid() +
theme(axis.line.x = element_blank())
```
Ascending cumulative distribution functions are more widely known and more commonly used than descending ones, but both have important applications. Descending cumulative distribution functions are critical when we want to visualize highly skewed distributions (see Section \@ref(skewed-distributions)).
In practical applications, it is quite common to draw the ecdf without highlighting the individual points and to normalize the ranks by the maximum rank, so that the *y* axis represents the cumulative frequency (Figure \@ref(fig:student-grades-normalized)).
(ref:student-grades-normalized) Ecdf of student grades. The student ranks have been normalized to the total number of students, such that the *y* values plotted correspond to the fraction of students in the class with at most that many points.
```{r student-grades-normalized, fig.cap='(ref:student-grades-normalized)'}
ggplot(student_data, aes(x = points, y = ..y..)) +
stat_ecdf(geom = "step", color = "#0072B2", size = 0.75) +
scale_x_continuous(limits = c(40, 102), expand = c(0, 0), breaks = 10*(4:10)) +
scale_y_continuous(limits = c(-.01, 1.01), expand = c(0, 0), name = "cumulative frequency") +
coord_cartesian(clip = "off") +
theme_dviz_grid() +
theme(axis.line.x = element_blank())
```
We can directly read off key properties of the student grade distribution from this plot. For example, approximately a quarter of the students (25%) received less than 75 points. The median point value (corresponding to a cumulative frequency of 0.5) is 81. Approximately 20% of the students received 90 points or more.
I find ecdfs handy for assigning grade boundaries because they help me locate the exact cutoffs that minimize student unhappiness. For example, in this example, there's a fairly long horizontal line right below 80 points, followed by a steep rise right at 80. This feature is caused by three students receiving 80 points on their exam while the next poorer performing student received only 76. In this scenario, I might decide that everybody with a point score of 80 or more receives a B and everybody with 79 or less receives a C. The three students with 80 points are happy that they just made a B, and the student with 76 realizes that they would have had to perform much better to not receive a C. If I had set the cutoff at 77, the distribution of letter grades would have been exactly the same, but I might find the student with 76 points visiting my office hoping to negotiate their grade up. Likewise, if I had set the cutoff at 81, I would likely have had three students in my office trying to negotiate their grade.
## Highly skewed distributions {#skewed-distributions}
Many empirical datasets display highly skewed distributions, in particular with heavy tails to the right, and these distributions can be challenging to visualize. Examples of such distributions include the number of people living in different cities or counties, the number of contacts in a social network, the frequency with which individual words appear in a book, the number of academic papers written by different authors, the net worth of individuals, and the number of interaction partners of individual proteins in protein--protein interaction networks (@Clauset-et-al-2009). All these distributions have in common that their right tail decays slower than an exponential function. In practice, this means that very large values are not that rare, even if the mean of the distribution is small. An important class of such distributions are *power-law* distributions, where the likelihood to observe a value that is *x* times larger than some reference point declines as a power of *x*. To give a concrete example, consider net worth in the US, which is distributed according to a power law with exponent 2. At any given level of net worth (say, $1 million), people with half that net worth are four times as frequent, and people with twice that net worth are one-fourth as frequent. Importantly, the same relationship holds if we use $10,000 as reference point or if we use $100 million. For this reason, power-law distributions are also called *scale-free* distributions.
Here, I will first discuss the number of people living in different US counties according to the 2010 US Census. This distribution has a very long tail to the right. Even though most counties have relatively small numbers of inhabitants (the median is 25,857), a few counties have extremely large numbers of inhabitants (e.g., Los Angeles County, with 9,818,605 inhabitants). If we try to visualize the distribution of population counts as either a density plot or an ecdf, we obtain figures that are essentially useless (Figure \@ref(fig:county-populations)).
(ref:county-populations) Distribution of the number of inhabitants in US counties, according to the 2010 US Census. (a) Density plot. (b) Empirical cumulative distribution function.
```{r county-populations, fig.width = 5, fig.asp = 2*0.618, fig.cap='(ref:county-populations)'}
p1 <- ggplot(US_census, aes(x=pop2010)) +
geom_density(fill = "#56B4E9", color = "transparent") +
scale_x_continuous(expand = c(0.01, 0), name = "number of inhabitants",
limits = c(0, 2.3e6),
breaks = 0.5e6*(0:4),
labels = label_log10) +
scale_y_continuous(expand = c(0, 0), name = expression(paste("density [x", 10^-5, "]")),
breaks = c(0, 4e-6, 8e-6, 1.2e-5, 1.6e-5),
labels = c(0, 0.4, 0.8, 1.2, 1.6)) +
coord_cartesian(clip = "off") +
theme_dviz_grid(12) +
theme(plot.margin = margin(3, 1.5, 12, 1.5))
p2 <- ggplot(US_census, aes(x=pop2010)) +
stat_ecdf(geom = "step", color = "#0072B2", pad = FALSE) +
scale_x_continuous(expand = c(0.01, 0), name = "number of inhabitants",
limits = c(0, 2.3e6),
breaks = 0.5e6*(0:4),
labels = label_log10) +
scale_y_continuous(expand = c(0.01, 0), name = "cumulative frequency") +
coord_cartesian(clip = "off") +
theme_dviz_grid(12) +
theme(plot.margin = margin(3, 1.5, 12, 1.5))
plot_grid(p1, p2, ncol = 1, align = 'v', labels = 'auto')
```
The density plot (Figure \@ref(fig:county-populations)a) shows a sharp peak right at 0 and virtually no details of the distribution are visible. Similarly, the ecdf (Figure \@ref(fig:county-populations)b) shows a rapid rise near 0 and again no details of the distribution are visible. For this particular dataset, we can log-transform the data and visualize the distribution of the log-transformed values. This transformation works here because the population numbers in counties is not actually a power law, but instead follow a nearly perfect log-normal distribution (see Section \@ref(qq-plots)). Indeed, the density plot of the log-transformed values shows a nice bell curve and the corresponding ecdf shows a nice sigmoidal shape (Figure \@ref(fig:county-populations-log)).
(ref:county-populations-log) Distribution of the logarithm of the number of inhabitants in US counties. (a) Density plot. (b) Empirical cumulative distribution function.
```{r county-populations-log, fig.width = 5, fig.asp = 2*0.618, fig.cap='(ref:county-populations-log)'}
p1_log <- ggplot(US_census, aes(x=log10(pop2010))) +
geom_density(fill = "#56B4E9", color = "transparent") +
scale_x_continuous(
expand = c(0.01, 0),
name = expression(paste("log"["10"], "(number of inhabitants)"))
) +
scale_y_continuous(expand = c(0, 0), name = "density") +
coord_cartesian(clip = "off") +
theme_dviz_grid(12) +
theme(plot.margin = margin(3, 1.5, 12, 1.5))
p2_log <- ggplot(US_census, aes(x=log10(pop2010))) +
stat_ecdf(geom = "step", color = "#0072B2", pad = FALSE) +
scale_x_continuous(
expand = c(0.01, 0),
name = expression(paste("log"["10"], "(number of inhabitants)"))
) +
scale_y_continuous(expand = c(0.01, 0), name = "cumulative frequency") +
coord_cartesian(clip = "off") +
theme_dviz_grid(12) +
theme(plot.margin = margin(3, 1.5, 12, 1.5))
plot_grid(p1_log, p2_log, ncol = 1, align = 'v', labels = 'auto')
```
To see that this distribution is not a power law, we plot it as a descending ecdf with logarithmic *x* and *y* axes. In this visualization, a power law appears as a perfect straight line. For the population counts in counties, the right tail forms almost but not quite a straight line on the descending log-log ecdf plot (Figure \@ref(fig:county-populations-tail-log-log)).
(ref:county-populations-tail-log-log) Relative frequency of counties with at least that many inhabitants versus the number of county inhabitants.
```{r county-populations-tail-log-log, fig.cap='(ref:county-populations-tail-log-log)'}
ggplot(US_census, aes(x=pop2010, y = 1-..y..)) +
stat_ecdf(geom = "step", color = "#0072B2", size = 0.75, pad = FALSE) +
scale_x_log10(expand = c(0.01, 0),
breaks = c(1e2, 1e3, 1e4, 1e5, 1e6, 1e7),
labels = c(expression(10^2), expression(10^3), expression(10^4),
expression(10^5), expression(10^6), expression(10^7)),
name = "number of county inhabitants") +
scale_y_log10(expand = c(0.01, 0), breaks = c(1e-3, 1e-2, 1e-1, 1), name = "relative frequency") +
theme_dviz_grid() +
theme(plot.margin = margin(3.5, 7, 3.5, 1.5))
```
As a second example, I will use the distribution of word frequencies for all words that appear in the novel Moby Dick. This distribution follows a perfect power law. When plotted as descending ecdf with logarithmic axes, we see a nearly perfect straight line (Figure \@ref(fig:word-counts-tail-log-log)).
(ref:word-counts-tail-log-log) Distribution of word counts in the novel Moby Dick. Shown is the relative frequency of words that occur at least that many times in the novel versus the number of times words are used.
```{r word-counts-tail-log-log, fig.cap='(ref:word-counts-tail-log-log)' }
ggplot(Moby_Dick, aes(x=count, y = 1-..y..)) +
stat_ecdf(geom = "step", color = "#0072B2", size = 0.75, pad = FALSE) +
scale_x_log10(expand = c(0.01, 0), breaks = c(1, 10, 100, 1000, 10000),
name = "number of times word is used") +
scale_y_log10(expand = c(0.01, 0), breaks = c(1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1),
labels = c(expression(10^-5), expression(10^-4), expression(10^-3),
expression(10^-2), expression(10^-1), expression(10^0)),
name = "fraction of words") +
theme_dviz_grid()
```
## Quantile--quantile plots {#qq-plots}
Quantile--quantile (q-q) plots are a useful visualization when we want to determine to what extent the observed data points do or do not follow a given distribution. Just like ecdfs, q-q plots are also based on ranking the data and visualizing the relationship between ranks and actual values. However, in q-q plots we don't plot the ranks directly, we use them to predict where a given data point should fall if the data were distributed according to a specified reference distribution. Most commonly, q-q plots are constructed using a normal distribution as the reference. To give a concrete example, assume the actual data values have a mean of 10 and a standard deviation of 3. Then, assuming a normal distribution, we would expect a data point ranked at the 50th percentile to lie at position 10 (the mean), a data point at the 84th percentile to lie at position 13 (one standard deviation above from the mean), and a data point at the 2.3rd percentile to lie at position 4 (two standard deviations below the mean). We can carry out this calculation for all points in the dataset and then plot the observed values (i.e., values in the dataset) against the theoretical values (i.e., values expected given each data point's rank and the assumed reference distribution).
When we perform this procedure for the student grades distribution from the beginning of this chapter, we obtain Figure \@ref(fig:student-grades-qq).
(ref:student-grades-qq) q-q plot of student grades.
```{r student-grades-qq, fig.width = 5, fig.asp = .98, fig.cap='(ref:student-grades-qq)'}
# estimate distribution parameters (mean and sd)
params <- as.list(MASS::fitdistr(student_data$points, "normal")$estimate)
# axis line segments
df_segment <- data.frame(
x = c(50, -Inf),
xend = c(100, -Inf),
y = c(-Inf, 50),
yend = c(-Inf, 100)
)
ggplot(student_data, aes(sample = points)) +
geom_abline(slope = 1, intercept = 0, color = "grey70") +
stat_qq(dparams = params, color = "#0072B2") +
geom_segment(
data = df_segment,
aes(x = x, xend = xend, y = y, yend = yend),
size = 0.5, inherit.aes = FALSE) +
scale_x_continuous(limits = c(43, 107), expand = c(0, 0), breaks = 10*(5:10)) +
scale_y_continuous(limits = c(43, 102), expand = c(0, 0), breaks = 10*(5:10), name = "observed") +
coord_fixed(clip = "off") +
theme_dviz_open() +
theme(axis.line = element_blank())
```
The solid line here is not a regression line but indicates the points where *x* equals *y*, i.e., where the observed values equal the theoretical ones. To the extent that points fall onto that line, the data follow the assumed distribution (here, normal). We see that the student grades follow mostly a normal distribution, with a few deviations at the bottom and at the top (a few students performed worse than expected on either end). The deviations from the distribution at the top end are caused by the maximum point value of 100 in the hypothetical exam; regardless of how good the best student is, he or she could at most obtain 100 points.
We can also use a q-q plot to test my assertion from earlier in this chapter that the population counts in US counties follow a log-normal distribution. If these counts are log-normally distributed, then their log-transformed values are normally distributed and hence should fall right onto the *x* = *y* line. When making this plot, we see that the agreement between the observed and the theoretical values is exceptional (Figure \@ref(fig:county-populations-qq)). This demonstrates that the distribution of population counts among counties is indeed log-normal.
(ref:county-populations-qq) q-q plot of the logarithm of the number of inhabitants in US counties.
```{r county-populations-qq, fig.width = 5, fig.asp = 1, fig.cap='(ref:county-populations-qq)'}
# estimate distribution parameters (mean and sd)
params <- as.list(MASS::fitdistr(log(US_census$pop2010), "normal")$estimate)
# axis line segments
df_segment <- data.frame(
x = c(5, -Inf),
xend = c(15, -Inf),
y = c(-Inf, 5),
yend = c(-Inf, 15)
)
ggplot(US_census, aes(sample = log(pop2010))) +
geom_abline(slope = 1, intercept = 0, color = "grey70") +
stat_qq(dparams = params, color = "#0072B2") +
geom_segment(
data = df_segment,
aes(x = x, xend = xend, y = y, yend = yend),
size = 0.5, inherit.aes = FALSE) +
scale_x_continuous(limits = c(4, 16), expand = c(0, 0), breaks = 5+2.5*(0:4)) +
scale_y_continuous(limits = c(4, 16), expand = c(0, 0), name = "observed", breaks = 5+2.5*(0:4)) +
coord_fixed(clip = "off") +
theme_dviz_open() +
theme(axis.line = element_blank())
```