---
title: "Data Wrangling"
author: "Spencer Pease"
date: "February 3, 2017"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = FALSE,
  warning = FALSE,
  message = FALSE,
  fig.retina = 4,
  fig.width = 10,
  fig.height = 7
)

# Needed libraries
library(knitr)
library(reshape2)
library(psych)
library(ggplot2)
library(plotly)
library(dplyr)

# Source analysis
source('analysis.R')
source('multiplot.r')
source('plot_qual.r')
```
## **Metric Definition**
### Gini Coefficient:
$$\frac { \sum _{ i=1 }^{ n }{ \sum _{ j=1 }^{ n }{ { t }_{ i }{ t }_{ j }\left|
{ p }_{ i }-{ p }_{ j } \right| } } }{ 2{ T }^{ 2 }P(1-P) } $$
Where $n$ is the number of regions, ${t}_{i}$ is the total population of region
$i$, ${p}_{i}$ is the minority population proportion in region $i$, $T$ is the
total population across all regions, and $P$ is the minority population proportion
across all regions.
The _Gini Coefficient_ defines segregation by the _evenness_ of a population. It
essentially describes the average difference in minority population proportions
between every pair of regions in a city, weighted by the regions' populations
and expressed over the maximum possible difference to give a value from 0 to 1,
with higher values indicating more segregation (the average difference is closer
to the maximum difference). Together, ${p}_{i}$ and ${t}_{i}$ tell us the size
of the minority population in a region, which we can then compare across regions
and normalize against the citywide figures ($P$ and $T$). This metric is well
suited to gauging differences between regions, as it explicitly compares every
region with every other region. It is also a comparison of spatial
distributions, something easy to visualize and understand.
However, it is naive to think that the physical location of a minority
population is the only thing that contributes to segregation. This measure
also leaves out possibly important factors such as where regions sit relative to
one another and detailed dynamics within a region, such as the size of the two
groups being compared.
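
To make the formula concrete, here is a minimal sketch of the Gini Coefficient in R. The function name `gini_sketch` and the toy inputs are hypothetical and only for illustration; this is not necessarily how `analysis.R` computes the metric.

```r
# Hypothetical helper for illustration; not the function used in analysis.R.
# t: total population of each region
# p: minority population proportion of each region
gini_sketch <- function(t, p) {
  T_total <- sum(t)                 # T: total population across all regions
  P <- sum(t * p) / T_total         # P: citywide minority proportion
  weights <- outer(t, t)            # t_i * t_j for every pair of regions
  diffs <- abs(outer(p, p, "-"))    # |p_i - p_j| for every pair of regions
  sum(weights * diffs) / (2 * T_total^2 * P * (1 - P))
}

# Made-up data: two fully separated regions, then two identical regions
gini.separated <- gini_sketch(t = c(100, 100), p = c(1, 0))
gini.even <- gini_sketch(t = c(100, 100), p = c(0.3, 0.3))
```

With these toy inputs the fully separated case should come out at the top of the scale (1) and the evenly mixed case at the bottom (0), matching the interpretation given above.
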
### Correlation Ratio:
$$\frac { (I-P) }{ (1-P) } ;\quad I=\sum _{ i=1 }^{ n }{ \left[ \left( \frac
{ { x }_{ i } }{ X } \right) \left( \frac { { x }_{ i } }{ { t }_{ i } } \right)
\right] } $$

Where $n$ is the number of regions, $I$ is the _isolation index_, $P$ is the
minority population proportion across all regions, ${x}_{i}$ is the minority
population of area $i$, $X$ is the total minority population across all regions,
and ${t}_{i}$ is the total population of region $i$.

The _Correlation Ratio_ is a method of measuring the potential contact between
minority and majority group members, indicating the extent to which two groups
share common residential areas. This measure is an adjusted version of the
_Isolation Index_, which measures the probability a minority person shares
an area with another minority person, correcting for the possibility of more
than one minority group. It produces a value from 0 to 1, with higher values
indicating more segregation. The isolation index is determined by looking at
each region's share of the city's minority members
$\left( \frac{{x}_{i}}{X} \right)$ and the proportion of that region's residents
who are minority members $\left( \frac{{x}_{i}}{{t}_{i}} \right)$. The
correlation ratio then takes the isolation index and puts it in
the context of the total minority proportion in a city $P$. This is a good
metric to use if you want more insight on how living in a segregated area can
affect a person's life, outside of where they live. However, this metric
does not relate one region to another at all, which prevents us from seeing
changes across a city.
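
As a rough sketch, the Correlation Ratio can be computed in a few lines of R. This assumes the isolation index is taken as the exposure of minority members to other minority members, $\sum (x_i / X)(x_i / t_i)$, which is the reading consistent with the description above. The function name and toy data are hypothetical, not taken from `analysis.R`.

```r
# Hypothetical helper for illustration; not the function used in analysis.R.
# x: minority population of each region
# t: total population of each region
correlation_ratio_sketch <- function(x, t) {
  P <- sum(x) / sum(t)                       # citywide minority proportion
  # Assumed isolation index: minority share of city times minority share of region
  isolation <- sum((x / sum(x)) * (x / t))
  (isolation - P) / (1 - P)
}

# Made-up data: all minority members in one region, then an even spread
corr.separated <- correlation_ratio_sketch(x = c(100, 0), t = c(100, 100))
corr.even <- correlation_ratio_sketch(x = c(30, 30), t = c(100, 100))
```

In the even case the isolation index equals $P$, so the ratio should be 0. In the fully separated case the isolation index is 1, so the ratio should be 1.
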
### Delta Index:
$$0.5\sum _{ i=1 }^{ n }{ \left| \left( \frac { { x }_{ i } }{ X } \right) -
\left( \frac { { a }_{ i } }{ A } \right) \right| } $$
Where $n$ is the number of regions, ${x}_{i}$ is the minority population of area
$i$, $X$ is the total minority population across all regions, ${a}_{i}$ is
the area of region $i$ in square meters, and $A$ is the total area across all
regions in square meters.
The _Delta Index_ measures the concentration of a minority group. This metric
gives us the proportion of minority members living in areas with above-average
densities of minority members. It can be looked at as the proportion of a group
that would have to move to different regions to get a more uniform density. The
metric finds this by looking at the absolute difference between a region's
fraction of total minorities and its fraction of total area,
$\left| \left( \frac {{x}_{i}}{X} \right) -\left( \frac {{a}_{i}}{A} \right) \right|$.
One of the features of the Delta Index is that it uses area data to better
understand the physical regions where people live. Unfortunately, area is the
only additional source of data it uses, which could leave out important
information. Also, this metric does not compare regions with one another,
looking only at the citywide total. This makes it hard to look at trends between
regions.
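
The Delta Index formula translates almost directly into R. The sketch below uses a hypothetical function name and made-up inputs; it illustrates the formula and is not the code in `analysis.R`.

```r
# Hypothetical helper for illustration; not the function used in analysis.R.
# x: minority population of each region
# a: area of each region (any consistent unit, since only shares are used)
delta_sketch <- function(x, a) {
  0.5 * sum(abs(x / sum(x) - a / sum(a)))
}

# Made-up data: population spread in proportion to area, then concentrated
delta.uniform <- delta_sketch(x = c(50, 50), a = c(1e6, 1e6))
delta.concentrated <- delta_sketch(x = c(90, 10), a = c(1e6, 3e6))
```

In the second case the first region holds 90% of the minority population on 25% of the land, so the index is $0.5\,(|0.9 - 0.25| + |0.1 - 0.75|) = 0.65$.
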
## **Metric Comparison**
After computing these metrics, we can directly compare the segregation of
various cities:
```{r segregation table}
kable(seg.metrics, digits = 2)
```
All of the metrics used are defined on a normalized scale, with **higher values
indicating higher segregation**. It is important to note, however, that even
though all of these metrics have the same range in value, the scales are not
necessarily equivalent. A 0.5 Gini Coefficient is not the same as a 0.5 Delta
Index, for example.
```{r most segregated}
# City with the highest value under each metric
max.gini <- seg.metrics %>%
  filter(Gini == max(Gini)) %>%
  select(City, Gini)

max.corr <- seg.metrics %>%
  filter(Correlation == max(Correlation)) %>%
  select(City, Correlation)

max.delta <- seg.metrics %>%
  filter(Delta == max(Delta)) %>%
  select(City, Delta)
```
According to the Gini Coefficient, the most segregated city is
**`r max.gini[1, 1]`** (`r round(max.gini[1, 2], digits = 2)`). The Correlation
Ratio identifies **`r max.corr[1, 1]`** (`r round(max.corr[1, 2], digits = 2)`),
and the Delta Index identifies **`r max.delta[1, 1]`**
(`r round(max.delta[1, 2], digits = 2)`). To better
understand the variation in segregation metrics, we visualize the data:
```{r metric bar plot}
# Grouped bar chart: one bar per metric for each city
p.bar.plot <-
  plot_ly(
    seg.metrics,
    x = ~ City,
    y = ~ Gini,
    type = 'bar',
    name = 'Gini'
  ) %>%
  add_trace(
    y = ~ Correlation,
    name = 'Correlation') %>%
  add_trace(
    y = ~ Delta,
    name = 'Delta') %>%
  layout(
    title = 'Segregation Across Cities by Metric',
    xaxis = list(title = "City",
                 tickangle = -45),
    yaxis = list(title = 'Value'),
    barmode = 'group',
    margin = list(b = 90,
                  t = 100))

p.bar.plot
```
Here we see that while the Gini Coefficient and Correlation Ratio appear to have
some nontrivial degree of correlation, the Delta Index appears to have little
relation to the other two metrics. We can check this by testing the correlation
between each pair of metrics:
```{r metric correlation}
# Correlation and p-value data
# (the indexing below assumes columns 2:4 are Gini, Correlation, Delta)
cors <- corr.test(seg.metrics[2:4])

gini.corr.scatter <- ggplot(
  data = seg.metrics,
  aes(x = Gini,
      y = Correlation)
) +
  geom_point(
    size = 3,
    color = "red"
  ) +
  stat_smooth(
    method = "lm"
  ) +
  labs(
    title = "Gini Coeff vs Correlation Ratio",
    subtitle = paste('Cor:',
                     round(cors$r[2, 1],
                           digits = 2),
                     ' p-value:',
                     round(cors$p[2, 1],
                           digits = 2)),
    x = "Gini",
    y = "Correlation"
  )

gini.delta.scatter <- ggplot(
  data = seg.metrics,
  aes(x = Gini,
      y = Delta)
) +
  geom_point(
    size = 3,
    color = "red"
  ) +
  stat_smooth(
    method = "lm"
  ) +
  labs(
    title = "Gini Coeff vs Delta Index",
    # Correlation comes from cors$r; cors$p holds the p-value
    subtitle = paste('Cor:',
                     round(cors$r[3, 1],
                           digits = 2),
                     ' p-value:',
                     round(cors$p[3, 1],
                           digits = 2)),
    x = "Gini",
    y = "Delta"
  )

delta.corr.scatter <- ggplot(
  data = seg.metrics,
  aes(x = Delta,
      y = Correlation)
) +
  geom_point(
    size = 3,
    color = "red"
  ) +
  stat_smooth(
    method = "lm"
  ) +
  labs(
    title = "Delta Index vs Correlation Ratio",
    subtitle = paste('Cor:',
                     round(cors$r[3, 2],
                           digits = 2),
                     ' p-value:',
                     round(cors$p[3, 2],
                           digits = 2)),
    x = "Delta",
    y = "Correlation"
  )

multiplot(gini.corr.scatter,
          gini.delta.scatter,
          delta.corr.scatter,
          cols = 2)
```
These correlations are plausibly explained by the data each metric draws on. The
Gini Coefficient and Correlation Ratio rely on many of the same population
variables, so it makes sense that they are correlated. The Delta Index is the
only one of the three to make use of area data, so it can vary independently of
the other two.
This is evident in the change in segregation ranking for each metric. Gini and
Correlation have almost the same ranking, but the Delta Index is wildly
different.
```{r plot rank change}
# Reshape data: one row per metric, one column per city
rank.data <- seg.metrics %>%
  melt() %>%
  rename(metric = variable) %>%
  dcast(metric ~ City)

par(mar = c(0, 0, 1, 0), family = 'serif')

plot.qual(
  rank.data,
  rs.ln = 6,
  alpha = 0.5,
  dt.tx = TRUE,
  main = 'Changes in Segregation Rank Across Metrics')
```
_Note: there is no reason to start with any particular metric in the above
visual, but keeping Correlation and Gini next to each other shows their
similarity._
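
The rank comparison can also be expressed numerically. The sketch below uses a made-up data frame (`toy`, with invented city names and values, not the results in `seg.metrics`) to show one way of turning metric values into ranks and comparing the orderings with a Spearman correlation.

```r
# Made-up values for illustration only; not the results in seg.metrics.
toy <- data.frame(
  City = c("A", "B", "C", "D"),
  Gini = c(0.80, 0.60, 0.70, 0.50),
  Correlation = c(0.55, 0.35, 0.40, 0.20),
  Delta = c(0.60, 0.90, 0.70, 0.80)
)

# Rank 1 = most segregated city under each metric
toy.ranks <- data.frame(
  City = toy$City,
  lapply(toy[-1], function(v) rank(-v))
)

# Spearman correlation compares the orderings rather than the raw values
rank.cor <- cor(toy[-1], method = "spearman")
```

In this toy data Gini and Correlation order the cities identically, so their Spearman correlation is 1, while Delta orders them very differently and correlates negatively with both.
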
## **Metric Proposal**
$$\sum _{ i=1 }^{n}{ \sum _{j=1}^{n}{ \left[ \left| \frac {{p}_{i}}{{a }_{i}} -\frac {{p}_{j}}{{a}_{j}} \right| \right] }} $$
Where $n$ is the number of regions, ${p}_{i}$ is the minority population
proportion in region $i$, and ${a}_{i}$ is the area in square meters of region
$i$.
This proposed metric sums the absolute differences in minority population
proportion per unit of area, $\frac {p}{a}$, between every pair of regions. If a
region has a larger percentage of minorities in a smaller area than some other
region, the difference will be larger, which indicates higher segregation. This
measure has the benefit of comparing every region against every other, allowing
us to better understand changes across a city, while also taking physical
population density into consideration. Unlike the three metrics above, it is not
normalized to a 0 to 1 scale.
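
A minimal sketch of the proposed formula in R follows. The function name `proposed_metric_sketch` and the inputs are hypothetical; the `newMetric` function sourced from `analysis.R` may be implemented differently.

```r
# Hypothetical helper for illustration; not the newMetric function in analysis.R.
# p: minority population proportion of each region
# a: area of each region
proposed_metric_sketch <- function(p, a) {
  density <- p / a                              # proportion per unit of area
  sum(abs(outer(density, density, "-")))        # sum over all ordered pairs (i, j)
}

# Made-up data: identical regions, then one small region with a high proportion
new.even <- proposed_metric_sketch(p = c(0.5, 0.5), a = c(2, 2))
new.uneven <- proposed_metric_sketch(p = c(0.8, 0.2), a = c(2, 4))
```

In the second case the two regions have $p/a$ values of 0.4 and 0.05. Because the double sum visits each pair twice, the result is $2 \times 0.35 = 0.7$. The raw value also depends on the area unit and on the number of regions, which is worth keeping in mind when comparing cities.
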
```{r new metric}
# Add the proposed metric as a new column
seg.metrics['New Metric'] <- as.vector(sapply(city.data, newMetric))

# Reshape data: one row per metric, one column per city
new.rank.data <- seg.metrics %>%
  melt() %>%
  rename(metric = variable) %>%
  dcast(metric ~ City)

max.new.metric <- seg.metrics %>%
  filter(`New Metric` == max(`New Metric`)) %>%
  select(City, `New Metric`)

par(mar = c(0, 0, 1, 0), family = 'serif')

plot.qual(
  new.rank.data,
  rs.ln = 6,
  alpha = 0.5,
  dt.tx = TRUE,
  main = 'Changes in Segregation Rank Across Metrics (with New Metric)')
```
According to this new metric, the most segregated city is
**`r max.new.metric[1,1]`**.
Here again we see little relation between the new metric and the Gini
Coefficient and Correlation Ratio, likely because of the inclusion of
area data. More interesting is how the rankings change between the Delta
Index and the new metric. Between these two, the correlation is only
**`r round(cor(seg.metrics[['New Metric']], seg.metrics[['Delta']]), digits = 2)`**.