index.Rmd

---
title: "Data Wrangling"
author: "Spencer Pease"
date: "February 3, 2017"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
    echo = FALSE,
    warning = FALSE,
    message = FALSE,
    fig.retina = 4,
    fig.width = 10,
    fig.height = 7
    )

# Needed libraries
library(knitr)
library(reshape2)
library(psych)
library(ggplot2)
library(plotly)
library(dplyr)

# Source analysis
source('analysis.R')
source('multiplot.r')
source('plot_qual.r')

```

## **Metric Definition**

### Gini Coefficient:
$$\frac { \sum _{ i=1 }^{ n }{ \sum _{ j=1 }^{ n }{ { t }_{ i }{ t }_{ j }\left|
{ p }_{ i }-{ p }_{ j } \right| } } }{ 2{ T }^{ 2 }P(1-P) } $$

Where $n$ is the number of regions, ${t}_{i}$ is the total population of region
$i$, ${p}_{i}$ is the minority population proportion in region $i$, $T$ is the
total population across all regions, and $P$ is minority population proportion
across all regions.

The _Gini Coefficient_ defines segregation by the _evenness_ of a population. It
essentially describes the average difference in minority population proportions
across all regions in a city, expressed over the maximum difference in the
city to give a proportion from 0 to 1, with higher values indicating more
segregation (average difference is closer to the max difference). ${p}_{i}$ and
${t}_{i}$ give tell us the size of a minority population in a region, which
we can then compare across regions, and normalize against the total minority
population ($P$ and $T$). This metric is great for gauging differences between
regions, as it specifically compares distances between all regions. It is also a
comparison of spatial distributions, something easy to visualize and understand.
However, it is naive to think that the physical locations of a minority
population is the only thing that contributes to segregation. This measure
also leaves out possibly important factors such as region location and detailed
dynamics within a region, such as the size of two groups being compared.


### Correlation Ratio:
$$\frac { (I-P) }{ (1-P) } ;\quad I=\sum _{ i=1 }^{ n }{ \left[ \left( \frac
{ { x }_{ i } }{ X } \right) \left( \frac { { y }_{ i } }{ { t }_{ i } } \right)
\right] } $$

Where $n$ is the number of regions, $I$ is the _isolation index_, $P$ is the
minority population proportion across all regions, ${x}_{i}$ is the minority
population of area $i$, ${y}_{i}$ is the majority population of area $i$, $X$ is
the total minority population across all regions, and ${t}_{i}$ is the total
population of region $i$.

The _Correlation Ratio_ is a method of measuring the potential contact between
minority and majority group members, indicating the extent to which two groups
share common residential areas. This measure is an adjusted version of the
_Isolation Index_, which measures the probability a minority person shares
an area with another minority person, correcting for the possibility of more
than one minority group. It produces a value from 0 to 1, with higher values
indicating more segregation. The isolation index is determined by looking at
the proportion of minority members $\left( \frac{{x}_{i}}{X} \right)$ and
proportion of majority group members $\left( \frac{{y}_{i}}{{t}_{i}} \right)$
in a region. The correlation ratio then takes the isolation index and puts it in
the context of the total minority proportion in a city $P$. This is a good
metric to use if you want more insight on how living in a segregated area can
affect a person's life, outside of where they live. Howeverm this metric does
doesn't realate one region to another at all, which prevents us from seeing
changes across a city.


### Delta Index:
$$0.5\sum _{ i=1 }^{ n }{ \left| \left( \frac { { x }_{ i } }{ X } \right) -
\left( \frac { { a }_{ i } }{ A } \right) \right| } $$

Where $n$ is the number of regions, ${x}_{i}$ is the minority population of area
$i$, $X$ is the total minority population across all regions, ${a}_{i}$ is
the area of region $i$ in square meters, and $A$ is the total area across all
regions in square meters.

The _Delta Index_ measures the concentration of a minority group. This metric 
gives us the proportion of minority members living in areas with above average
proportions of minority people. It can be looked at as the proportion of a group
that would have to move to different regions to get a more uniform density. The
metric finds this by looking at the absolute differences in fraction of total
minorities and fraction of total area for a given region, 
$\left( \frac {{x}_{i}}{X} \right) -\left( \frac {{a}_{i}}{A} \right)$. One of 
the features of the Delta Index is that it uses area data to better understand
the physical regions were people live. Unfortunatly, it uses only one other
souce of data in it's measurements, which could leave out important information.
Also, this metric does not compare between regions, only looking at the total.
This makes it hard to look at trends between regions.


## **Metric Comparison**

After computing these metrics, we can directly compare the segregation of 
various cities:

```{r segregation table}
kable(seg.metrics, digits = 2)
```

All of the metrics used are defined on a normalized scale, with **higher values
indicating higher segregation**. It is important to note, however, that even
though all of these metrics have the same range in value, the scales are not
necessarily equivalent. A .5 Gini Coefficient is not the same as a .5 Delta 
Index, for example.

```{r most segregated}
max.gini <- seg.metrics %>% 
    filter(Gini == max(Gini)) %>% 
    select(City, Gini)

max.corr <- seg.metrics %>% 
    filter(Correlation == max(Correlation)) %>% 
    select(City, Correlation)

max.delta <- seg.metrics %>% 
    filter(Delta == max(Delta)) %>% 
    select(City, Delta)
```

According to the Gini Coefficient, the most segregated city is
**`r max.gini[1, 1]`** (`r round(max.gini[1, 2], digits = 2)`), the Correlation
Ratio says it's **`r max.corr[1, 1]`** (`r round(max.corr[1, 2], digits = 2)`),
and the Delta Index shows **`r max.delta[1, 1]`**
(`r round(max.delta[1, 2], digits = 2)`), as the most segregated. To better 
understand the variation in segregation metrics, we visualize the data:

```{r metric bar plot}

p.bar.plot <-
    plot_ly(
        seg.metrics,
        x = ~ City,
        y = ~ Gini,
        type = 'bar',
        name = 'Gini'
    ) %>%
    add_trace(
        y = ~ Correlation,
        name = 'Correlation') %>%
    add_trace(
        y = ~ Delta,
        name = 'Delta') %>%
    layout(
        title = 'Segregation Across Cities by Metric',
        xaxis = list(title = "City", 
                     tickangle = -45),
        yaxis = list(title = 'Value'),
        barmode = 'group',
        margin = list(b = 90,
                      t = 100))

p.bar.plot
```

Here we see that while the Gini Coefficient and Correlation Ratio appear to have
some nontrivial degree of correlation, the Delta Index has no relation to the
other two metrics. We can show that this is the case by testing the correlation 
of each metric:

```{r metric correlation}

# Correlation and p-value data
cors <- corr.test(seg.metrics[2:4])

gini.corr.scatter <- ggplot(
    data = seg.metrics,
    aes(x = Gini,
        y = Correlation)
    ) +
    geom_point(
        size = 3,
        color = "red"
    ) +
    stat_smooth(
        method = "lm"
    ) +
    labs(
        title = "Gini Coeff vs Correlation Ratio",
        subtitle = paste('Cor', 
                         round(cors$r[2, 1],
                               digits = 2),
                         '    p-value:',
                         round(cors$p[2, 1],
                               digits = 2)),
        x = "Gini",
        y = "Correlation"
    )

gini.delta.scatter <- ggplot(
    data = seg.metrics,
    aes(x = Gini,
        y = Delta)
    ) +
    geom_point(
        size = 3,
        color = "red"
    ) +
    stat_smooth(
        method = "lm"
    ) +
    labs(
        title = "Gini Coeff vs Delta Index",
        subtitle = paste('Cor:', 
                         round(cors$p[3, 1],
                               digits = 2),
                         '    p-value:',
                         round(cors$p[3, 1],
                               digits = 2)),
        x = "Gini",
        y = "Delta"
    )

delta.corr.scatter <- ggplot(
    data = seg.metrics,
    aes(x = Delta,
        y = Correlation)
    ) +
    geom_point(
        size = 3,
        color = "red"
    ) +
    stat_smooth(
        method = "lm"
    ) +
    labs(
        title = "Delta Index vs Correlation Ratio",
        subtitle = paste('Cor:',
                         round(cors$r[3, 2],
                               digits = 2),
                         '    p-value:',
                         round(cors$p[3, 2],
                               digits = 2)),
        x = "Delta",
        y = "Correlation"
    )


multiplot(gini.corr.scatter, 
          gini.delta.scatter,
          delta.corr.scatter,
          cols = 2)

```

These correlations can be attributed to the fact that the Delta index is the 
only index to make use of area data. Since the Gini Coefficient and Correlation
Ratio rely on many of the same variables, it makes sense that they are 
correlated because they pull from the same data. The addition of the area data
in the Delta index means it should vary differently, as it pulls from different
data.

This is evident in the change in segregation ranking for each metric. Gini and
Correlation have almost the same ranking, but the Delta Index is wildly
different.

```{r plot rank change}
# Reshape data
rank.data <- seg.metrics %>%
    melt() %>% 
    rename(metric = variable) %>% 
    dcast(metric ~ City)

par(mar = c(0, 0, 1, 0), family = 'serif')
plot.qual(
    rank.data,
    rs.ln = 6,
    alpha = 0.5,
    dt.tx = T,
    main = 'Changes in Segregation Rank Across Metrics')
```

_Note: there is no reason to start with any particular metric in the above 
visual, but keeping Correlation and Gini next to each other shows thier 
similarity._


## Metric Proposal
$$\sum _{ i=1 }^{n}{ \sum _{j=1}^{n}{ \left[ \left| \frac {{p}_{i}}{{a }_{i}} -\frac {{p}_{j}}{{a}_{j}}  \right|  \right] }} $$

Where $n$ is the number of regions, ${p}_{i}$ is the minority population 
proportion in region $i$, and ${a}_{i}$ is the area in square meters of region
$i$.

This proposed metric measures the relative difference in minority population
proportion per unit of area $\frac {p}{a}$ between all regions $i$. If a region
has a larger percentage of minorities in a smaller area than some other region,
the difference will be larger, which indicates a higher segregation. This
measure has the benefit of being able to compare multiple regions against each
other, allowing us to better understand changes across a city, as well as taking
into consideration physical population density.

```{r new metric}

seg.metrics['New Metric'] = as.vector(sapply(city.data, newMetric))

new.rank.data <- seg.metrics %>%
    melt() %>% 
    rename(metric = variable) %>% 
    dcast(metric ~ City)

max.new.metric <- seg.metrics %>% 
    filter(`New Metric` == max(`New Metric`)) %>% 
    select(City, `New Metric`)

par(mar = c(0, 0, 1, 0), family = 'serif')
plot.qual(
    new.rank.data,
    rs.ln = 6,
    alpha = 0.5,
    dt.tx = T,
    main = 'Changes in Segregation Rank Across Metrics (with New Metric)')
```

According to this new metric, the most segregated city is
**`r max.new.metric[1,1]`**.

Here again we see little relation between the new metric and the Gini
Coefficient and Correlation Ratio, likely because of the inclusion of
area data. More interesting is how the rankings change between the Delta
Index and new metric. Between these two, there is only a correlation of
**`r round(cor(seg.metrics['New Metric'], seg.metrics['Delta']), digits = 2)`**.