Skip to content

Commit

Permalink
day 21
Browse files Browse the repository at this point in the history
  • Loading branch information
Nics-Github committed Oct 22, 2024
1 parent b890286 commit e19bb7a
Show file tree
Hide file tree
Showing 5 changed files with 280 additions and 4 deletions.
14 changes: 11 additions & 3 deletions docs/index.html
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">

<meta name="author" content="Nicholas Schwab">
<meta name="dcterms.date" content="2024-10-20">
<meta name="dcterms.date" content="2024-10-22">

<title>SDS 192 - SDS 192: Intro to Data Science</title>
<style>
Expand Down Expand Up @@ -185,6 +185,7 @@ <h2 id="toc-title">On this page</h2>

<ul>
<li><a href="#schedule" id="toc-schedule" class="nav-link active" data-scroll-target="#schedule">Schedule</a></li>
<li><a href="#day-21" id="toc-day-21" class="nav-link" data-scroll-target="#day-21">Day 21</a></li>
<li><a href="#day-20-pivots" id="toc-day-20-pivots" class="nav-link" data-scroll-target="#day-20-pivots">Day 20 Pivots</a></li>
<li><a href="#day-19-lab-5" id="toc-day-19-lab-5" class="nav-link" data-scroll-target="#day-19-lab-5">Day 19 lab 5</a></li>
<li><a href="#day-18-joining-tables" id="toc-day-18-joining-tables" class="nav-link" data-scroll-target="#day-18-joining-tables">Day 18 Joining tables</a></li>
Expand Down Expand Up @@ -234,7 +235,7 @@ <h1 class="title">SDS 192: Intro to Data Science</h1>
<div>
<div class="quarto-title-meta-heading">Published</div>
<div class="quarto-title-meta-contents">
<p class="date">October 20, 2024</p>
<p class="date">October 22, 2024</p>
</div>
</div>

Expand All @@ -256,7 +257,7 @@ <h1 class="title">SDS 192: Intro to Data Science</h1>
<h1>Schedule</h1>
<!-- This is the wesbite for the schedule below: -->
<!-- https://docs.google.com/spreadsheets/d/1j1Mub7lb1CJKpG-PySzdbzzgvryVQ3XrlvNgPRUcN0I/edit?usp=sharing -->
<iframe src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTG_b6upDZcPxXB92LHYIhV1uAXgeVGaMsZ3sDRPuSQN74JdhkKhZQVjw6T68I3zsP8KGYzDSVTZW64/pubhtml?gid=0&amp;single=true&amp;widget=true&amp;headers=false&amp;gid=0&amp;range=A1:E24
<iframe src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTG_b6upDZcPxXB92LHYIhV1uAXgeVGaMsZ3sDRPuSQN74JdhkKhZQVjw6T68I3zsP8KGYzDSVTZW64/pubhtml?gid=0&amp;single=true&amp;widget=true&amp;headers=false&amp;gid=0&amp;range=A1:E32
" width="100%" height="300">
</iframe>
<!-- # Day 38 -->
Expand Down Expand Up @@ -400,6 +401,13 @@ <h1>Schedule</h1>
<!-- [dpylr cheatsheet for wrangling](https://nyu-cdsc.github.io/learningr/assets/data-transformation.pdf) -->
<!-- [Here is the link](https://classroom.github.com/a/eSF9MUxn) for Lab 4. -->
</section>
<section id="day-21" class="level1">
<h1>Day 21</h1>
<p>Lab 5 reflection - personal information</p>
<p>Many to one join warning.</p>
<p><a href="https://mdsr-book.github.io/mdsr3e/06-dataII.html#dataII-exercises">Pivot Practice - Problem 5 chapter 6</a></p>
<p><a href="https://classroom.github.com/a/kprELIeW">Project 2</a></p>
</section>
<section id="day-20-pivots" class="level1">
<h1>Day 20 Pivots</h1>
<p><a href="./course-materials/lectures/20_SQF.html">Reflection on Lab 4</a></p>
Expand Down
239 changes: 239 additions & 0 deletions folder/lab_4_solutions.rmarkdown
Original file line number Diff line number Diff line change
@@ -0,0 +1,239 @@
---
title: 'Lab 4: Exploratory Data Analysis'
subtitle: "solutions"
format:
html:
self-contained: true
editor: source
knitr:
opts_chunk:
message: false
---

```{r}
#| warning: false
#| messages: false
#| document: show
library(tidyverse)
sqf_url <- "https://www1.nyc.gov/assets/nypd/downloads/zip/analysis_and_planning/stop-question-frisk/sqf-2011-csv.zip"
temp <- tempfile()
download.file(sqf_url, temp)
sqf_zip <- unzip(temp, "2011.csv")
sqf_2011 <- read.csv(sqf_zip, stringsAsFactors = FALSE)
sqf_2011_race_cat <- read.csv("https://raw.githubusercontent.com/lindsaypoirier/STS-101/master/Data/SQF/sqf_race_categories.csv", stringsAsFactors = FALSE)
rm(sqf_url)
rm(temp)
#rm(sqf_zip)
```

```{r}
#| warning: false
#| messages: false
#| document: show
sqf_2011 <-
sqf_2011 |>
select(pct, race, age, frisked, pistol, riflshot, asltweap, knifcuti, machgun, othrweap, sumissue, arstmade) |>
left_join(sqf_2011_race_cat, by = "race") |>
mutate(across(frisked:arstmade,
~ case_when(. == "Y" ~ 1, . == "N" ~ 0)))
rm(sqf_2011_race_cat)
```

###### Question
::: question
Add two new columns. The first should indicate whether a weapon was found, and the second should indicate whether an arrest/summons was made.
:::

```{r}
#| eval: true
sqf_2011 <-
sqf_2011 |>
#Add a variable for weapon found
mutate(wpnfound = case_when(pistol == 1 |
riflshot == 1 |
asltweap == 1 |
knifcuti == 1 |
machgun == 1 |
othrweap == 1 ~ 1,
TRUE ~ 0))
sqf_2011 <-
sqf_2011 |>
#Add a variable for arrest made or summons issued
mutate(arrestsumm = case_when(sumissue == 1 |
arstmade == 1 ~ 1,
TRUE ~ 0))
```

###### Question
::: question
Subset the dataset to the six variables listed in the data dictionary above.
:::

```{r}
#| eval: true
sqf_2011 <-
sqf_2011 |>
select(pct, arrestsumm, age, wpnfound, race_cat, frisked)
```

###### Question
::: question
Calculate the number of stops in 2011. If you are not sure which function to use below, you may want to refer to the list of Summary functions in the the Data Wrangling cheatsheet. Remember that each row in the data frame is a stop.
:::

```{r}
#| eval: true
total_stops <-
sqf_2011 |>
summarize(Count = n()) |>
pull()
total_stops
```

###### Question
::: question
How many stops did not result in an arrest or summons in 2011? What percentage of stops did not result in an arrest or summons?
:::

```{r}
#| eval: true
sqf_2011 |>
#Subset to rows where suspect innocent
filter(arrestsumm == 0) |>
#Calculate number of observations
summarise(total_innocent = n(),
percent_innocent = n() / total_stops * 100)
```

###### Question
::: question
In how many stops were the individuals aged 14-24? In what percentage of stops were the individuals aged 14-24?
:::

```{r}
#| eval: true
sqf_2011 |>
#Subset to rows where suspect age 14-24
filter(age >= 14 & age <= 24) |>
#Calculate number of observations and percentage of observations
summarise(total_14_24 = n(),
percent_14_24 = n() / total_stops * 100)
```

###### Question
::: question
Fix the code below to calculate the correct number of stops for individuals 14-24.
:::

```{r}
#| eval: true
total_stops_age_recorded <-
sqf_2011 |>
#Subset to rows where age is not 999
filter(age != 999) |>
summarize(Count = n()) |>
pull()
sqf_2011 |>
filter(age >= 14 & age <= 24) |>
summarize(total_14_24 = n(),
percent_14_24 = n() / total_stops_age_recorded * 100)
```

```{r}
#| eval: true
total_stops_age_recorded <-
sqf_2011 |>
#Subset to rows where age is not 999
filter(age != 999) |>
summarize(Count = n()) |>
pull()
sqf_2011 |>
filter(age >= 14 & age < 24) |>
summarize(total_14_24 = n(),
percent_14_24 = n() / total_stops_age_recorded * 100)
```

This still doesn't match the values we see on the website, but it does match the values we see in the NYCLU's 2011 report on Stop, Question, and Frisk data. This is typically when I would reach out to a representative at the NYCLU to inquire about the discrepancy.
###### Question
::: question
How many stops were there per race in 2011? What percentage of stops per race in 2011? Arrange by number of stops in descending order.
:::

```{r}
#| eval: true
total_stops_race_recorded <-
sqf_2011 |>
#Subset to rows where race_cat is not NA or "OTHER"
filter(!is.na(race_cat) & race_cat != "OTHER") |>
summarize(Count = n()) |>
pull()
sqf_2011 |>
#Subset to rows where race_cat is not NA or "OTHER"
filter(!is.na(race_cat) & race_cat != "OTHER") |>
#Group by race
group_by(race_cat) |>
#Calculate number of observations
summarise(stops = n(),
percent_stops = n() / total_stops_race_recorded * 100) |>
#Sort by stops in descending order
arrange(desc(stops))
```

###### Question
::: question
In how many stops were the individuals identified as Latinx (i.e. "WHITE-HISPANIC" or "BLACK-HISPANIC")? In what percentage of stops were the individuals identified as Latinx?
:::

```{r}
#| eval: true
sqf_2011 |>
#Subset to rows where race_cat is "WHITE-HISPANIC" or "BLACK-HISPANIC"
filter(race_cat %in% c("WHITE-HISPANIC", "BLACK-HISPANIC")) |>
#Calculate number of observations
summarise(stops_Latinx = n(),
percent_Latinx = n() / total_stops_race_recorded * 100)
```

###### Question
::: question
What percentage of stops in 2011 resulted in a frisk per race? What percentage of stops in 2011 resulted in a weapon found per race? What percentage of stops in 2011 resulted in an arrest or summons per race? In your resulting data table, each row should be a race, and there should be columns for `stops`, `percent_stops`, `percent_frisked`, `percent_wpnfound` , and `percent_arrestsumm`.
:::

```{r}
#| eval: true
# Write code here.
sqf_2011_percents <- sqf_2011 |>
filter(race_cat %in% c(
"BLACK","WHITE-HISPANIC", "WHITE", "BLACK-HISPANIC", "ASIAN/PACIFIC ISLANDER", "AMERICAN INDIAN/ALASKAN NATIVE")
) |>
group_by(race_cat) |>
summarise(stops = n(),
percent_stops = n()/total_stops *100,
percent_frisked = sum(frisked)/n() *100,
percent_wpnfound = sum(wpnfound)/n() *100,
percent_arrestsumm = sum(arrestsumm)/n() *100
)|>
kableExtra::kable()
sqf_2011_percents
```

###### Question
::: question
Below, in 2-3 sentences, summarize what you learn from reviewing the summary statistics from the code above. What does this tell us about the constitutionality of 2011 stop and frisk activity in NYC?
:::

```{r}
#| document: show
#Make a data visualization
```

# Any lab visualization is acceptable.

```{r}
sqf_2011_percents |>
filter(!is.na(race_cat) & race_cat != "OTHER") |>
ggplot() +
geom_col(aes(x=race_cat , y= percent_frisked)) +
coord_flip()
```

17 changes: 17 additions & 0 deletions folder/lab_4_solutions_files/execute-results/html.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
{
"hash": "00be1c39f37dba9fcf5e1bf5c09f406a",
"result": {
"engine": "knitr",
"markdown": "---\ntitle: 'Lab 4: Exploratory Data Analysis'\nsubtitle: \"solutions\"\nformat:\n html:\n self-contained: true\neditor: source\nknitr:\n opts_chunk:\n message: false\n---\n\n::: {.cell messages='false' document='show'}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nsqf_url <- \"https://www1.nyc.gov/assets/nypd/downloads/zip/analysis_and_planning/stop-question-frisk/sqf-2011-csv.zip\"\ntemp <- tempfile()\ndownload.file(sqf_url, temp)\nsqf_zip <- unzip(temp, \"2011.csv\")\nsqf_2011 <- read.csv(sqf_zip, stringsAsFactors = FALSE) \nsqf_2011_race_cat <- read.csv(\"https://raw.githubusercontent.com/lindsaypoirier/STS-101/master/Data/SQF/sqf_race_categories.csv\", stringsAsFactors = FALSE) \nrm(sqf_url)\nrm(temp)\nrm(sqf_zip)\n```\n:::\n\n::: {.cell messages='false' document='show'}\n\n```{.r .cell-code}\nsqf_2011 <- \n sqf_2011 |> \n select(pct, race, age, frisked, pistol, riflshot, asltweap, knifcuti, machgun, othrweap, sumissue, arstmade) |>\n left_join(sqf_2011_race_cat, by = \"race\") |>\n mutate(across(frisked:arstmade, \n ~ case_when(. == \"Y\" ~ 1, . == \"N\" ~ 0)))\nrm(sqf_2011_race_cat)\n```\n:::\n\n###### Question\n::: question\nAdd two new columns. The first should indicate whether a weapon was found, and the second should indicate whether an arrest/summons was made.\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nsqf_2011 <- \n sqf_2011 |>\n #Add a variable for weapon found\n mutate(wpnfound = case_when(pistol == 1 |\n riflshot == 1 | \n asltweap == 1 |\n knifcuti == 1 | \n machgun == 1 | \n othrweap == 1 ~ 1,\n TRUE ~ 0))\nsqf_2011 <- \n sqf_2011 |>\n #Add a variable for arrest made or summons issued\n mutate(arrestsumm = case_when(sumissue == 1 | \n arstmade == 1 ~ 1,\n TRUE ~ 0))\n```\n:::\n\n###### Question\n::: question\nSubset the dataset to the six variables listed in the data dictionary above.\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nsqf_2011 <-\n sqf_2011 |>\n select(pct, arrestsumm, age, wpnfound, race_cat, frisked)\n```\n:::\n\n###### Question\n::: question\nCalculate the number of stops in 2011. If you are not sure which function to use below, you may want to refer to the list of Summary functions in the the Data Wrangling cheatsheet. Remember that each row in the data frame is a stop.\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ntotal_stops <-\n sqf_2011 |>\n summarize(Count = n()) |>\n pull()\ntotal_stops\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 685724\n```\n\n\n:::\n:::\n\n###### Question\n::: question\nHow many stops did not result in an arrest or summons in 2011? What percentage of stops did not result in an arrest or summons?\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nsqf_2011 |>\n #Subset to rows where suspect innocent\n filter(arrestsumm == 0) |> \n #Calculate number of observations\n summarise(total_innocent = n(), \n percent_innocent = n() / total_stops * 100)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n total_innocent percent_innocent\n1 605328 88.27575\n```\n\n\n:::\n:::\n\n###### Question\n::: question\nIn how many stops were the individuals aged 14-24? In what percentage of stops were the individuals aged 14-24?\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nsqf_2011 |>\n #Subset to rows where suspect age 14-24\n filter(age >= 14 & age <= 24) |> \n #Calculate number of observations and percentage of observations\n summarise(total_14_24 = n(), \n percent_14_24 = n() / total_stops * 100)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n total_14_24 percent_14_24\n1 346226 50.49058\n```\n\n\n:::\n:::\n\n###### Question\n::: question\nFix the code below to calculate the correct number of stops for individuals 14-24. \n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ntotal_stops_age_recorded <-\n sqf_2011 |>\n #Subset to rows where age is not 999\n filter(age != 999) |> \n summarize(Count = n()) |>\n pull()\nsqf_2011 |>\n filter(age >= 14 & age <= 24) |>\n summarize(total_14_24 = n(), \n percent_14_24 = n() / total_stops_age_recorded * 100)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n total_14_24 percent_14_24\n1 346226 50.5542\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ntotal_stops_age_recorded <-\n sqf_2011 |>\n #Subset to rows where age is not 999\n filter(age != 999) |> \n summarize(Count = n()) |>\n pull()\nsqf_2011 |>\n filter(age >= 14 & age < 24) |>\n summarize(total_14_24 = n(), \n percent_14_24 = n() / total_stops_age_recorded * 100)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n total_14_24 percent_14_24\n1 321381 46.92646\n```\n\n\n:::\n:::\n\nThis still doesn't match the values we see on the website, but it does match the values we see in the NYCLU's 2011 report on Stop, Question, and Frisk data. This is typically when I would reach out to a representative at the NYCLU to inquire about the discrepancy.\n###### Question\n::: question\nHow many stops were there per race in 2011? What percentage of stops per race in 2011? Arrange by number of stops in descending order.\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ntotal_stops_race_recorded <-\n sqf_2011 |>\n #Subset to rows where race_cat is not NA or \"OTHER\"\n filter(!is.na(race_cat) & race_cat != \"OTHER\") |> \n summarize(Count = n()) |>\n pull()\nsqf_2011 |>\n #Subset to rows where race_cat is not NA or \"OTHER\"\n filter(!is.na(race_cat) & race_cat != \"OTHER\") |> \n #Group by race\n group_by(race_cat) |> \n #Calculate number of observations\n summarise(stops = n(), \n percent_stops = n() / total_stops_race_recorded * 100) |>\n #Sort by stops in descending order\n arrange(desc(stops)) \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 3\n race_cat stops percent_stops\n <chr> <int> <dbl>\n1 BLACK 350743 52.9 \n2 WHITE-HISPANIC 175302 26.4 \n3 WHITE 61805 9.32 \n4 BLACK-HISPANIC 48438 7.30 \n5 ASIAN/PACIFIC ISLANDER 23932 3.61 \n6 AMERICAN INDIAN/ALASKAN NATIVE 2897 0.437\n```\n\n\n:::\n:::\n\n###### Question\n::: question\nIn how many stops were the individuals identified as Latinx (i.e. \"WHITE-HISPANIC\" or \"BLACK-HISPANIC\")? In what percentage of stops were the individuals identified as Latinx?\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nsqf_2011 |>\n #Subset to rows where race_cat is \"WHITE-HISPANIC\" or \"BLACK-HISPANIC\"\n filter(race_cat %in% c(\"WHITE-HISPANIC\", \"BLACK-HISPANIC\")) |> \n #Calculate number of observations\n summarise(stops_Latinx = n(), \n percent_Latinx = n() / total_stops_race_recorded * 100)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n stops_Latinx percent_Latinx\n1 223740 33.74065\n```\n\n\n:::\n:::\n\n###### Question\n::: question\nWhat percentage of stops in 2011 resulted in a frisk per race? What percentage of stops in 2011 resulted in a weapon found per race? What percentage of stops in 2011 resulted in an arrest or summons per race? In your resulting data table, each row should be a race, and there should be columns for `stops`, `percent_stops`, `percent_frisked`, `percent_wpnfound` , and `percent_arrestsumm`.\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# Write code here. \nsqf_2011_percents <- sqf_2011 |>\n filter(race_cat %in% c(\n \"BLACK\",\"WHITE-HISPANIC\", \"WHITE\", \"BLACK-HISPANIC\", \"ASIAN/PACIFIC ISLANDER\", \"AMERICAN INDIAN/ALASKAN NATIVE\")\n ) |> \n group_by(race_cat) |>\n summarise(stops = n(), \n percent_stops = n()/total_stops *100,\n percent_frisked = sum(frisked)/n() *100,\n percent_wpnfound = sum(wpnfound)/n() *100,\n percent_arrestsumm = sum(arrestsumm)/n() *100\n )\nsqf_2011_percents\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 6\n race_cat stops percent_stops percent_frisked percent_wpnfound\n <chr> <int> <dbl> <dbl> <dbl>\n1 AMERICAN INDIAN/ALASKAN… 2897 0.422 49.4 1.10 \n2 ASIAN/PACIFIC ISLANDER 23932 3.49 47.1 0.827\n3 BLACK 350743 51.1 57.9 1.06 \n4 BLACK-HISPANIC 48438 7.06 58.5 1.09 \n5 WHITE 61805 9.01 44.2 1.89 \n6 WHITE-HISPANIC 175302 25.6 56.7 1.16 \n# ℹ 1 more variable: percent_arrestsumm <dbl>\n```\n\n\n:::\n:::\n\n###### Question\n::: question\nBelow, in 2-3 sentences, summarize what you learn from reviewing the summary statistics from the code above. What does this tell us about the constitutionality of 2011 stop and frisk activity in NYC?\n:::\n\n::: {.cell document='show'}\n\n```{.r .cell-code}\n#Make a data visualization\n```\n:::\n\n# Any lab visualization is acceptable. \n\n::: {.cell}\n\n```{.r .cell-code}\nsqf_2011_percents |>\n filter(!is.na(race_cat) & race_cat != \"OTHER\") |> \n ggplot() +\n geom_col(aes(x=race_cat , y= percent_frisked)) +\n coord_flip()\n```\n\n::: {.cell-output-display}\n![](lab_4_solutions_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n",
"supporting": [
"lab_4_solutions_files"
],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading

0 comments on commit e19bb7a

Please sign in to comment.