Data Cleaning, Data Analysis and Data Exploration
Final Project Report - https://docs.google.com/document/d/10eXqXr4Je853OHkP8QWWdvTtgI-MkoKViFRyaKvKYQk/edit#
Google Doc Data Cleaning - https://docs.google.com/document/d/1SmbZb3nXtPsxA4_KVE0zSXGsQbahm659BlG7nZxJiVY/
Crime Dataset Link - https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i
Steps to Reproduce Results
-
Download the dataset from the Crime Dataset Link above and rename the file to 'NYPD_Complaint_Data_Historic.csv'.
-
Log in to the Hadoop cluster and create a directory named project.
-
Copy the CSV file from your local machine to the project directory on the Hadoop cluster -
Command: scp -r NYPD_Complaint_Data_Historic.csv [email protected]:/home/NetID/project/
Also copy the scripts you want to run to the Hadoop cluster.
-
Put the CSV file into the Hadoop file system -
Command: hfs -put NYPD_Complaint_Data_Historic.csv
-
Run the script for an individual column (e.g., column 10) -
Command: spark-submit --py-files=helper.py col10.py NYPD_Complaint_Data_Historic.csv
-
To obtain the cleaned CSV file with all columns -
Run the script: ./execute.sh
Then run the command: spark-submit --py-files=helper.py merge.py NYPD_Complaint_Data_Historic.csv
-
Retrieve the cleaned CSV file from the Hadoop file system with the command below -
Command: hfs -getmerge data.csv cleaned.csv
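To make the per-column cleaning step more concrete, here is a minimal plain-Python sketch of one way a single column could be checked. It is not the project's col10.py or helper.py (those run on Spark and their contents are not shown here). The function names, the NULL/VALID/INVALID labels, and the "whole number" rule are all assumptions made up for illustration.

```python
import csv
import io
from collections import Counter


def label_value(value, is_valid):
    """Return 'NULL', 'VALID', or 'INVALID' for one raw cell (hypothetical labels)."""
    text = (value or "").strip()
    if not text:
        return "NULL"
    return "VALID" if is_valid(text) else "INVALID"


def summarize_column(lines, column_index, is_valid):
    """Count the labels for one column of a CSV that has a header row."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    counts = Counter()
    for row in reader:
        # A short row has no cell at this position; treat it as missing.
        cell = row[column_index] if column_index < len(row) else ""
        counts[label_value(cell, is_valid)] += 1
    return counts


def looks_like_integer(text):
    """Hypothetical validity rule: the cell must be a whole number."""
    return text.isdigit()


# Made-up sample rows, not taken from the NYPD dataset.
sample = io.StringIO("id,code\n1,101\n2,\n3,abc\n")
print(summarize_column(sample, 1, looks_like_integer))
```

Passing the validity rule in as a function keeps the counting logic the same for every column, so only the rule would change from one column to the next.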
The complete data analysis was performed on the cleaned CSV file produced by the data cleaning step.
The steps to generate the cleaned CSV file are described in Part 1 above.
Prerequisites - Python, pandas, Jupyter Notebook, Matplotlib, NumPy
Steps to Reproduce Results
-
Upload the cleaned CSV file obtained from Part 1 to the Hadoop cluster -
Command: hfs -put cleaned.csv
-
Run the scripts that generate the data used for plotting. The scripts can be downloaded from the Data Analysis/scripts folder -
Command: spark-submit --py-files=helper.py crimes_by_year_month.py cleaned.csv
-
Copy the data generated by the scripts from the Hadoop cluster to the Data Analysis/data folder on your local machine -
Command: scp -r [email protected]:/home/NetID/project/DataAnalysis/* .
-
Start Jupyter Notebook from the DataAnalysis folder -
Command: jupyter notebook
-
Open any of the ipynb files (e.g., crimes_by_year.ipynb) and run it to generate the plots.
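The kind of aggregation a script such as crimes_by_year_month.py produces can be pictured with the plain-Python sketch below. It is not the project's script, which runs on Spark. It assumes that cleaned.csv has a header row and keeps the dataset's CMPLNT_FR_DT column in MM/DD/YYYY format. The function signature and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def crimes_by_year_month(lines, date_column="CMPLNT_FR_DT", date_format="%m/%d/%Y"):
    """Count rows per (year, month); rows with unparseable dates are skipped."""
    counts = Counter()
    for row in csv.DictReader(lines):
        raw = (row.get(date_column) or "").strip()
        try:
            parsed = datetime.strptime(raw, date_format)
        except ValueError:
            continue  # empty or malformed date
        counts[(parsed.year, parsed.month)] += 1
    return counts


# Made-up sample rows, not taken from the real dataset.
sample = io.StringIO(
    "CMPLNT_NUM,CMPLNT_FR_DT\n"
    "1,01/15/2015\n"
    "2,01/20/2015\n"
    "3,07/04/2015\n"
    "4,\n"
)
for (year, month), total in sorted(crimes_by_year_month(sample).items()):
    print(f"{year}-{month:02d}\t{total}")
```

The resulting (year, month) counts are the sort of small table that a notebook can load and plot directly.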
-
Weather - The crime rate is higher in summer than in winter.
Monthly weather data for New York was collected from http://www.holiday-weather.com/new_york_city/averages
The monthly crime counts generated during the analysis were used to test this hypothesis.
-
Poverty - The crime rate increases as poverty increases.
Yearly poverty data for the New York boroughs was collected from http://www1.nyc.gov/site/opportunity/poverty-in-nyc/data-tool.page
The yearly crime counts per borough generated during the analysis were used to test this hypothesis.
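One simple way to compare a crime series against a weather or poverty series is a Pearson correlation coefficient. The sketch below is an assumption about how such a check could be done, not the method used in the project's notebooks. The numbers are placeholders, not real weather, poverty, or crime data. A coefficient near +1 would be consistent with either hypothesis, but it would not by itself establish a cause.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder numbers for illustration only, not real measurements.
toy_temperature = [1.0, 2.0, 3.0, 4.0]
toy_crime_count = [10.0, 20.0, 30.0, 40.0]
print(pearson(toy_temperature, toy_crime_count))
```

For the weather hypothesis the two inputs would be monthly averages and monthly crime counts. For the poverty hypothesis they would be yearly poverty rates and yearly crime counts for one borough.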