forked from sierrah154/Final_Project-Aves
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Database.txt
71 lines (55 loc) · 4.6 KB
/
Database.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
### For ReadMe from Catherine Aznoe
## Database
For the database for the Aves project, the following steps were taken to source data and set up the database:
#### Sourcing Data
- Gathered csv data files for bird observations from Cornell Lab of Ornithology and obtained through the Global Biodiversity Information Facility.
- Deciding the scope of the data. The initial bird observation in California file was over 47 mil rows.
After considering which birds that are native to California and may be more susceptible to air quality factors,
I requested datasets for the 14 observed species of hummingbirds in California for the last ten years. This dataset
produced over 1.3 million rows.
- The air quality data was taken from pre-generated daily reports from the United States Environmental Protection Agency.
- Weather event data was gathered from the National Oceanic and Atmospheric Administration.
- County population data was gathered from the California Association of Counties.
- Our bird observation data was based on a geographic location and the remaining data was based on counties in California. In order to join, an API call on the Federal Communications Commission was done to gather the county name based on location. Over 93,000 locations were run through the API one at a time.
#### ETL Process
- CSV files from the gathered data were loaded and cleaned using Jupyter Notebook, Python and Pandas.
- Jupyter notebook was used for the API call on the FCC for the county JSON entry.
- For data load see: [eBird](data_load/Ebirds_Data_Load.ipynb), [air quality](data_load/Air_Quality_Data_Load.ipynb), [counties](data_load/counties.ipynb)
- Resulting CSVs exported from Jupyter Notebook were loaded into Amazon S3 for access by the team.
- Google Colab and Pyspark were used to read the csv files into DataFrames and prepared for transfer to the database.
- For Colab file see: [colab](database/aves.py)
#### Database
- Created a ERD with Quick Diagram for the database. [See ERD](database/QuickDBD-export.png)
- Created the tables in PGAdmin and tied to a RDS instance in AWS. [See Schema](database/schema.sql)
- Colab Notebook and Pyspark were used to read the dataframes into RDS and viewed with PGAdmin.
- The tables were joined together based on date and county/location data to tie air quality, weather events, and population to the counties where the hummingbirds were observed.
- Further work in the database was done using PGAdmin.
## Dashboard and Images
Tableau was used to create a few images as a part of our data analysis. For the dashboard, there is an image of it here
[Aves Dashboard](dashboard/Aves_dash_with_predictor.png) and a video of the working predictor here on Youtube: [Aves Dashboard Video](https://youtu.be/-utiJOYKQVk).
The Data that was analyzed was from hummingbird observations that were user recorded through the eBird app in California from 2009-2019. These observations were spread throughout the state.
![Hummingbird Observations in California](images/hb_locations.png)
The data showed that there was a growing trend in use of the eBird app over 2009-2019 in California as we can see in the increase in observations.
![Total Hummingbird Observations in California](images/obs_growth.png)
The eBird data was compared with air quality data from the EPA by comparing AQI measures and categories in the same county as the bird observation on the same day. The growth trend continued in observations on days with a Good AQI, but we did not see the trend continue on days with Unhealthy AQIs.
![Hummingbird Observations and AQI Categories in California](images/obs_aqi.png)
#### Requirements
Python 3.7
JSON
pandas
pathlib
glob
requests
Google Colab
findspark
os
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StringType, IntegerType, StructType, LongType, FloatType
from pyspark import SparkFiles
#### Data Citations
Global Biodiversity Information Facility, Citation GBIF.org (10 October 2020) GBIF Occurrence Download https://doi.org/10.15468/dl.2rhbnh
Global Biodiversity Information Facility, Citation GBIF.org (10 October 2020) GBIF Occurrence Download https://doi.org/10.15468/dl.fsvwgj
Environmental Protection Agency, Daily Summaries, https://aqs.epa.gov/aqsweb/airdata/download_files.html#Daily
National Oceanic and Atmospheric Administration, Storm Events Database, https://www.ncdc.noaa.gov/stormevents/choosedates.jsp?statefips=6%2CCALIFORNIA#
Federal Communication Commission (FCC), Census Block for Developers, https://geo.fcc.gov/api/census/#!/block/get_block_find
California Association of Counties, Data and Research, https://www.counties.org/data-and-research