You’re going to program a legal data scraper and process a sample data file. For example, you could be using Python to turn a PDF of police activities into JSON, or making recurring API calls to pull down files.
- It's legal.
- We can run the scraper by running one script, called `scraper.py` or at least beginning with `scraper-`.
- Populate the readme for your scraper with as much helpful information as you can!
- The config file appropriately references a dataset.
- Include a truncated version of some sample data so we understand what is generated.
- Ensure you have a `schema.json` file and a blank `etl.py` file in your scraper directory, and call `import etl` at the end of your scraper. Read more about Scraper Schemas here.
- Stick to the format of `USA/$STATE/$COUNTY/$RECORD_TYPE`.
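
Several items in the checklist above are about which files exist, so they can be checked before you submit. The following is a minimal sketch, not part of the project's tooling: the function name `missing_items` is hypothetical, and treating any file whose name starts with "readme" as the readme is an assumption. The config file is not checked because its name is not given here.

```python
from pathlib import Path


def missing_items(scraper_dir):
    """Return the checklist items that appear to be missing from scraper_dir."""
    root = Path(scraper_dir)
    names = [p.name for p in root.iterdir() if p.is_file()]
    problems = []

    # One runnable script: scraper.py, or a file whose name begins with "scraper-".
    if not any(n == "scraper.py" or n.startswith("scraper-") for n in names):
        problems.append("scraper script")

    # schema.json and etl.py are both named in the checklist.
    for required in ("schema.json", "etl.py"):
        if required not in names:
            problems.append(required)

    # Assumption: the readme is any file whose name starts with "readme".
    if not any(n.lower().startswith("readme") for n in names):
        problems.append("readme")

    return problems
```

Calling `missing_items(".")` from inside your scraper directory returns an empty list when none of these file checks fail. It says nothing about legality or the quality of the readme, which still need a human look.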
Navigate to our Datasets repo in DoltHub or the PostgreSQL mirror and find a source to scrape. If you have a particular dataset in mind, you may need to add the dataset yourself. This takes about 5–10 minutes.
- Clone this repository. Don't know how?
- `cd` into the `/setup_gui` directory.
- Follow through the GUI.
  - Mac: run the script with `python3 ScraperSetup.py`
  - Windows: run the executable by double clicking it.
  - @Pythonidaer made an excellent walkthrough of the GUI as of the v0.0.1 release.
- Copy the resulting folder into your clone of `PDAP-Scrapers`.
Check the `/common` folder and the `/Base_Scripts` folder for code you can reuse.
Why start from scratch if we have a useful library? Keep in mind that we can always refactor your work later if necessary, so if you're not sure, we still want you to submit!
The most important thing here is that your scraper is grabbing public police data, and is legal.
Make sure you follow this guideline for creating folders:
    COUNTRY/
        STATE/
            COUNTY/
                DEPARTMENT_TYPE (CITY) (COUNTY) (COLLEGE) (STATE) (FEDERAL)/
                    DEPARTMENT-X-NAME/
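
To make the folder guideline concrete, here is a small sketch that assembles a path in that order and rejects a department type that is not one of the five listed. The helper name `scraper_folder` and the example values are hypothetical, and the sketch assumes the levels nest in the order shown. It does not enforce any casing or spacing convention for state, county, or department names, since none is stated here.

```python
from pathlib import PurePosixPath

# The five department types listed in the folder guideline.
DEPARTMENT_TYPES = {"CITY", "COUNTY", "COLLEGE", "STATE", "FEDERAL"}


def scraper_folder(country, state, county, department_type, department_name):
    """Build COUNTRY/STATE/COUNTY/DEPARTMENT_TYPE/DEPARTMENT-X-NAME as a path."""
    if department_type.upper() not in DEPARTMENT_TYPES:
        raise ValueError(f"unknown department type: {department_type!r}")
    # Components are used exactly as given; only the department type is validated.
    return PurePosixPath(country, state, county, department_type, department_name)


# Hypothetical example values, for illustration only.
print(scraper_folder("USA", "CA", "Alameda", "CITY", "Example-PD"))
# prints USA/CA/Alameda/CITY/Example-PD
```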
What kind of data are we scraping?
Police data that's already made public by a government jurisdiction.
What languages are allowed?
Python is preferred. If you use another language, we may not be able to easily fold it into our infrastructure.
Are there any specific formatting guidelines I should adhere to?
For now, if you use Python, try to stick with PEP 8 formatting. A good formatter for this is Black.