These are example scrapers, ordered from the most minimal to the most complex.
minimal.py
This scraper demos the least amount of code needed to have a working scraper.
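For a rough idea of what a minimal scraper does outside of any framework, here is a standalone sketch that downloads one page and extracts a single field. The URL and the `title` field are placeholders, not what minimal.py actually uses.

```python
# Standalone sketch of a minimal scraper: download one page, extract one field.
# The URL and field are placeholders, not the ones used in minimal.py.
import requests
from bs4 import BeautifulSoup

def scrape(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return {"title": soup.title.get_text(strip=True) if soup.title else None}

if __name__ == "__main__":
    print(scrape("https://example.com"))
```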
multiple_extractors.py
Have two extractors for a single file, and use the extractor_name when saving the data.
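As a rough standalone sketch of the idea, two extraction functions run over the same downloaded file, with the extractor name in the output filename so the two result sets do not overwrite each other. The function names, selectors, and filenames are made up for illustration.

```python
# Sketch: run two extractors over one downloaded file and use the extractor
# name when saving, so each extractor's results go to its own output file.
# Selectors and filenames are illustrative, not those in multiple_extractors.py.
import json
from bs4 import BeautifulSoup

def extract_products(soup):
    return [{"name": tag.get_text(strip=True)} for tag in soup.select(".product-name")]

def extract_breadcrumbs(soup):
    return [{"crumb": tag.get_text(strip=True)} for tag in soup.select(".breadcrumb a")]

def run_extractors(html_path):
    with open(html_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    for extractor_name, extractor in {"products": extract_products,
                                      "breadcrumbs": extract_breadcrumbs}.items():
        with open(f"output_{extractor_name}.json", "w", encoding="utf-8") as out:
            json.dump(extractor(soup), out, indent=2)
```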
multiple_sources.py
Download the first source and use the extracted data to download the next, saving both source files.
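A standalone sketch of chaining sources, assuming requests and BeautifulSoup with placeholder URLs and selectors: data pulled from the first response builds the URL for the second, and both raw responses are written to disk.

```python
# Sketch: download source 1, extract a link from it, use that link to download
# source 2, and save both raw files. URLs and selectors are placeholders.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_chain():
    first_url = "https://example.com/categories"
    first = requests.get(first_url, timeout=30)
    first.raise_for_status()
    with open("source_1.html", "w", encoding="utf-8") as f:
        f.write(first.text)

    # Use data extracted from the first source to build the next request
    link = BeautifulSoup(first.text, "html.parser").select_one("a.category")
    second = requests.get(urljoin(first_url, link["href"]), timeout=30)
    second.raise_for_status()
    with open("source_2.html", "w", encoding="utf-8") as f:
        f.write(second.text)
```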
json_source.py
Download and parse a JSON file.
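In plain requests terms (the URL and the `items`/`id`/`name` keys are placeholders), that comes down to something like:

```python
# Sketch: download a JSON source, keep the raw file, and pull out the wanted fields.
import json
import requests

def scrape_json(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    data = response.json()  # parse the JSON body
    with open("source.json", "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)  # save the raw source file
    # The keys below stand in for whatever fields the real payload has
    return [{"id": item.get("id"), "name": item.get("name")}
            for item in data.get("items", [])]
```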
qa_results.py
Set rules around each field; during extraction, a check makes sure the extracted data complies with those rules. If it does not, the extraction fails, no data is saved, and the reason is logged as an error.
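The rule syntax in qa_results.py is the framework's own; as a rough standalone sketch of the same idea, with made-up field names and rules:

```python
# Sketch: per-field QA rules checked before anything is saved. If any row breaks
# a rule, nothing is saved and the reason is logged as an error.
import logging

logger = logging.getLogger(__name__)

RULES = {
    "title": {"required": True, "max_length": 200},  # illustrative rules
    "price": {"required": True, "type": float},
}

def check_row(row):
    for field, rule in RULES.items():
        value = row.get(field)
        if rule.get("required") and value is None:
            return f"{field} is required but missing"
        if value is not None and "type" in rule and not isinstance(value, rule["type"]):
            return f"{field} should be {rule['type'].__name__}"
        if value is not None and "max_length" in rule and len(str(value)) > rule["max_length"]:
            return f"{field} is longer than {rule['max_length']} characters"
    return None

def save_results(rows):
    for row in rows:
        reason = check_row(row)
        if reason is not None:
            logger.error("QA check failed: %s", reason)
            return  # extraction fails, no data saved
    # ... all rows passed the rules, safe to save ...
```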
gen_cookie_requests.py
Make a request to first generate a cookie, which is then used to build a new URL to get the data. This is for when the cookie can be generated with Python's requests library and Selenium is not needed. A new cookie is generated for each request that is made.
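A standalone sketch of that flow, with a made-up cookie name and URL pattern:

```python
# Sketch: make one request purely to obtain a cookie, then use the cookie's value
# to build the URL for the real data request. Cookie name and URLs are placeholders.
import requests

def scrape_with_cookie():
    session = requests.Session()
    session.get("https://example.com/start", timeout=30)  # this request sets the cookie
    token = session.cookies.get("session_token")

    # The cookie value is used to build the URL that returns the data
    response = session.get(f"https://example.com/data?token={token}", timeout=30)
    response.raise_for_status()
    return response.text
```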
dispatch_cookie_selenium.py
Generate cookies using Selenium (a full web browser) in the dispatcher, reuse them for a few requests, and generate new cookies as more tasks get dispatched.
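A standalone sketch of that pattern, assuming Selenium with Chrome and an arbitrary batch size of 10 requests per cookie set; the "dispatcher" here is just a loop over URLs:

```python
# Sketch: generate cookies with a real browser, reuse them for a batch of requests,
# and refresh them as more tasks are dispatched. Batch size and URLs are illustrative.
import requests
from selenium import webdriver

def get_cookies_with_browser():
    driver = webdriver.Chrome()
    try:
        driver.get("https://example.com")
        return {c["name"]: c["value"] for c in driver.get_cookies()}
    finally:
        driver.quit()

def dispatch(urls, requests_per_cookie=10):
    cookies = None
    for i, url in enumerate(urls):
        if i % requests_per_cookie == 0:
            cookies = get_cookies_with_browser()  # refresh cookies for the next batch
        response = requests.get(url, cookies=cookies, timeout=30)
        response.raise_for_status()
        # ... hand response.text off to the extractor ...
```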
page_trigger_download.py
Get multiple pages of product results, starting the rank at 1 instead of 0.
What this scraper demos could also be done by dispatching URLs with the page number in the URL, but for the sake of example we let the extractor trigger the next page (a standalone sketch follows the cons list below).
Pros:
- It will get as many pages as the site has. Great if that number is unknown.
- Needed if the next page is not just a simple page number in the url, therefore cannot be guessed on dispatch.
Cons:
- If a page fails to extract, no other pages will be dispatched.
- When the scrape starts, there is no way to know how many things it will scrape which makes the ETA unknown.
- Less control over the exact rate limit of the page downloads.
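A standalone sketch of the extractor-driven loop, with placeholder selectors: the loop keeps downloading as long as the page it just extracted has a next-page link.

```python
# Sketch: the extractor decides whether another page is downloaded, so the scraper
# follows pagination until the site runs out of pages. Selectors are placeholders.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_all_pages(start_url):
    results, rank, url = [], 1, start_url  # rank starts at 1, not 0
    while url:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        for tag in soup.select(".product-name"):
            results.append({"rank": rank, "name": tag.get_text(strip=True)})
            rank += 1

        # If extraction fails here, no further pages are triggered (the main con)
        next_link = soup.select_one("a.next-page")
        url = urljoin(url, next_link["href"]) if next_link else None
    return results
```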
page_dispatch.py
The difference from the previous example (page_trigger_download.py) is that we dispatch each page as its own task, which allows for a few things (see the sketch after the cons list below).
Pros:
- We control the rate of page downloads more precisely
- If a page fails to extract, other pages are still dispatched so data is only missing for that single page
- You know how much data to expect as soon as the scraper starts
Cons:
- If you hardcode a max page count and the site does not have that many pages, you will get a lot of 404s
- To be dynamic, you need to first get how many pages the site has, and if that fails then nothing may dispatch
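A standalone sketch of dispatching pages up front, with a placeholder pagination selector and `?page=` URL pattern: the page count is read once, then every page becomes its own task.

```python
# Sketch: read the total page count first, then dispatch one task per page.
# The selector and ?page= URL pattern are placeholders.
import requests
from bs4 import BeautifulSoup

def get_total_pages(base_url):
    response = requests.get(base_url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    last_page = soup.select_one(".pagination a:last-child")
    return int(last_page.get_text(strip=True)) if last_page else 1

def dispatch(base_url):
    # If reading the page count fails, nothing gets dispatched (the main con)
    total_pages = get_total_pages(base_url)
    # The number of tasks, and so the amount of expected data, is known up front
    return [f"{base_url}?page={page}" for page in range(1, total_pages + 1)]
```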