These are example scrapers, ordered from the most minimal to the most complex.
minimal.py
This scraper demos the least amount of code needed to have a working scraper.
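For a rough idea of what a minimal scraper does outside of any framework, here is a standalone sketch that downloads one page and extracts a single field. The URL and the `title` field are placeholders, not what minimal.py actually uses.

```python
# Standalone sketch of a minimal scraper: download one page, extract one field.
# The URL and field are placeholders, not the ones used in minimal.py.
import requests
from bs4 import BeautifulSoup

def scrape(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return {"title": soup.title.get_text(strip=True) if soup.title else None}

if __name__ == "__main__":
    print(scrape("https://example.com"))
```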
multiple_extractors.py
Have two extractors for a single file, and use the extractor_name when saving the data.
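As a rough standalone sketch of the idea, two extraction functions run over the same downloaded file, with the extractor name in the output filename so the two result sets do not overwrite each other. The function names, selectors, and filenames are made up for illustration.

```python
# Sketch: run two extractors over one downloaded file and use the extractor
# name when saving, so each extractor's results go to its own output file.
# Selectors and filenames are illustrative, not those in multiple_extractors.py.
import json
from bs4 import BeautifulSoup

def extract_products(soup):
    return [{"name": tag.get_text(strip=True)} for tag in soup.select(".product-name")]

def extract_breadcrumbs(soup):
    return [{"crumb": tag.get_text(strip=True)} for tag in soup.select(".breadcrumb a")]

def run_extractors(html_path):
    with open(html_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    for extractor_name, extractor in {"products": extract_products,
                                      "breadcrumbs": extract_breadcrumbs}.items():
        with open(f"output_{extractor_name}.json", "w", encoding="utf-8") as out:
            json.dump(extractor(soup), out, indent=2)
```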
multiple_sources.py
Download the first source and use the extracted data to download the next, saving both source files.
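A standalone sketch of chaining sources, assuming requests and BeautifulSoup with placeholder URLs and selectors: data pulled from the first response builds the URL for the second, and both raw responses are written to disk.

```python
# Sketch: download source 1, extract a link from it, use that link to download
# source 2, and save both raw files. URLs and selectors are placeholders.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_chain():
    first_url = "https://example.com/categories"
    first = requests.get(first_url, timeout=30)
    first.raise_for_status()
    with open("source_1.html", "w", encoding="utf-8") as f:
        f.write(first.text)

    # Use data extracted from the first source to build the next request
    link = BeautifulSoup(first.text, "html.parser").select_one("a.category")
    second = requests.get(urljoin(first_url, link["href"]), timeout=30)
    second.raise_for_status()
    with open("source_2.html", "w", encoding="utf-8") as f:
        f.write(second.text)
```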
json_source.py
Download and parse a JSON file.
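In plain requests terms (the URL and the `items`/`id`/`name` keys are placeholders), that comes down to something like:

```python
# Sketch: download a JSON source, keep the raw file, and pull out the wanted fields.
import json
import requests

def scrape_json(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    data = response.json()  # parse the JSON body
    with open("source.json", "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)  # save the raw source file
    # The keys below stand in for whatever fields the real payload has
    return [{"id": item.get("id"), "name": item.get("name")}
            for item in data.get("items", [])]
```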
qa_results.py
Set rules around each field; during extraction, a check makes sure the extracted data complies with those rules. If it does not, the extraction fails, no data is saved, and the reason is logged as an error.
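The rule syntax in qa_results.py is the framework's own; as a rough standalone sketch of the same idea, with made-up field names and rules:

```python
# Sketch: per-field QA rules checked before anything is saved. If any row breaks
# a rule, nothing is saved and the reason is logged as an error.
import logging

logger = logging.getLogger(__name__)

RULES = {
    "title": {"required": True, "max_length": 200},  # illustrative rules
    "price": {"required": True, "type": float},
}

def check_row(row):
    for field, rule in RULES.items():
        value = row.get(field)
        if rule.get("required") and value is None:
            return f"{field} is required but missing"
        if value is not None and "type" in rule and not isinstance(value, rule["type"]):
            return f"{field} should be {rule['type'].__name__}"
        if value is not None and "max_length" in rule and len(str(value)) > rule["max_length"]:
            return f"{field} is longer than {rule['max_length']} characters"
    return None

def save_results(rows):
    for row in rows:
        reason = check_row(row)
        if reason is not None:
            logger.error("QA check failed: %s", reason)
            return  # extraction fails, no data saved
    # ... all rows passed the rules, safe to save ...
```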
gen_cookie_requests.py
Make a request to first generate a cookie, which is then used to build a new URL to get the data. This is for when the cookie can be generated with Python's requests library and Selenium is not needed. A new cookie is generated for each request that is made.
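A standalone sketch of that flow, with a made-up cookie name and URL pattern:

```python
# Sketch: make one request purely to obtain a cookie, then use the cookie's value
# to build the URL for the real data request. Cookie name and URLs are placeholders.
import requests

def scrape_with_cookie():
    session = requests.Session()
    session.get("https://example.com/start", timeout=30)  # this request sets the cookie
    token = session.cookies.get("session_token")

    # The cookie value is used to build the URL that returns the data
    response = session.get(f"https://example.com/data?token={token}", timeout=30)
    response.raise_for_status()
    return response.text
```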
dispatch_cookie_selenium.py
Generate cookies using Selenium (a full web browser) in the dispatcher, reuse them for a few requests, and generate new cookies as more tasks get dispatched.
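A standalone sketch of that pattern, assuming Selenium with Chrome and an arbitrary batch size of 10 requests per cookie set; the "dispatcher" here is just a loop over URLs:

```python
# Sketch: generate cookies with a real browser, reuse them for a batch of requests,
# and refresh them as more tasks are dispatched. Batch size and URLs are illustrative.
import requests
from selenium import webdriver

def get_cookies_with_browser():
    driver = webdriver.Chrome()
    try:
        driver.get("https://example.com")
        return {c["name"]: c["value"] for c in driver.get_cookies()}
    finally:
        driver.quit()

def dispatch(urls, requests_per_cookie=10):
    cookies = None
    for i, url in enumerate(urls):
        if i % requests_per_cookie == 0:
            cookies = get_cookies_with_browser()  # refresh cookies for the next batch
        response = requests.get(url, cookies=cookies, timeout=30)
        response.raise_for_status()
        # ... hand response.text off to the extractor ...
```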
page_trigger_download.py
Get multiple pages of product results, starting the rank at 1 instead of 0.
What this scraper demos could also be done by dispatching URLs with the page number in the URL, but for the sake of example we let the extractor trigger the next page (a standalone sketch follows the cons list below).
Pros:
- It will get as many pages as the site has. Great if that number is unknown.
- Needed if the next page is not just a simple page number in the url, therefore cannot be guessed on dispatch.
Cons:
- If a page fails to extract, no other pages will be dispatched.
- When the scrape starts, there is no way to know how many things it will scrape which makes the ETA unknown.
- Less control over the exact rate limit of the page downloads.
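A standalone sketch of the extractor-driven loop, with placeholder selectors: the loop keeps downloading as long as the page it just extracted has a next-page link.

```python
# Sketch: the extractor decides whether another page is downloaded, so the scraper
# follows pagination until the site runs out of pages. Selectors are placeholders.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_all_pages(start_url):
    results, rank, url = [], 1, start_url  # rank starts at 1, not 0
    while url:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        for tag in soup.select(".product-name"):
            results.append({"rank": rank, "name": tag.get_text(strip=True)})
            rank += 1

        # If extraction fails here, no further pages are triggered (the main con)
        next_link = soup.select_one("a.next-page")
        url = urljoin(url, next_link["href"]) if next_link else None
    return results
```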
page_dispatch.py
The difference from the previous example (page_trigger_download.py) is that we dispatch each page as its own task, which allows for a few things (see the sketch after the cons list below).
Pros:
- We control the rate of page downloads more precisely
- If a page fails to extract, other pages are still dispatched so data is only missing for that single page
- You know how much data to expect as soon as the scraper starts
Cons:
- If you hardcode a max page count and the site does not have that many pages, you will get a lot of 404s
- To be dynamic, you need to first get how many pages the site has, and if that fails then nothing may dispatch
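A standalone sketch of dispatching pages up front, with a placeholder pagination selector and `?page=` URL pattern: the page count is read once, then every page becomes its own task.

```python
# Sketch: read the total page count first, then dispatch one task per page.
# The selector and ?page= URL pattern are placeholders.
import requests
from bs4 import BeautifulSoup

def get_total_pages(base_url):
    response = requests.get(base_url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    last_page = soup.select_one(".pagination a:last-child")
    return int(last_page.get_text(strip=True)) if last_page else 1

def dispatch(base_url):
    # If reading the page count fails, nothing gets dispatched (the main con)
    total_pages = get_total_pages(base_url)
    # The number of tasks, and so the amount of expected data, is known up front
    return [f"{base_url}?page={page}" for page in range(1, total_pages + 1)]
```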