- Create a new directory where the scraper will live and add the following files:
- Next install this library from pypi:
pip install scraperx
- Run the full scraper by running
python your_scraper.py dispatch
- To see the arguments for the command:
python your_scraper.py dispatch -h
- See all the commands available:
python your_scraper.py -h
Sample scrapers can be found in the examples folder of this repo
Any time the scraper needs to override the base's __init__, always pass in *args and **kwargs like so:
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
This is a dict of values that is passed to each step of the process. The scraper can put anything it may need here. A few built-in values are not required, but are used if you do supply them:
- headers: Dict of headers to use for each request
- proxy: Full proxy string to be used
- proxy_country: Used to get a proxy for this region. If this and proxy are not set, a random proxy will be used.
- device_type: Used when setting a user-agent if one was not set. Options are desktop or mobile
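As an illustration, a task carrying the optional built-in keys might look like the sketch below. The id and url keys and all of the values are made up for this example; only headers, proxy, proxy_country and device_type come from the list above.

```python
# Hypothetical task dict; 'id' and 'url' are invented scraper-specific keys.
task = {
    'id': 1,
    'url': 'https://example.com/search?q=shoes',
    'headers': {'Accept-Language': 'en-US'},
    'proxy_country': 'US',    # only consulted when 'proxy' is not set
    'device_type': 'mobile',  # 'desktop' or 'mobile'
}

# Same fallback the text describes: a missing device_type means desktop.
device_type = task.get('device_type', 'desktop')
print(device_type)
```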
Uses a requests.Session to make get and post requests.
The __init__ of the BaseDownload class can take the following args:
- task: Required. The task from the dispatcher
- headers: (Named arg) dict to set headers for the whole session. default: random User-Agent for the device type, will use desktop if no device type is set
- proxy: (Named arg) Proxy string to use for the requests
- ignore_codes: (Named arg) List of HTTP Status codes to not retry on. If these codes are seen, it will treat the request as any other success.
When using BaseDownloader, a requests session is created under self.session, so every get/post you make will use the same session per task.
Headers can also be set per call by passing the keyword args to self.request_get() and self.request_post(). Any kwargs you pass to self.request_get/post will be passed to the session's get/post methods.
BaseDownloader's get and post functions use the requests session created in init and return a python requests response object.
A request will be tried up to n times (3 by default) to get a successful status code. On each retry it will try to call a function named new_profile(), where you have the chance to switch the headers/proxy the request is using (will only update for that request?). If that function does not exist, it will try again with the same data.
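The retry behaviour described above can be sketched without any network access. This is not the library's implementation: fetch_with_retries and its parameters are hypothetical names, the responses are fakes, and ignored codes are simply returned as successes here.

```python
from types import SimpleNamespace


def fetch_with_retries(make_request, max_tries=3, ignore_codes=(), new_profile=None):
    """Call make_request() until it returns a good status or tries run out.

    make_request: callable returning an object with a .status_code attribute.
    new_profile: optional callable invoked before each retry.
    """
    response = None
    for attempt in range(1, max_tries + 1):
        response = make_request()
        code = response.status_code
        # Ignored codes are treated like any other success in this sketch.
        if code < 400 or code in ignore_codes:
            return response
        # Give the caller a chance to switch headers/proxy before retrying.
        if new_profile is not None and attempt < max_tries:
            new_profile()
    return response


# Fake responses: the first attempt fails, the second succeeds.
responses = iter([SimpleNamespace(status_code=503), SimpleNamespace(status_code=200)])
profile_switches = []
result = fetch_with_retries(lambda: next(responses),
                            new_profile=lambda: profile_switches.append('switched'))
print(result.status_code, len(profile_switches))
```

If new_profile is left as None, the same request is simply repeated, which mirrors the fallback described above.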
There are a few custom arguments that can be passed into the self.request_* functions that this sdk will use. All others will be passed to the requests method's call.
Named arguments:
- max_tries: Default=3. Type int. The number of times a request will be tried. On each try it will try to get a new proxy and User-Agent
- custom_source_checks: Default=None. Type list of lists. Used to set the request to a set status code based on a regex that runs on the page source.
  - Example: [(re.compile(r'captcha', re.I), 403, 'Captcha Found')]
  - This checks whether the word captcha appears in the page source and, if so, sets that response's status code to 403 with the status message Captcha Found. The status message is there so you know whether it is a real 403 or your custom status.
When using self.request_*, it will return a normal requests response. If using custom source checks, response.reason will be set to the custom message passed in. This is useful if there are multiple ways a custom 403 can happen and you need to take different actions depending on why.
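To make the idea of custom source checks concrete, here is a minimal standalone sketch of how a list of (regex, status code, message) entries could be applied to a page source. The function apply_source_checks is a hypothetical helper written for this example, not part of the library.

```python
import re


def apply_source_checks(source, status_code, reason, checks):
    """Return (status_code, reason), overridden by the first regex that matches."""
    for pattern, new_code, new_reason in checks or []:
        if pattern.search(source):
            return new_code, new_reason
    return status_code, reason


checks = [(re.compile(r'captcha', re.I), 403, 'Captcha Found')]
blocked = apply_source_checks('<html>Please solve this CAPTCHA</html>', 200, 'OK', checks)
normal = apply_source_checks('<html>Results</html>', 200, 'OK', checks)
print(blocked, normal)
```

Because the custom message travels with the status code, code that handles a 403 can tell a captcha page apart from a genuine 403.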
This is required for the extractor to run on the downloaded data. Inside of self.download() just call self.save_request(r) on the request that was made. This will add the source file to a list of saved sources that will be passed to the extractor for parsing.
Some keyword arguments that can be passed into self.save_request:
- template_values {dict} - Additional keys to use in the template
- filename {str} - Override the filename from the template_name in the config
These exceptions will be raised when calling self.request_*. They will be caught safely so the scraper does not need to catch them. But if the scraper wants to act on the exception, it can wrap its self.request_* call in a try/except.
- scraperx.exceptions.DownloadValueError: If there is an exception that is not caught by the others
- scraperx.exceptions.HTTPIgnoreCodeError: When the status code of the request is found in the ignore_codes argument of BaseDownload
- requests.exceptions.HTTPError: When the request returns a non-successful status code that was not found in ignore_codes
Headers and proxies set in self.request_get/request_post are combined with the ones set in the __init__, and override them if the key is the same.
self.request_get/request_post kwargs headers/proxy
will override
self.task[headers/proxy]
will override
init kwargs headers/proxy
Any header/proxy set on the request (get/post/etc) will only be set for that single request. For those values to be set in the session they must be set from the init or be in the task data.
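The override order above amounts to a layered dict merge. The sketch below shows it for headers only; resolve_headers is a hypothetical helper name and the header values are invented.

```python
def resolve_headers(init_headers=None, task_headers=None, request_headers=None):
    """Merge header dicts; later sources win when a key is repeated."""
    merged = {}
    # Lowest priority first: init kwargs, then task data, then per-request kwargs.
    for source in (init_headers, task_headers, request_headers):
        merged.update(source or {})
    return merged


headers = resolve_headers(
    init_headers={'User-Agent': 'init-agent', 'Accept': 'text/html'},
    task_headers={'User-Agent': 'task-agent'},
    request_headers={'Accept': 'application/json'},
)
print(headers)
```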
If you have a list of proxies that the downloader should auto rotate between they can be saved in a csv in the following format:
proxy,country
http://user:[email protected]:5500,US
http://user:[email protected]:5501,US
http://user:[email protected]:6501,DE
Set the env var PROXY_FILE to the path of the above csv for the scraper to load it in.
If you have not passed in a proxy directly in the task and this proxy csv exists, then it will pull a random proxy from this file. It will use the proxy_country, if set in the task data, to select the correct country to proxy to.
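A rough sketch of how a proxy could be chosen from a csv in that format. This is an assumption about the selection logic, not the library's code; load_proxies, pick_proxy and the proxy URLs are invented for the example, and an in-memory string stands in for the file PROXY_FILE would point to.

```python
import csv
import io
import random


def load_proxies(csv_file):
    """Read rows with 'proxy' and 'country' columns into a list of dicts."""
    return list(csv.DictReader(csv_file))


def pick_proxy(proxies, country=None):
    """Pick a random proxy, restricted to `country` when one is given."""
    if country is not None:
        proxies = [p for p in proxies if p['country'] == country]
    return random.choice(proxies)['proxy'] if proxies else None


# In-memory stand-in for the csv file on disk.
sample = io.StringIO(
    'proxy,country\n'
    'http://proxy-a.example.com:5500,US\n'
    'http://proxy-b.example.com:6501,DE\n'
)
proxies = load_proxies(sample)
german_proxy = pick_proxy(proxies, country='DE')
print(german_proxy)
```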
If you have not directly set a user-agent, a random one will be pulled based on the device_type in the task data.
If device_type is not set, it will default to use a desktop user-agent.
To set your own list of user-agents to choose from, create a csv in the following format:
device_type,user_agent
desktop,"Some User Agent for desktop"
desktop,"Another User Agent for desktop"
mobile,"Now one for mobile"
Set the env var UA_FILE to the path of the above csv for the scraper to load it in.
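The device_type fallback for user-agents can be sketched the same way. The helper names below are hypothetical and the csv is read from an in-memory string rather than from the path in UA_FILE.

```python
import csv
import io
import random
from collections import defaultdict


def load_user_agents(csv_file):
    """Group the user_agent column by device_type."""
    by_device = defaultdict(list)
    for row in csv.DictReader(csv_file):
        by_device[row['device_type']].append(row['user_agent'])
    return by_device


def pick_user_agent(by_device, task):
    """Pick a random user-agent for the task's device_type (desktop if unset)."""
    device_type = task.get('device_type', 'desktop')
    return random.choice(by_device[device_type])


sample = io.StringIO(
    'device_type,user_agent\n'
    'desktop,"Some User Agent for desktop"\n'
    'mobile,"Now one for mobile"\n'
)
user_agents = load_user_agents(sample)
mobile_ua = pick_user_agent(user_agents, {'device_type': 'mobile'})
default_ua = pick_user_agent(user_agents, {})
print(mobile_ua)
print(default_ua)
```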
self.pre_extract()
- User can override to do their own setup after the __init__ and before any extraction happens
self.find_css_elements(source, css_selectors)
- source - Parsel object to run the css selectors on
- css_selectors - A list of css selectors to try and extract the data
Returns a Parsel element from the first css selector that returns data.
This snippet would be in the scraper's MyScraperExtract(Extract) class, used in the method that is extracting the data.
title_selectors = ['h3',
                   'span.title',
                   ]
result['title'] = self.find_css_elements(element, title_selectors)\
                      .xpath('string()').extract_first().strip()
There are a few built in parsers that can assist with extracting some types of data
from scraperx import parsers
###
# Price
###
# This will parse the price out of a string and return the low and high values as floats
raw_p1_str = '15,48€'
p1 = parsers.price(raw_p1_str)
# p1 = {'low': 15.48, 'high': None}
raw_p2_str = '1,999'
p2 = parsers.price(raw_p2_str)
# p2 = {'low': 1999.0, 'high': None}
raw_p3_str = '$49.95 - $99.99'
p3 = parsers.price(raw_p3_str)
# p3 = {'low': 49.95, 'high': 99.99}
###
# Rating
###
# Parse the rating from a string
# Examples: https://regex101.com/r/ChmgmF/3
raw_r1_str = '4.4 out of 5 stars'
r1 = parsers.rating(raw_r1_str)
# r1 = 4.4
raw_r2_str = 'An average of 4.1 star'
r2 = parsers.rating(raw_r2_str)
# r2 = 4.1
If there are more cases you would like these parsers to catch please open up an issue with the use case you are trying to parse.
When updating the extractors there is a chance that it will not work with the previous source files. So having a source and its QA'd data file is useful to test against to verify that data is still extracting correctly.
- Run python your_scraper.py create-test path_to/metadata_source_file
  - The input file is the *_metadata.json file that gets created when you run the scraper and it downloads the source files.
  - This will copy the metadata file and the sources into the directory tests/sample_data/your_scraper/ using the time the source was downloaded (from the metadata) as the file name.
  - It also creates extracted qa files for each of the sources based on your extractors.
  - It extracts the data in json format to make it easy to qa and read.
- The QA files it created will have _extracted_(qa)_ in the file name. What you have to do is confirm that all values are correct in that file. If everything looks good, rename the file from _extracted_(qa)_ to _extracted_qa_. This will let the system know that the file has been checked and that it is the data to compare against when testing.
- Create an empty file tests/__init__.py. This is needed for the tests to run.
- Next is to create the code that will run the tests. Create the file tests/tests.py with the contents below
import unittest  # The testing framework to use
from scraperx.test import ExtractorBaseTest  # Does all the heavy lifting for the test
from your_scraper import scraper as my_scraper  # The scraper's Scraper class
# If you have multiple scrapers, then import their extract classes here as well


# This test will loop through all the test files for the scraper
class YourScraper(ExtractorBaseTest.TestCase):

    def __init__(self, *args, **kwargs):
        # The directory that the test files for your scraper are in
        data_dir = 'tests/sample_data/your_scraper'
        # ignore_keys will not test the qa values to the current extracted test value.
        # This is most useful when dealing with timestamps or other values that
        # will change each time the data is extracted
        super().__init__(data_dir, my_scraper, ignore_keys=['time_extracted'], *args, **kwargs)

# If you have multiple scrapers, then create a class for each
# Feel free to include any other unit tests you may want to run as well
- Running the tests
python -m unittest discover -vv
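To show what ignore_keys is for, the sketch below compares a QA'd record with a freshly extracted one while skipping keys that change on every run. It only illustrates the idea; diff_ignoring_keys and the sample records are invented and say nothing about how ExtractorBaseTest is implemented.

```python
def diff_ignoring_keys(qa_item, extracted_item, ignore_keys=()):
    """Return {key: (qa_value, extracted_value)} for every key that differs."""
    keys = (set(qa_item) | set(extracted_item)) - set(ignore_keys)
    return {
        key: (qa_item.get(key), extracted_item.get(key))
        for key in sorted(keys)
        if qa_item.get(key) != extracted_item.get(key)
    }


qa = {'title': 'Blue Shoes', 'price': 49.95, 'time_extracted': '2019-01-01T00:00:00'}
current = {'title': 'Blue Shoes', 'price': 59.95, 'time_extracted': '2019-06-01T00:00:00'}
differences = diff_ignoring_keys(qa, current, ignore_keys=['time_extracted'])
print(differences)
```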
3 Ways of setting config values:
- CLI Argument: Will override any other type of config value. Use -h to see available options
- Environment variable: Will override a config value in the yaml
- Yaml file: Will use these values if no other way is set for a key
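The precedence in the list above can be pictured as a lookup that checks each source in turn. This is a sketch only: resolve_config is a hypothetical helper, the key names are chosen for illustration, and plain dicts stand in for the parsed CLI arguments, os.environ and the loaded yaml.

```python
def resolve_config(key, cli_args, env, yaml_values, default=None):
    """Return the first value found, checking CLI, then environment, then yaml."""
    for source in (cli_args, env, yaml_values):
        if source.get(key) is not None:
            return source[key]
    return default


cli_args = {'DISPATCH_LIMIT': None}  # flag not passed on the command line
env = {'DISPATCH_LIMIT': '10'}       # stand-in for os.environ
yaml_values = {'DISPATCH_LIMIT': 5, 'DISPATCH_SERVICE_NAME': 'local'}

limit = resolve_config('DISPATCH_LIMIT', cli_args, env, yaml_values)
service = resolve_config('DISPATCH_SERVICE_NAME', cli_args, env, yaml_values)
print(limit, service)
```

Note that values read from the environment arrive as strings, so a real implementation would still need to convert them to the expected type.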
If you are using any aws service for any part of this, it will use the boto3 library and will try to get credentials from the system.
# config.yaml
# This is a config file with all config values
# Required fields are marked as such
default:
  dispatch:
    limit: 5  # Default None. Max number of tasks to dispatch. If not set, all tasks will run
    service:
      # This is where both the download and extractor services will run
      name: local  # (local, sns) Default: local
      sns_arn: sns:arn:of:service:to:trigger  # Required if `name` is sns, if local this is not needed
    ratelimit:
      type: qps  # (qps, period) Required. `qps`: Queries per second to dispatch the tasks at. `period`: The time in hours to dispatch all of the tasks in.
      value: 1  # Required. Can be an int or a float. When using period, value is in hours
  downloader:
    save_metadata: true  # (true, false) Default: true. If false, a metadata file will NOT be saved with the downloaded source.
    save_data:
      service: local  # (local, s3) Default: local
      # Required if `service` is s3, if local these are not needed
      bucket_name: my-downloaded-data-bucket
      endpoint_url: https://s3.my-server.com  # Used if not using AWS s3 bucket
      # Only needed if aws creds are not setup on the system or you want to not use the system creds
      aws_access_key_id: abcde  # Auth key to access the s3 server.
      aws_secret_access_key: abcde123  # Auth secret to access the s3 server
    file_template: test_output/{scraper_name}/{id}_source.html  # Optional, Default is "output/extracted.json"
  extractor:
    save_data:
      service: local  # (local, s3) Default: local
      # Required if `service` is s3, if local these are not needed
      bucket_name: my-extracted-data-bucket
      endpoint_url: https://s3.my-server.com  # Used if not using AWS s3 bucket
      # Only needed if aws creds are not setup on the system or you want to not use the system creds
      aws_access_key_id: abcde  # Auth key to access the s3 server
      aws_secret_access_key: abcde123  # Auth secret to access the s3 server
    file_template: test_output/{scraper_name}/{id}_extracted.json  # Optional, Default is "output/source.html"
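One way to read the two ratelimit types in the config is as a delay between dispatched tasks. The sketch below is an interpretation of the comments on the type and value keys, not the library's scheduler; seconds_between_tasks is a hypothetical helper and the even spacing for period is an assumption.

```python
def seconds_between_tasks(ratelimit_type, value, num_tasks):
    """Return the delay in seconds between two dispatched tasks."""
    if ratelimit_type == 'qps':
        # value is queries per second, so the gap is its inverse.
        return 1.0 / value
    if ratelimit_type == 'period':
        # value is hours; assume the tasks are spread evenly over that window.
        return (value * 3600.0) / num_tasks
    raise ValueError('Unknown ratelimit type: {}'.format(ratelimit_type))


qps_delay = seconds_between_tasks('qps', 2, num_tasks=1000)
period_delay = seconds_between_tasks('period', 5, num_tasks=1000)
print(qps_delay, period_delay)
```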
If you are using the file_template config, a python .format() runs on this string so you can use {key_name} to make it dynamic. The keys you will have direct access to are the following:
- All keys in your task that was dispatched
- Anything you pass into the template_values={} kwarg for the .save() fn
- All values in scraper.log_extras. Currently scraper_name & run_id
- time_downloaded: time (utc) passed from the downloader (in both the downloader and extractor)
- date_downloaded: date (utc) passed from the downloader (in both the downloader and extractor)
- time_extracted: time (utc) passed from the extractor (just in the extractor)
- date_extracted: date (utc) passed from the extractor (just in the extractor)
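A small sketch of how such a template resolves with str.format(). The task, template_values and log_extras contents are invented for the example, and the date string format is an assumption.

```python
file_template = 'test_output/{scraper_name}/{id}_source.html'

# Hypothetical values standing in for the sources listed above.
task = {'id': 42, 'url': 'https://example.com'}
template_values = {'page': 1}
log_extras = {'scraper_name': 'search', 'run_id': 'abc123'}

# Combine every available key; str.format() ignores keys the template does not use.
context = {**task, **template_values, **log_extras, 'date_downloaded': '2019-01-01'}
path = file_template.format(**context)
print(path)
```

A key that appears in the template but not in the combined values raises a KeyError, so every placeholder needs a matching key.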
Anything under the default section can also have its own value per scraper. So if we have a scraper named search and we want it to use a different rate limit than all the other scrapers, you can do:
# Name of the python file
search:
  dispatch:
    ratelimit:
      type: period
      value: 5
To override the value in the above snippet using an environment variable, set DISPATCH_RATELIMIT_VALUE=1. This will override all dispatch ratelimit values in default and custom.
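Judging from that single example, the environment variable name looks like the yaml key path joined with underscores and upper-cased. The sketch below applies that inferred convention to a nested dict; it is an assumption drawn from the one name shown, and env_var_names is a hypothetical helper.

```python
def env_var_names(config, prefix=()):
    """Yield (ENV_VAR_NAME, value) for each leaf value in a nested dict."""
    for key, value in config.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            yield from env_var_names(value, path)
        else:
            yield '_'.join(path).upper(), value


# The section under a scraper name (or under default) from the yaml above.
scraper_config = {'dispatch': {'ratelimit': {'type': 'period', 'value': 5}}}
names = dict(env_var_names(scraper_config))
print(names)
```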
If you run into the error "may have been in progress in another thread when fork() was called." when running the scraper locally on a Mac, set the env var: export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
This is because of a security setting on Macs when spawning threads from threads; see ansible/ansible#32499 (comment).