- Added `ignore_missing_null_keys` argument for running tests. Defaults to `False`, which is the same behavior as before. If set to `True`, it will ignore any keys that are missing from the test data but `None` in the extracted data. This is useful when you added a new field that does not exist in the older test files; this way you do not need to update older test files when it is not needed.
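A minimal sketch of how such a comparison might behave; `compare_extract` is a hypothetical helper for illustration, not the library's actual test runner.

```python
def compare_extract(test_data, extracted, ignore_missing_null_keys=False):
    """Compare an extracted item against saved test data.

    With ignore_missing_null_keys=True, keys missing from the saved test
    data but None in the extracted data are skipped, so older test files
    do not need updating every time a new field is added.
    """
    mismatches = []
    for key, value in extracted.items():
        if key not in test_data:
            if ignore_missing_null_keys and value is None:
                continue  # new field, not in the older test file; skip it
            mismatches.append(key)
        elif test_data[key] != value:
            mismatches.append(key)
    return mismatches
```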
- Allow a scraper's `extract_task` callback to return a list of dicts instead of just a single dict. This allows a scraper to extract multiple items from a single listing if needed while treating them as separate results.
- Warning: If your scraper currently returns a list of extracts from its `extract_task` callback, the `post_extract` task now runs on each item, not on the list as a whole.
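A rough sketch of the new per-item behavior; the callback names follow the changelog, but the bodies and calling convention here are assumptions, not the library's actual API.

```python
# Hypothetical extract_task: one listing page yields several items,
# so it returns a list of dicts rather than a single dict.
def extract_task(listing_html):
    return [
        {"title": "Item 1", "price": 9.99},
        {"title": "Item 2", "price": 4.50},
    ]

# post_extract now runs once per item, not once for the whole list.
def post_extract(item):
    item["price_cents"] = int(round(item["price"] * 100))
    return item

results = [post_extract(item) for item in extract_task("<html>...</html>")]
```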
- Always use `utf-8` for reading and writing files
- Fixes comparing test data (#1)
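Passing the encoding explicitly avoids the platform default (e.g. cp1252 on Windows) silently differing between machines. A small stdlib-only sketch; the file name is illustrative:

```python
import tempfile
from pathlib import Path

# Always pass encoding="utf-8" explicitly when reading and writing,
# so the same bytes are produced on every platform.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.json"  # stand-in for a scraper test-data file
    sample.write_text('{"name": "café"}', encoding="utf-8")
    data = sample.read_text(encoding="utf-8")
```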
- Fixed JSON encoding error in tests when displaying the `type` of a value
- Support for multiple scrapers in a single file; the passed-in `scraper_name` will now be used for the config settings. It will still default to the file name if not supplied
- Fixed bug when reading in files to compare when running the scraper's unit tests
- Fixed error when checking the file's encoding when using the `create-test` subcommand
- Fixed encoding detection to work when extracting files from s3
- Added some testing around file encoding and rate limiting
- When reading & writing files, use the `cchardet` library to detect the correct file encoding
- Fixed reading the tests' sample data directory on Windows now that pathlib is used
- Revert of 0.5.3: do NOT ignore the unicode errors. Another solution will be needed for creating tests on both Windows and Mac/Linux
- Ignore unicode errors when reading a file
- Updated file paths in the scraper `create-test` command. On Windows it will now save the path with forward slashes `/`, not `\\` (supported by pathlib)
- Fixed outdated examples to use a new site
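The forward-slash form is what pathlib produces via `as_posix()`, and pathlib accepts `/` on every platform, so stored paths stay portable. The sample path below is illustrative:

```python
from pathlib import PureWindowsPath

# Normalize a Windows-style path to forward slashes before saving it,
# so the same test file works on Windows and Mac/Linux.
win_path = PureWindowsPath(r"tests\sample_data\page.html")
portable = win_path.as_posix()
```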
- Added 'default' as a QA option. If the key is not set in the dict returned by the extractor, it will use the default
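A hedged sketch of what a "default" QA option could look like; the rule structure and helper name here are hypothetical, not the library's actual QA config format.

```python
# Hypothetical QA rules: "default" fills in keys the extractor did not set.
qa_rules = {
    "price": {"default": 0.0},
    "title": {"default": ""},
}

def apply_qa_defaults(item, rules):
    for key, rule in rules.items():
        if key not in item and "default" in rule:
            item[key] = rule["default"]  # fall back to the configured default
    return item
```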
- Started this change log
- Added `run_id` to all scraper logs, and to the scraper's config values
- Have all scraper logs pull their extras from `scraper.log_extras()`
- Extraction error logs will have the scraper's correct filename and line number rather than where the library threw the exception
- Fixed bug of s3 endpoint not always getting set correctly for custom endpoints
- Added `pre_extract()` method to the extract class; it runs after `__init__` and lets the user set up class-wide vars
- Added AWS access key id & secret override for the `DOWNLOADER` and `EXTRACTOR`; see the config section in the readme
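A minimal sketch of the `pre_extract()` hook; the base-class shape is an assumption based on the changelog entry, not the library's actual extract class.

```python
# Hypothetical extract base class: pre_extract() runs right after __init__,
# giving subclasses a place to set up class-wide vars.
class Extract:
    def __init__(self, task):
        self.task = task
        self.pre_extract()

    def pre_extract(self):
        pass  # override in a subclass as needed


class MyExtract(Extract):
    def pre_extract(self):
        self.currency = "USD"  # shared by every extract call on this instance
```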