- :setting:`DOWNLOAD_MAXSIZE` and :setting:`DOWNLOAD_WARNSIZE` are now also enforced on requests sent through Zyte API.
- Added official Python 3.13 support, removed official Python 3.8 support.
- Fixed a race condition that could allow more Zyte API requests than those configured in the :setting:`ZYTE_API_MAX_REQUESTS` setting.
- Added support for ``zyte_common_items.JobPostingNavigation`` to the scrapy-poet provider.
- Added support for :ref:`custom attribute extraction <custom-attrs>`.
- Added the :class:`~scrapy_zyte_api.LocationSessionConfig` class.
- Fixed an issue in the handling of excessive session initialization failures
  during session refreshing, which would manifest as asyncio messages about
  unretrieved ``TooManyBadSessionInits`` task exceptions instead of stopping
  the spider as intended.
- ``scrapy-zyte-api[provider]`` now requires :doc:`zyte-common-items <zyte-common-items:index>` 0.20.0+.
- Added the :setting:`ZYTE_API_AUTO_FIELD_STATS` setting.
- Added the :func:`~scrapy_zyte_api.is_session_init_request` function.
- Added the :data:`~scrapy_zyte_api.session_config_registry` variable.
Backward-incompatible change: The precedence of session param settings, request metadata keys and session config override methods has changed.
Before, priority from higher to lower was:
- :meth:`~scrapy_zyte_api.SessionConfig.params`
- :meth:`~scrapy_zyte_api.SessionConfig.location`
- :reqmeta:`zyte_api_session_location`
- :setting:`ZYTE_API_SESSION_LOCATION`
- :reqmeta:`zyte_api_session_params`
- :setting:`ZYTE_API_SESSION_PARAMS`
Now, it is:
When using the :reqmeta:`zyte_api_session_params` or :reqmeta:`zyte_api_session_location` request metadata keys, a different pool ID is now generated by default based on their value. See :meth:`~scrapy_zyte_api.SessionConfig.pool` for details.
The new :reqmeta:`zyte_api_session_pool` request metadata key allows overriding the pool ID of a request.
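As a rough sketch of how these request metadata keys might be combined, the dictionaries below could be passed as the ``meta`` argument of a Scrapy request. The postal code, the pool name and the variable names are illustrative assumptions, not values taken from this changelog.

```python
# Hypothetical request metadata; the location value and the pool name are
# made-up examples.
meta_default_pool = {
    # By default, a pool ID is now derived from this value.
    "zyte_api_session_location": {"postalCode": "10001"},
}

meta_custom_pool = {
    "zyte_api_session_location": {"postalCode": "10001"},
    # Explicitly override the pool ID used for this request.
    "zyte_api_session_pool": "nyc-sessions",
}
```

Either dictionary would be attached to a request through its ``meta`` argument; the second one opts out of the value-based default pool ID.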
Fixed some documentation examples where the parameters of the ``check`` method of :setting:`ZYTE_API_SESSION_CHECKER` were in reverse order.
If the :setting:`AUTOTHROTTLE_ENABLED <scrapy:AUTOTHROTTLE_ENABLED>` setting is ``False``, the delay of download slots for Zyte API requests no longer resets to zero, and instead scrapy-zyte-api respects the :setting:`DOWNLOAD_DELAY <scrapy:DOWNLOAD_DELAY>` setting and ``zyte-api@``-prefixed entries in the :setting:`DOWNLOAD_SLOTS <scrapy:DOWNLOAD_SLOTS>` setting.

A new :setting:`ZYTE_API_PRESERVE_DELAY` setting allows overriding this behavior, i.e. enabling delay resetting even if :setting:`AUTOTHROTTLE_ENABLED <scrapy:AUTOTHROTTLE_ENABLED>` is ``False`` or disabling delay resetting even if :setting:`AUTOTHROTTLE_ENABLED <scrapy:AUTOTHROTTLE_ENABLED>` is ``True``.

The :reqmeta:`zyte_api_session_location` and :reqmeta:`zyte_api_session_params` request metadata keys, if present in a request that triggers a session initialization request, will be copied into the session initialization request, so that they are available when :setting:`ZYTE_API_SESSION_CHECKER` or :meth:`SessionConfig.check <scrapy_zyte_api.SessionConfig.check>` are called for a session initialization request.
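The download delay behavior governed by ``AUTOTHROTTLE_ENABLED``, ``DOWNLOAD_DELAY``, ``DOWNLOAD_SLOTS`` and ``ZYTE_API_PRESERVE_DELAY`` could be configured along these lines. This is a minimal sketch: the domain and delay values are made up, and the meaning of ``True`` for ``ZYTE_API_PRESERVE_DELAY`` (keep delays, no resetting) is an assumption inferred from the setting name.

```python
# settings.py (sketch; all values are illustrative)
AUTOTHROTTLE_ENABLED = False

# With AutoThrottle disabled, this delay is respected for Zyte API requests.
DOWNLOAD_DELAY = 0.5

# Per-slot override. The "zyte-api@" prefix comes from the changelog entry;
# the domain and the delay are hypothetical.
DOWNLOAD_SLOTS = {
    "zyte-api@example.com": {"delay": 2.0},
}

# Assumption: True keeps the configured delays instead of resetting them.
ZYTE_API_PRESERVE_DELAY = True
```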
The new :meth:`SessionConfig.enabled <scrapy_zyte_api.SessionConfig.enabled>` method allows configuring whether session management should be enabled or disabled for any given request.
A new stat, ``scrapy-zyte-api/sessions/use/disabled``, indicates the number of requests for which session management was disabled.
- Implemented a :ref:`session management API <session>`.
- The recommended position for ``ScrapyZyteAPIDownloaderMiddleware`` changed from 1000 to 633, to accommodate the new ``ScrapyZyteAPISessionDownloaderMiddleware``, which needs to be after ``ScrapyZyteAPIDownloaderMiddleware`` and before the Scrapy cookie downloader middleware (700).
- Now the :setting:`ZYTE_API_PROVIDER_PARAMS` setting and the :reqmeta:`zyte_api_provider` request metadata key can influence the resolution of an :class:`~web_poet.page_inputs.response.AnyResponse` dependency.
- The log messages from the download handler that indicate the source request URL of an exception have switched from ``ERROR`` log level to ``DEBUG``. The exceptions themselves that follow those messages will still be logged as errors unless you handle them.
- The ``Accept``, ``Accept-Encoding``, ``Accept-Language``, and ``User-Agent`` headers are now dropped automatically during :ref:`header mapping <header-mapping>` unless they have user-defined values. This fix can improve success rates on some websites when using :ref:`HTTP requests <zapi-http>`.
- ``extractFrom`` in :reqmeta:`zyte_api_provider` or :setting:`ZYTE_API_PROVIDER_PARAMS` overrides :class:`~scrapy_zyte_api.ExtractFrom` annotations.
- Updated requirement versions:
- A new :reqmeta:`zyte_api_provider` request metadata key offers the same functionality as the :setting:`ZYTE_API_PROVIDER_PARAMS` setting on a per-request basis.
- Fixed support for nested dicts, tuples and lists when defining :ref:`browser actions <browser-actions>`.
- :class:`scrapy_zyte_api.Addon` now adds :class:`scrapy_zyte_api.providers.ZyteApiProvider` to the ``SCRAPY_POET_PROVIDERS`` :ref:`scrapy-poet setting <scrapy-poet:settings>` if :doc:`scrapy-poet <scrapy-poet:index>` is installed.
- Added a :class:`scrapy_zyte_api.Actions` dependency.
- Added a :class:`scrapy_zyte_api.Screenshot` dependency.
- Added support for Python 3.12.
- Updated requirement versions:

  - :doc:`scrapy-poet <scrapy-poet:index>` >= 0.22.0
  - :doc:`web-poet <web-poet:index>` >= 0.17.0

- Added a Scrapy add-on, :class:`scrapy_zyte_api.Addon`, which simplifies
  configuring Scrapy projects to work with ``scrapy-zyte-api``.
- CI improvements.
- Fix ``"extractFrom": "httpResponseBody"`` causing both :http:`request:customHttpRequestHeaders` and :http:`request:requestHeaders`, which are incompatible with each other, to be set when using automatic request mapping.
- Removed support for Python 3.7.
- Updated requirement versions:

  - :doc:`scrapy-poet <scrapy-poet:index>` >= 0.21.0
  - :doc:`web-poet <web-poet:index>` >= 0.16.0

- Added support for :class:`web_poet.AnyResponse <web_poet.page_inputs.response.AnyResponse>` dependency.
- Added support to specify the country code via :class:`typing.Annotated` and :class:`scrapy_zyte_api.Geolocation` dependency (supported only on Python 3.9+).
- Improved tests.
Updated requirement versions:
- :doc:`scrapy-poet <scrapy-poet:index>` >= 0.20.1
Dependency injection :ref:`through scrapy-poet <scrapy-poet>` is now taken into account for request fingerprinting.
Now, when scrapy-poet is installed, the default value of the :setting:`ZYTE_API_FALLBACK_REQUEST_FINGERPRINTER_CLASS` setting is :class:`scrapy_poet.ScrapyPoetRequestFingerprinter`, and a warning will be issued if a custom value is not a subclass of :class:`~scrapy_poet.ScrapyPoetRequestFingerprinter`.
:ref:`Zyte Smart Proxy Manager special headers <spm-request-headers>` will now be dropped automatically when using :ref:`transparent mode <transparent>` or :ref:`automatic request parameters <automap>`. Where possible, they will be replaced with equivalent Zyte API parameters. In all cases, a warning will be issued.
Covered the configuration of :class:`scrapy_zyte_api.ScrapyZyteAPISpiderMiddleware` in the :ref:`setup documentation <setup>`.
:class:`~scrapy_zyte_api.ScrapyZyteAPISpiderMiddleware` was added in scrapy-zyte-api 0.13.0, and is required to automatically close spiders when all start requests fail because they are pointing to domains forbidden by Zyte API.
The assignment of a custom download slot to requests that use Zyte API now also happens in the spider middleware, not only in the downloader middleware.
This way requests get a download slot assigned before they reach the scheduler, making Zyte API requests work as expected with :class:`scrapy.pqueues.DownloaderAwarePriorityQueue`.

.. note:: New requests created from downloader middlewares do not get their
   download slot assigned before they reach the scheduler. So, unless they
   reuse the metadata from a request that did get a download slot assigned
   (e.g. retries, redirects), they will continue not to work as expected with
   :class:`~scrapy.pqueues.DownloaderAwarePriorityQueue`.

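A project that relies on the spider middleware together with the downloader-aware priority queue might be configured roughly as follows. This is only a sketch: the middleware position ``100`` is an assumption, and the setup documentation should be checked for the recommended value.

```python
# settings.py (sketch)
SPIDER_MIDDLEWARES = {
    # Assumed position; see the setup documentation for the recommended one.
    "scrapy_zyte_api.ScrapyZyteAPISpiderMiddleware": 100,
}

# Scrapy priority queue that takes download slots into account.
SCHEDULER_PRIORITY_QUEUE = "scrapy.pqueues.DownloaderAwarePriorityQueue"
```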
- Updated requirement versions:

  - andi >= 0.6.0
  - scrapy-poet >= 0.19.0
  - zyte-common-items >= 0.8.0

- Added support for ``zyte_common_items.JobPosting`` to the scrapy-poet provider.
- Updated requirement versions:

  - andi >= 0.5.0
  - scrapy-poet >= 0.18.0
  - web-poet >= 0.15.1
  - zyte-api >= 0.4.8

- The spider is now closed and the finish reason is set to ``"zyte_api_bad_key"`` or ``"zyte_api_suspended_account"`` when receiving "Authentication Key Not Found" or "Account Suspended" responses from Zyte API.
- The spider is now closed and the finish reason is set to ``"failed_forbidden_domain"`` when all start requests fail because they are pointing to domains forbidden by Zyte API.
- The spider is now closed and the finish reason is set to ``"plugin_conflict"`` if both scrapy-zyte-smartproxy and the transparent mode of scrapy-zyte-api are enabled.
- The ``extractFrom`` extraction option can now be requested by annotating the dependency with a ``scrapy_zyte_api.ExtractFrom`` member (e.g. ``product: typing.Annotated[Product, ExtractFrom.httpResponseBody]``).
- The ``Set-Cookie`` header is now removed from the response if the cookies were returned by Zyte API (as ``"experimental.responseCookies"``).
- The request fingerprinting was improved by refining which parts of the request affect the fingerprint.
- Zyte API Request IDs are now included in the error logs.
- Split README.rst into multiple documentation files and publish them on ReadTheDocs.
- Improve the documentation for the ``ZYTE_API_MAX_REQUESTS`` setting.
- Test and CI improvements.
- Unused ``<data type>Options`` (e.g. ``productOptions``) are now dropped from ``ZYTE_API_PROVIDER_PARAMS`` when sending the Zyte API request.
- When logging Zyte API requests, truncation now uses "..." instead of Unicode ellipsis.
The new ``_ZYTE_API_USER_AGENT`` setting allows customizing the user agent string reported to Zyte API.

Note that this setting is only meant for libraries and frameworks built on top of scrapy-zyte-api, to report themselves to Zyte API, for client software tracking and monitoring purposes. The value of this setting is not the ``User-Agent`` header sent to upstream websites when using Zyte API.

A new ``ZYTE_API_PROVIDER_PARAMS`` setting allows setting Zyte API parameters, like ``geolocation``, to be included in all Zyte API requests by the scrapy-poet provider.

A new ``scrapy-zyte-api/request_args/<parameter>`` stat counts the number of requests containing a given Zyte API request parameter. For example, ``scrapy-zyte-api/request_args/url`` counts the number of Zyte API requests with the URL parameter set (which should be all of them).

Experimental is treated as a namespace, and its parameters are the ones counted, i.e. there is no ``scrapy-zyte-api/request_args/experimental`` stat, but there are stats like ``scrapy-zyte-api/request_args/experimental.responseCookies``.

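The naming rule for the ``request_args`` stats can be illustrated with a small helper. This is only a sketch of the rule as described above, not the code used by scrapy-zyte-api; the function name ``request_arg_stat_names`` and the parameter values are hypothetical.

```python
def request_arg_stat_names(params):
    """Return the stat names a set of Zyte API parameters would count under.

    Illustration of the naming rule only; not the library's implementation.
    """
    prefix = "scrapy-zyte-api/request_args/"
    names = []
    for key, value in params.items():
        if key == "experimental" and isinstance(value, dict):
            # "experimental" is a namespace: its parameters are counted instead.
            names.extend(f"{prefix}experimental.{sub}" for sub in value)
        else:
            names.append(prefix + key)
    return names


# Hypothetical provider parameters, merged into an example request.
ZYTE_API_PROVIDER_PARAMS = {"geolocation": "US"}

example_stats = request_arg_stat_names(
    {
        "url": "https://example.com",
        "experimental": {"responseCookies": True},
        **ZYTE_API_PROVIDER_PARAMS,
    }
)
```

Here ``example_stats`` contains one name per top-level parameter, plus one name per parameter inside ``experimental``, and no bare ``experimental`` entry.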
- scrapy-zyte-api 0.11.0 accidentally increased the minimum required version of scrapy-poet from 0.10.0 to 0.11.0. We have reverted that change and implemented measures to prevent similar accidents in the future.
- Automatic parameter mapping no longer warns about dropping the ``Accept-Encoding`` header when the header value matches the Scrapy default.
- The README now mentions additional changes that may be necessary when switching Twisted reactors on existing projects.
- The README now explains how status codes, from Zyte API or from wrapped responses, are reflected in Scrapy stats.
- Added a ``ZYTE_API_MAX_REQUESTS`` setting to limit the number of successful Zyte API requests that a spider can send. Reaching the limit stops the spider.
- Setting ``requestCookies`` to ``[]`` in the ``zyte_api_automap`` request metadata field now triggers a warning.
- Added more data types to the scrapy-poet provider:

  - ``zyte_common_items.ProductList``
  - ``zyte_common_items.ProductNavigation``
  - ``zyte_common_items.Article``
  - ``zyte_common_items.ArticleList``
  - ``zyte_common_items.ArticleNavigation``

- Moved the new dependencies added in 0.9.0 and needed only for the scrapy-poet
  provider (``scrapy-poet``, ``web-poet``, ``zyte-common-items``) into the new
  optional feature ``[provider]``.
- Improved result caching in the scrapy-poet provider.
- Added a new setting, ``ZYTE_API_USE_ENV_PROXY``, which can be set to ``True`` to access Zyte API using a proxy configured in the local environment.
- Fixed getting the Scrapy Cloud job ID.
- Improved the documentation.
- Improved the CI configuration.
- New and updated requirements:

  - packaging >= 20.0
  - scrapy-poet >= 0.9.0
  - web-poet >= 0.13.0
  - zyte-common-items

- Added a scrapy-poet provider for Zyte API. Currently supported data types:

  - ``web_poet.BrowserHtml``
  - ``web_poet.BrowserResponse``
  - ``zyte_common_items.Product``

- Added a ``zyte_api_default_params`` request meta key which allows users to ignore the ``ZYTE_API_DEFAULT_PARAMS`` setting for individual requests.
- CI fixes.
- Fixed an exception raised by the downloader middleware when cookies were enabled.
- Made Python 3.11 support official.
- Added support for the upcoming automatic extraction feature of Zyte API.
- Included a descriptive message in the exception that triggers when the download handler cannot be initialized.
- Clarified that ``LOG_LEVEL`` must be ``DEBUG`` for ``ZYTE_API_LOG_REQUESTS`` messages to be visible.
- Fixed the handling of response cookies without a domain.
- CI fixes
- Fixed an ``AssertionError`` when cookies are disabled.
- Added links to the README to improve navigation from GitHub.
- Added a license file (BSD-3-Clause).
Added experimental cookie support:

- The ``experimental.responseCookies`` response parameter is now mapped to the response headers as ``Set-Cookie`` headers, as well as added to the cookiejar of the request.
- A new boolean setting, ``ZYTE_API_EXPERIMENTAL_COOKIES_ENABLED``, can be set to ``True`` to enable automatic mapping of cookies from a request cookiejar into the ``experimental.requestCookies`` Zyte API parameter.

- The ``ZyteAPITextResponse`` is now a subclass of ``HtmlResponse``, so that the ``open_in_browser`` function of Scrapy uses the ``.html`` extension for Zyte API responses.

  While not ideal, this is much better than the previous behavior, where the ``.html`` extension was never used for Zyte API responses.

- ``ScrapyZyteAPIDownloaderMiddleware`` now also supports non-string slot IDs.
- It is now possible to log the parameters of requests sent.
- Stats for HTTP and HTTPS traffic used to be kept separate, and only one of those sets of stats would be reported. This is fixed now.
- Fixed some code examples and references in the README.
When upgrading, you should set the following in your Scrapy settings:

.. code-block:: python

    DOWNLOADER_MIDDLEWARES = {
        "scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 633,
    }
    # only applicable for Scrapy 2.7+
    REQUEST_FINGERPRINTER_CLASS = "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter"

Fixes the issue where scrapy-zyte-api is slow when Scrapy Cloud has Autothrottle Addon enabled. The new ``ScrapyZyteAPIDownloaderMiddleware`` fixes this.

It now supports Scrapy 2.7's new ``REQUEST_FINGERPRINTER_CLASS`` which ensures that Zyte API requests are properly fingerprinted. This addresses the issue where Scrapy marks POST requests as duplicate if they point to the same URL despite having different request bodies. As a workaround, users were marking their requests with ``dont_filter=True`` to prevent such dupe filtering.

For users having ``scrapy >= 2.7``, you can simply update your Scrapy settings to have ``REQUEST_FINGERPRINTER_CLASS = "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter"``.

If your Scrapy project performs other requests aside from Zyte API, you can set ``ZYTE_API_FALLBACK_REQUEST_FINGERPRINTER_CLASS = "custom.RequestFingerprinter"`` to allow custom fingerprinting. By default, the default Scrapy request fingerprinter is used for non-Zyte API requests.

For users having ``scrapy < 2.7``, check the following link to see different ways of handling the duplicate request issue: https://github.com/scrapy-plugins/scrapy-zyte-api#request-fingerprinting-before-scrapy-27.

More information about the request fingerprinting topic can be found in https://github.com/scrapy-plugins/scrapy-zyte-api#request-fingerprinting.

Various improvements to docs and tests.
- Add a ``ZYTE_API_TRANSPARENT_MODE`` setting, ``False`` by default, which can be set to ``True`` to make all requests use Zyte API by default, with request parameters being automatically mapped to Zyte API parameters.
- Add a Request meta key, ``zyte_api_automap``, that can be used to enable automatic request parameter mapping for specific requests, or to modify the outcome of automatic request parameter mapping for specific requests.
- Add a ``ZYTE_API_AUTOMAP_PARAMS`` setting, which is a counterpart for ``ZYTE_API_DEFAULT_PARAMS`` that applies to requests where automatic request parameter mapping is enabled.
- Add the ``ZYTE_API_SKIP_HEADERS`` and ``ZYTE_API_BROWSER_HEADERS`` settings to control the automatic mapping of request headers.
- Add a ``ZYTE_API_ENABLED`` setting, ``True`` by default, which can be used to disable this plugin.
- Document how Zyte API responses are mapped to Scrapy response subclasses.
- Raise the minimum dependency of Zyte API's Python API to ``zyte-api>=0.4.0``. This changes all the requests to Zyte API to have ``Accept-Encoding: br`` and automatically decompress brotli responses.
- Rename "Zyte Data API" to simply "Zyte API" in the README.
- Lower the minimum Scrapy version from ``2.6.0`` to ``2.0.1``.
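The transparent mode and automatic mapping options listed above might be combined roughly as follows. This is a minimal sketch: the parameter values and the ``meta_*`` variable names are illustrative assumptions.

```python
# settings.py (sketch; values are illustrative)
ZYTE_API_TRANSPARENT_MODE = True  # route all requests through Zyte API
ZYTE_API_AUTOMAP_PARAMS = {"geolocation": "US"}  # applied to automapped requests

# Hypothetical per-request metadata:
# enable automatic mapping for one request when transparent mode is off.
meta_enable = {"zyte_api_automap": True}
# modify the outcome of automatic mapping for one request.
meta_modify = {"zyte_api_automap": {"browserHtml": True}}
```

The ``meta_*`` dictionaries would be passed as the ``meta`` argument of individual Scrapy requests.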
- Zyte Data API error responses (after retries) are no longer ignored, and
  instead raise a ``zyte_api.aio.errors.RequestError`` exception, which allows
  user-side handling of errors and provides better feedback for debugging.
- Allowed retry policies to be specified as import path strings, which is
  required for the ``ZYTE_API_RETRY_POLICY`` setting, and allows requests with
  the ``zyte_api_retry_policy`` request.meta key to remain serializable.
- Fixed the naming of stats for some error types.
- Updated the output examples on the README.
- Cleaned up Scrapy stats names: fixed an issue with ``//``, renamed ``scrapy-zyte-api/api_error_types/..`` to ``scrapy-zyte-api/error_types/..``, added ``scrapy-zyte-api/error_types/<empty>`` for cases where the error type is unknown.
- Added error type to the error log messages.
- Testing improvements.
Fixed incorrect 0.4.0 release.
- Requires a more recent Python client library zyte-api ≥ 0.3.0.
- Stats from zyte-api are now copied into Scrapy stats. The ``scrapy-zyte-api/request_count`` stat has been renamed to ``scrapy-zyte-api/processed`` accordingly.
- The ``CONCURRENT_REQUESTS`` Scrapy setting is now properly supported; in previous releases, the maximum concurrency of Zyte API requests was limited to 15.
- The retry policy for Zyte API requests can be overridden, using either the ``ZYTE_API_RETRY_POLICY`` setting or the ``zyte_api_retry_policy`` request.meta key.
- The proper response.status is set when Zyte API returns the ``statusCode`` field.
- The URL of the Zyte API server can be set using the ``ZYTE_API_URL`` Scrapy setting. This feature is currently used in tests.
- The minimum required Scrapy version (2.6.0) is now enforced in setup.py.
- Test and documentation improvements.
- Remove the ``Content-Decoding`` header when returning the responses. This prevents Scrapy from decompressing contents that Zyte Data API has already decompressed, which would otherwise lead to errors inside Scrapy's ``HttpCompressionMiddleware``.
- Introduce ``ZyteAPIResponse`` and ``ZyteAPITextResponse`` which are subclasses of ``scrapy.http.Response`` and ``scrapy.http.TextResponse`` respectively. These new response classes hold the raw Zyte Data API response in the ``raw_api_response`` attribute.
- Introduce a new setting named ``ZYTE_API_DEFAULT_PARAMS``.

  - At the moment, this only applies to Zyte API enabled ``scrapy.Request`` (which is declared by having the ``zyte_api`` parameter in the Request meta having valid parameters, set to ``True``, or ``{}``).

- Specify in the README to set ``dont_filter=True`` when using the same URL but with different ``zyte_api`` parameters in the Request meta. This is a current workaround since Scrapy will tag them as duplicate requests and will result in duplication filtering.
- Various documentation improvements.
- Initial release