add ScrapyCloudCollectionCache #30

asadurski · 2021-11-30T10:30:25Z

Adds a custom cache that stores raw responses from AutoExtract to ScrapyCloud collections.
Adds settings:

AUTOEXTRACT_CACHE_COLLECTION
DEV_PROJECT
Adds docs.

(Bonus: minor fixes to docs).

codecov · 2021-11-30T10:31:27Z

Codecov Report

Merging #30 (6c0d2fe) into master (5baa342) will decrease coverage by 3.11%.
The diff coverage is 54.16%.

@@            Coverage Diff             @@
##           master      #30      +/-   ##
==========================================
- Coverage   85.24%   82.12%   -3.12%     
==========================================
  Files           9        9              
  Lines         488      526      +38     
==========================================
+ Hits          416      432      +16     
- Misses         72       94      +22

Impacted Files	Coverage Δ
scrapy_autoextract/utils.py	`36.36% <16.66%> (-23.64%)`	⬇️
scrapy_autoextract/cache.py	`60.27% <50.00%> (-4.02%)`	⬇️
scrapy_autoextract/providers.py	`91.30% <83.33%> (-1.46%)`	⬇️

ivanprado · 2021-11-30T15:34:28Z

scrapy_autoextract/cache.py

+        )
+
+    def close(self):
+        pass


Shouldn't be closing the sc client here?

That's right. Thanks!

ivanprado

@asadurski it looks great. I left a minor comment. The approach in general makes sense to me.

If I understand it well, it will create a collection using the jobid, so it will be empty, and then for every request there will be one request to the cache, even if nothing will be found there (because the request is new). This can be inefficient and can make crawling slower for no reason.

At the same time, it would be nice to have a reusable cache that is using collections. I imagine that all jobs for a particular spider could be using this cache, and this will speed up it by a lot!

So I propose here may be to be more flexible in the configuration:

To have a write-only flag, so that the responses are saved but no requests are done against the cache.
To have a way to customize the name of the collection to be used. So that it can be reused across jobs.

kmike · 2021-12-01T10:03:47Z

scrapy_autoextract/cache.py

+
+    @classmethod
+    def fingerprint(cls, request: Request) -> str:
+        return request.url


hey! What's the reason just an URL is used as a fingerprint?

Agreed, that probably wasn't optimal. Introduced the same fingerprint mechanism as in AutoExtractCache.
It was like this for simplicity - those cached items are later retrieved in QA tool, MATT, and compared with job items. Now I just have to modify the retrieval in MATT to reach to item contents instead of the collection key.

ivanprado · 2021-12-02T15:58:21Z

Hi @asadurski . I'm moving the cache mechanism to scrapy-poet here scrapinghub/scrapy-poet#55. In the future, it might be a good idea to remove the cache from here at all and putting all there. Meanwhile, we can keep the one here as well.

…responses

add ScrapyCloudCollectionCache

81805e4

ivanprado reviewed Nov 30, 2021

View reviewed changes

kmike reviewed Dec 1, 2021

View reviewed changes

asadurski added 2 commits December 23, 2021 18:00

implement suggestions from PR comments

e913128

Merge remote-tracking branch 'origin/master' into record-autoextract-…

6c0d2fe

…responses

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add ScrapyCloudCollectionCache #30

add ScrapyCloudCollectionCache #30

asadurski commented Nov 30, 2021

codecov bot commented Nov 30, 2021 •

edited

Loading

ivanprado Nov 30, 2021

asadurski Dec 23, 2021

ivanprado left a comment •

edited

Loading

kmike Dec 1, 2021

asadurski Dec 23, 2021

ivanprado commented Dec 2, 2021

add ScrapyCloudCollectionCache #30

Are you sure you want to change the base?

add ScrapyCloudCollectionCache #30

Conversation

asadurski commented Nov 30, 2021

codecov bot commented Nov 30, 2021 • edited Loading

Codecov Report

ivanprado Nov 30, 2021

Choose a reason for hiding this comment

asadurski Dec 23, 2021

Choose a reason for hiding this comment

ivanprado left a comment • edited Loading

Choose a reason for hiding this comment

kmike Dec 1, 2021

Choose a reason for hiding this comment

asadurski Dec 23, 2021

Choose a reason for hiding this comment

ivanprado commented Dec 2, 2021

codecov bot commented Nov 30, 2021 •

edited

Loading

ivanprado left a comment •

edited

Loading