add ScrapyCloudCollectionCache #30

Open · wants to merge 3 commits into master
Conversation

asadurski (Member) commented:
Adds a custom cache that stores raw responses from AutoExtract to Scrapy Cloud collections.
Adds settings:

  • AUTOEXTRACT_CACHE_COLLECTION
  • DEV_PROJECT

Adds docs.

(Bonus: minor fixes to existing docs.)
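A minimal sketch of how the new settings might be wired up in a project's settings.py — the two setting names come from this PR, but the example values and the comments describing their semantics are assumptions:

```python
# settings.py — illustrative values only; the setting names are from this PR,
# everything else here is an assumption.
AUTOEXTRACT_CACHE_COLLECTION = "autoextract_cache"  # Scrapy Cloud collection to store raw responses in
DEV_PROJECT = 123456  # Scrapy Cloud project ID to use when running outside Scrapy Cloud
```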

codecov bot commented Nov 30, 2021:

Codecov Report

Merging #30 (6c0d2fe) into master (5baa342) will decrease coverage by 3.12%.
The diff coverage is 54.16%.

@@            Coverage Diff             @@
##           master      #30      +/-   ##
==========================================
- Coverage   85.24%   82.12%   -3.12%     
==========================================
  Files           9        9              
  Lines         488      526      +38     
==========================================
+ Hits          416      432      +16     
- Misses         72       94      +22     
Impacted Files                    Coverage Δ
scrapy_autoextract/utils.py       36.36% <16.66%> (-23.64%) ⬇️
scrapy_autoextract/cache.py       60.27% <50.00%> (-4.02%) ⬇️
scrapy_autoextract/providers.py   91.30% <83.33%> (-1.46%) ⬇️

    )

    def close(self):
        pass
Contributor:
Shouldn't we be closing the sc client here?

asadurski (Member, Author):
That's right. Thanks!
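A minimal sketch of the agreed fix — it assumes the cache holds a python-scrapinghub ScrapinghubClient instance on an attribute (the attribute name below is an assumption):

```python
class ScrapyCloudCollectionCache:
    ...

    def close(self):
        # Release the HTTP resources held by the Scrapy Cloud client.
        # ScrapinghubClient.close() is provided by python-scrapinghub;
        # the self.sc_client attribute name is an assumption.
        self.sc_client.close()
```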

ivanprado (Contributor) left a comment:
@asadurski it looks great. I left a minor comment. The approach in general makes sense to me.

If I understand it correctly, it will create a collection named after the job ID, so the collection starts out empty, and then every request triggers a lookup against the cache even though nothing will be found there (because every request is new). This is inefficient and slows crawling down for no reason.

At the same time, it would be nice to have a reusable cache built on collections. I imagine all jobs for a particular spider could share this cache, which would speed them up a lot!

So I propose making the configuration more flexible (see the sketch after this list):

  • Add a write-only flag, so that responses are saved but no lookups are made against the cache.
  • Add a way to customize the name of the collection, so that it can be reused across jobs.
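A sketch of what that configuration could look like — AUTOEXTRACT_CACHE_COLLECTION exists in this PR, but the write-only flag name below is hypothetical; only the behavior comes from the review:

```python
# settings.py — sketch of the proposed flexibility. AUTOEXTRACT_CACHE_WRITE_ONLY
# is a hypothetical setting name; only the described behavior is from the review.
AUTOEXTRACT_CACHE_COLLECTION = "my_spider_shared_cache"  # fixed name, reused across jobs
AUTOEXTRACT_CACHE_WRITE_ONLY = True  # store responses, but never read from the cache
```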


@classmethod
def fingerprint(cls, request: Request) -> str:
return request.url
Member:
Hey! What's the reason only the URL is used as a fingerprint?

asadurski (Member, Author):

Agreed, that probably wasn't optimal. I've introduced the same fingerprint mechanism as in AutoExtractCache.
It was like this for simplicity: the cached items are later retrieved by our QA tool, MATT, and compared with job items. Now I just have to modify the retrieval in MATT to reach into the item contents instead of using the collection key.
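For reference, a sketch of a request-based fingerprint along the lines of what AutoExtractCache uses — the exact implementation isn't shown in this thread, so the as_dict() call on the autoextract Request and the choice of SHA-256 are assumptions:

```python
import hashlib
import json

from autoextract.request import Request  # AutoExtract client library


class ScrapyCloudCollectionCache:
    @classmethod
    def fingerprint(cls, request: Request) -> str:
        # Hash the whole serialized query rather than the bare URL, so two
        # requests for the same URL but different page types get distinct keys.
        payload = json.dumps(request.as_dict(), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```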

ivanprado (Contributor) commented:

Hi @asadurski. I'm moving the cache mechanism to scrapy-poet here: scrapinghub/scrapy-poet#55. In the future, it might be a good idea to remove the cache from here entirely and put it all there. Meanwhile, we can keep the one here as well.
