CDS: harvest directly from OAI-PMH #198
Closed
Changes from all commits (21 commits):
fad1b50 create a OAI-PMH spider to use in CDS spider (kaplun)
33c3ae5 refactor, test contents (szymonlopaciuk)
db2953f parse_record takes the selector (szymonlopaciuk)
4890aa1 spiders: OAI-PMH: continue where left off (szymonlopaciuk)
80efc44 use celerymonitor in CDS tests (szymonlopaciuk)
adb1906 CDS spider: drop HarvestingKit (#199) (szymonlopaciuk)
fff7c95 remove unused import (szymonlopaciuk)
4895b07 fix failure on lack of last runs file (szymonlopaciuk)
b7c3fc4 remove ignoring the exception on item validation (szymonlopaciuk)
bb5c834 style fixes (szymonlopaciuk)
acf9125 bump inspire-dojson~=57.0,>=57.1 (szymonlopaciuk)
9a4f285 remove record_class field, as Record is default (szymonlopaciuk)
077c1f1 use os.path.json in cds_spider (szymonlopaciuk)
054aa0b remove url from the last_run file hash (szymonlopaciuk)
b3159f7 remove granularity, default to YYYY-MM-DD for now (szymonlopaciuk)
10804f7 refactor tests (szymonlopaciuk)
5851258 stricter error catching when loading last_runs (szymonlopaciuk)
23c3d90 leave only a few test records, remove the rest (szymonlopaciuk)
332071f tests: naming nad don't load directly from file (szymonlopaciuk)
a96f3c4 make parse_record abstract (szymonlopaciuk)
6b7d886 spiders: move Statetul and OAI to common module (szymonlopaciuk)
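One of the commits above defaults the harvesting granularity to day level (YYYY-MM-DD). In OAI-PMH, the `from` and `until` arguments must match the repository's declared granularity; a minimal sketch of formatting a resumption date that way (the variable names are illustrative, not taken from the PR):

```python
from datetime import date

# Format a 'from' date at day granularity (YYYY-MM-DD), the default
# the spider falls back to per the commit above.
last_run = date(2017, 12, 1)
from_arg = last_run.strftime('%Y-%m-%d')
```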
@@ -0,0 +1,22 @@
# -*- coding: utf-8 -*-
#
# This file is part of hepcrawl.
# Copyright (C) 2016, 2017 CERN.
#
# hepcrawl is a free software; you can redistribute it and/or modify it
# under the terms of the Revised BSD License; see LICENSE file for
# more details.

"""Additional downloaders."""

from scrapy.http import Response


class DummyDownloadHandler(object):
    def __init__(self, *args, **kwargs):
        pass

    def download_request(self, request, spider):
        url = request.url
        return Response(url, request=request)
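For context, a handler like this is typically wired up through Scrapy's `DOWNLOAD_HANDLERS` setting, so that requests for a given URL scheme bypass the real downloader. The module path below is an assumption for illustration; the actual path depends on where this file lives in the repository:

```python
# Hypothetical settings fragment: route the custom oaipmh+http(s)
# schemes discussed in the review to the dummy handler, so Scrapy
# never tries to fetch those URLs itself.
DOWNLOAD_HANDLERS = {
    'oaipmh+http': 'hepcrawl.downloaders.DummyDownloadHandler',   # assumed path
    'oaipmh+https': 'hepcrawl.downloaders.DummyDownloadHandler',  # assumed path
}
```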
@@ -0,0 +1,13 @@
# -*- coding: utf-8 -*-
#
# This file is part of hepcrawl.
# Copyright (C) 2015, 2016, 2017, 2018 CERN.
#
# hepcrawl is a free software; you can redistribute it and/or modify it
# under the terms of the Revised BSD License; see LICENSE file for
# more details.

from __future__ import absolute_import, division, print_function

from .oaipmh_spider import OAIPMHSpider
from .stateful_spider import StatefulSpider
Review comment: this seems to be needed because for oai-pmh, you pass `oaipmh+http(s)` in the URL. Cannot you just pass the real URL starting with `http(s)` instead and skip this whole download handler business?

Reply: The issue is that we're using Sickle to do OAI stuff, and it does all of its downloading on its own, only exposing an iterator over the records. Since it doesn't expose any of its requests/responses, there is no nice way to integrate it with Scrapy, so we bypass it.

Reply: IMHO nope, because `scrapy` would otherwise try to fetch the URL with its default methods, thus not allowing me to pass the ball to `sickle`.

Reply: OK, I understand. `XMLFeedSpider` seems to have no such problems...

Review comment: so IIUC, this whole contraption is done so that in the end, `start_requests` calls `parse`. If that's all you want, cannot you call `parse` from `start_requests` directly, given that you don't care about the response?

Reply: I was trying something like this yesterday and was having issues. I didn't investigate very deeply, but I think Scrapy needs a request to begin crawling (either through `start_requests` or `url`). It expects `start_requests` to return a `Request` and doesn't allow anything else.
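The constraint described above can be sketched without Scrapy at all. The classes below are minimal stand-ins for `scrapy.http.Request`/`Response` and the engine loop, written only to illustrate why the dummy handler exists: the engine insists on a `Request` from `start_requests` and on some handler producing a `Response` before a callback like `parse()` runs.

```python
# Minimal stand-ins for Scrapy's Request/Response (illustration only).
class Request(object):
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback

class Response(object):
    def __init__(self, url, request=None):
        self.url = url
        self.request = request

class DummyDownloadHandler(object):
    def download_request(self, request, spider=None):
        # Return an empty Response instead of fetching the URL, so the
        # real harvesting (done by Sickle) can happen in the callback.
        return Response(request.url, request=request)

def crawl_once(request, handler):
    """Mimic the engine: 'download', then hand the response to the callback."""
    response = handler.download_request(request)
    return request.callback(response)

def parse(response):
    # In the real spider, this is where Sickle would iterate the records.
    return 'parsed %s' % response.url

result = crawl_once(Request('oaipmh+https://cds.cern.ch/oai2d', parse),
                    DummyDownloadHandler())
```

This mirrors the flow the reviewers discuss: the request's custom scheme never reaches a real downloader, yet the engine's Request-then-Response contract is satisfied.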