CDS: harvest directly from OAI-PMH #198
Conversation
@szymonlopaciuk here's the same code we talked about yesterday. It still requires changes though, like the state management and such.
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
now = datetime.utcnow()
request = Request('oaipmh+{}'.format(self.url), self.parse)
yield request
self.state[self.alias] = self._format_date(now)
```
@david-caro it does support state management, provided the spider is launched with the appropriate flag.
The state is not persisted; it only exists in this instance of the spider (it should also be using the `StatefulSpider`). We have to find a way to properly persist it. The one we were using before (enabling the job dir, https://doc.scrapy.org/en/latest/topics/jobs.html#keeping-persistent-state-between-batches) does not work for parallel runs on scrapyd: the job dir is shared between the parallel runs of the spider, and they hit race conditions when reading/writing.
cannot you adapt the crawl-once thing to work here?
crawl-once uses a local db to store key-values; maybe we would be able to use it for this, though I'm not sure whether that would be abusing it too much.
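For reference, one way out (roughly where the thread heads later, with the last-run file) is to persist a small JSON file per spider and OAI set instead of sharing Scrapy's jobdir. A sketch only; `_last_run_file_path` is a hypothetical helper deriving a unique path from the spider name, endpoint, and set:

```python
import json
import os

# Sketch: persist when the harvest started, so the next run can resume
# from that date. Each spider/endpoint/set combination gets its own file,
# so parallel scrapyd runs do not race on a shared jobdir.
def _save_run(self, started_at):
    last_run_info = {
        'spider': self.name,
        'url': self.url,
        'last_run_started_at': started_at.isoformat(),
    }
    file_path = self._last_run_file_path()  # hypothetical helper
    dir_path = os.path.dirname(file_path)
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)
    with open(file_path, 'w') as f:
        json.dump(last_run_info, f)
```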
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
name = 'OAI-PMH'
state = {}

def __init__(self, url, metadata_prefix='marcxml', set=None, alias=None, from_date=None, until_date=None, granularity='YYYY-MM-DD', record_class=Record, *args, **kwargs):
```
`set` is a builtin; this should be `set_`.
Maybe it should be `sets`, in the plural?
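Either way, the shadowing is easy to avoid by renaming only the keyword argument; a minimal sketch (`oai_set` is the name that appears later in this thread):

```python
def __init__(self, url, metadata_prefix='marcxml', oai_set=None,
             from_date=None, until_date=None, *args, **kwargs):
    # only the keyword argument is renamed; storing it as an instance
    # attribute named `set` does not shadow the builtin
    self.set = oai_set
```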
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
    else:
        raise RuntimeError("Invalid granularity: %s" % self.granularity)

def _make_alias(self):
```
Why is this needed? You can have `tuple`s as keys of `dict`s.
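For instance, instead of building a string alias, the state dict could be keyed directly on a tuple (a sketch, using names from this diff):

```python
# tuples are hashable, so they work as dict keys out of the box
state = {}
state[(url, metadata_prefix, oai_set)] = last_run_date
```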
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
    'GetRecord': self.record_class,
})
try:
    records = sickle.ListRecords(**{
```
Why not simply `key=value` instead of passing a `dict`?
`from` is a reserved keyword.
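That is, the dict unpacking is there because the OAI-PMH parameter is literally named `from`, which Python won't accept as a keyword argument. A minimal illustration:

```python
# sickle.ListRecords(from=self.from_date)  # SyntaxError: invalid syntax
records = sickle.ListRecords(**{
    'metadataPrefix': self.metadata_prefix,
    'set': self.set,
    'from': self.from_date,   # the reserved keyword forces the dict form
    'until': self.until_date,
})
```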
hepcrawl/spiders/oaipmh_spider.py
Outdated
""" | ||
return record.xml | ||
|
||
def parse(self, response): |
`response` seems not to be used. Is it receiving no useful info from `start_requests`?
This is the dummy response that is then ignored.
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
elif self.granularity == 'YYYY-MM-DDThh:mm:ssZ':
    return datetime_object.strftime('%Y-%m-%dT%H:%M:%SZ')
else:
    raise RuntimeError("Invalid granularity: %s" % self.granularity)
```
I would say that this is a `ValueError`.
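That is, since a bad granularity is an invalid argument value rather than a generic runtime failure:

```python
raise ValueError('Invalid granularity: %s' % self.granularity)
```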
```python
from scrapy.http import Response


class DummyDownloadHandler(object):
```
This seems to be needed because, for OAI-PMH, you pass `oaipmh+http(s)` in the URL. Cannot you just pass the real URL starting with `http(s)` instead and skip this whole download handler business?
The issue is that we're using Sickle to do the OAI stuff, and it does all of its downloading on its own, only exposing an iterator over the records. Since it doesn't expose any of its requests/responses, there is no nice way to integrate it with Scrapy, so we bypass it.
IMHO nope, because `scrapy` would otherwise try to fetch the URL with its default methods, thus not allowing me to pass the ball to `sickle`.
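For context, the handler itself can stay tiny; a sketch of the pattern (the settings module path here is an assumption):

```python
from scrapy.http import Response


class DummyDownloadHandler(object):
    """Registered for the oaipmh+http(s) schemes.

    Instead of fetching anything, it hands back an empty Response so the
    crawl proceeds to the callback, where Sickle does the real HTTP.
    """

    def __init__(self, *args, **kwargs):
        pass

    def download_request(self, request, spider):
        return Response(request.url, request=request)


# in settings.py (hypothetical module path):
DOWNLOAD_HANDLERS = {
    'oaipmh+http': 'hepcrawl.downloaders.DummyDownloadHandler',
    'oaipmh+https': 'hepcrawl.downloaders.DummyDownloadHandler',
}
```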
OK, I understand. `XMLFeedSpider` seems to have no such problems...
So IIUC, this whole contraption is done so that in the end, `start_requests` calls `parse`. If that's all you want, cannot you call `parse` from `start_requests` directly, given that you don't care about the response?
I was trying something like this yesterday and was having issues. I didn't investigate very deeply, but I think Scrapy needs a request to begin crawling (either through `start_requests` or `url`). It expects `start_requests` to return a `Request` and doesn't allow anything else.
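And that constraint is exactly what the dummy scheme satisfies: `start_requests` yields a single placeholder `Request` whose response is then ignored, as in the diff above:

```python
def start_requests(self):
    # the custom scheme routes this to DummyDownloadHandler, so no real
    # HTTP happens here; parse() then drives Sickle over the records
    yield Request('oaipmh+{}'.format(self.url), self.parse)
```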
hepcrawl/spiders/cds_spider.py
Outdated
```python
def parse_record(self, record):
    response = XmlResponse(self.url, encoding='utf-8', body=record.raw)
    selector = Selector(response, type='xml')
    selector.remove_namespaces()
```
These three lines should probably be part of the `OAIPMHSpider`.
Not sure if in general one wants to always remove namespaces.
OK, the first two lines then :)
hepcrawl/spiders/cds_spider.py
Outdated
```python
def start_requests(self):
    yield Request(self.source_file)

def __init__(self, from_date=None, set="forINSPIRE", *args, **kwargs):
    super(CDSSpider, self).__init__(url='http://cds.cern.ch/oai2d', metadata_prefix='marcxml', set=set, from_date=from_date, **kwargs)
```
This is not very nice. I think it's better to have the same kind of API as `XMLFeedSpider`, just assigning some variables.
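That is, subclasses would mostly declare class attributes rather than forward a long `__init__` call; a sketch of the suggested shape (the attribute names are assumptions):

```python
class CDSSpider(OAIPMHSpider):
    """Spider harvesting CDS via OAI-PMH, declared XMLFeedSpider-style."""

    name = 'CDS'
    url = 'http://cds.cern.ch/oai2d'
    metadata_prefix = 'marcxml'
    oai_set = 'forINSPIRE'
```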
hepcrawl/scrapy.cfg
Outdated
```diff
@@ -14,7 +14,7 @@
 default = hepcrawl.settings

 [deploy]
-url = http://scrapyd:6800/
+url = http://localhost:6800/
```
why this change?
to run it locally (not in docker)
It was just caught in my WIP-PR
Force-pushed from 1a529d1 to b500aa0.
Force-pushed from c6fbed0 to 8ffacc7.
hepcrawl/pipelines.py
Outdated
```python
    validate(hep_record, 'hep')
    spider.logger.debug('Validated item by Inspire Schemas.')
except Exception as err:
    spider.logger.error('ERROR in validating {}: {}'.format(hep_record, err))
```
Make sure that this is needed, and if it is, change the `.format` to the `error('message %s', message)` form.
This seemed safe to remove, as I checked and Scrapy does not stop unless a certain number of exceptions is raised (`CLOSESPIDER_ERRORCOUNT`), which by default allows infinitely many. However, I removed this and all of the records failed with `ValidationError: u'_collections' is a required property`. This seems odd, as the record generation is done by DoJSON and the code looks basically the same as the DESY spider. I will investigate, but maybe you know something about this.
Nothing pops up right now, will have to check :/
Are you using a version of DoJSON that includes the CDS conversion? You didn't bump the version in setup.py, so maybe you are using an older version that does no preprocessing for CDS.
@michamos I bumped and reinstalled the dependencies; I now have version 57.1.5, no change. Is there any special way I need to be using DoJSON in this case?
After discussing IRL, it turns out the new API that includes the CDS conversion was not used.
hepcrawl/spiders/cds_spider.py
Outdated
```python
from ..utils import ParsedItem

logger = logging.getLogger(__name__)
```
Two blank lines between imports, globals, and defs/classes; and globals should be all caps.
hepcrawl/spiders/cds_spider.py
Outdated
```python
with app.app_context():
    json_record = hep.do(record)
base_uri = self.settings['SCHEMA_BASE_URI']
json_record['$schema'] = base_uri + 'hep.json'
```
`os.path.join`.
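That is, instead of string concatenation, something like the sketch below; on POSIX, `os.path.join` produces a single slash between the parts whether or not `base_uri` has a trailing one:

```python
import os

json_record['$schema'] = os.path.join(base_uri, 'hep.json')
```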
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
# -*- coding: utf-8 -*-
#
# This file is part of hepcrawl.
# Copyright (C) 2015, 2016, 2017 CERN.
```
The year here is just 2017.
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
from scrapy.http import Request, XmlResponse
from scrapy.selector import Selector
from . import StatefulSpider
```
Two blank lines and uppercase globals here as well.
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
    next harvest.
    """
    name = 'OAI-PMH'
    granularity = _Granularity.DATE
```
Let's drop the granularity stuff, we can always implement it later if needed.
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
def _make_alias(self):
    return '{url}?metadataPrefix={metadata_prefix}&set={set}'.format(
        url=self.url,
```
Don't use the URL for the alias.
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
self.metadata_prefix = metadata_prefix
self.set = oai_set
self.granularity = granularity
self.alias = alias or self._make_alias()
```
This property is not needed
hepcrawl/spiders/cds_spider.py
Outdated
```python
    except Exception:
        logger.exception("Error when parsing record")
        return None
    record = create_record(selector.xpath('.//record').extract()[0])
```
`extract()[0]` is equivalent to `extract_first()`, except that the former raises if the list is empty, while the latter returns `None` (which can be overridden).
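Side by side:

```python
# raises IndexError when nothing matches:
record = create_record(selector.xpath('.//record').extract()[0])
# returns None instead, or a default of your choosing:
record = create_record(selector.xpath('.//record').extract_first())
record = create_record(selector.xpath('.//record').extract_first(default='<record/>'))
```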
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
self.alias = alias or self._make_alias()
self.from_date = from_date
self.until_date = until_date
self.record_class = record_class
```
If this `record_class` is not needed, we should remove it.
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
    last_run = json.load(f)
    logger.info('Last run file loaded: {}'.format(repr(last_run)))
    return last_run
except IOError:
```
Here, check only whether the file exists (first run); any other error should raise.
And if it's not there, raise an exception (a subclass of `Exception` with a meaningful name), then try-catch it in whatever calls this.
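A sketch of that shape (the exception name and the `_last_run_file_path` helper are assumptions; the commit below only says such an exception was created):

```python
import errno
import json


class NoLastRunToLoad(Exception):
    """Raised when the last-run file for this spider and set is missing."""


def _load_last_run(self):
    file_path = self._last_run_file_path()  # hypothetical helper
    try:
        with open(file_path) as f:
            return json.load(f)
    except IOError as e:
        if e.errno == errno.ENOENT:
            # first run: the file simply does not exist yet
            raise NoLastRunToLoad('No last run file at %s' % file_path)
        raise  # any other I/O problem should propagate


# ...and in the caller, a first run is handled explicitly:
try:
    last_run = self._load_last_run()
except NoLastRunToLoad:
    last_run = None
```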
Harvests CDS through dojson directly: closes inspirehep#199. Signed-off-by: Szymon Łopaciuk <[email protected]>
Fixes a bug that would raise a TypeError when no last run was present. Signed-off-by: Szymon Łopaciuk <[email protected]>
According to the docs, the spider will only be closed after CLOSESPIDER_ERRORCOUNT exceptions, which by default allows infinitely many. Thus this is not needed, and it's better if we know when there are validation issues. Signed-off-by: Szymon Łopaciuk <[email protected]>
Create an exception for when a last run file doesn't exist. Signed-off-by: Szymon Łopaciuk <[email protected]>
hepcrawl/spiders/cds_spider.py
Outdated
```python
    self.settings.getdict('MARC_TO_HEP_SETTINGS', {})
)
with app.app_context():
    json_record = hep.do(record)
```
This shouldn't use `hep.do` but the new `marcxml2record` API. Otherwise, the CDS conversion is not triggered.
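For reference, a sketch of the newer entry point (assuming a current inspire-dojson, which exposes `marcxml2record` at the top level):

```python
from inspire_dojson import marcxml2record

# unlike hep.do on an already-parsed record, marcxml2record takes the raw
# MARCXML string and applies the CDS preprocessing before the hep rules
json_record = marcxml2record(record.raw)
```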
I think I also added this in the last PR about arXiv, as it was failing the tests, but 👍 if we don't need it.
Superseded by #216.
This depends on #200.
Description
This implements a universal OAI-PMH spider to be used for harvesting records from OAI sources, in this instance from CDS. It drops the HarvestingKit dependency from the CDS spider, as inspirehep/inspire-dojson#165 added support to harvest CDS directly.
Issues
Closes #197 and #199.