
CDS: harvest directly from OAI-PMH #198

Closed
wants to merge 21 commits
Conversation

@kaplun (Contributor) commented Dec 7, 2017

This depends on #200.

Description

This implements a universal OAI-PMH spider to be used for harvesting records from OAI sources, in this instance from CDS. It drops the HarvestingKit dependency from the CDS spider, as inspirehep/inspire-dojson#165 added support for harvesting CDS directly.

Issues

Closes #197 and #199.

Checklist

  • I have all the information that I need (if not, move to RFC and look for it).
  • I linked the related issue(s) in the corresponding commit logs.
  • I wrote good commit log messages.
  • My code follows the code style of this project.
  • I've added any new docs if API/utils methods were added.
  • I have updated the existing documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@david-caro (Contributor)

@szymonlopaciuk here's the same code we talked about yesterday; it still requires changes though, like the state management and such.

now = datetime.utcnow()
request = Request('oaipmh+{}'.format(self.url), self.parse)
yield request
self.state[self.alias] = self._format_date(now)
Contributor Author

@david-caro it does support state management, provided the spider is launched with the appropriate flag.

@david-caro (Contributor), Dec 7, 2017

The state is not persisted, it only exists in this instance of the spider (it should also be using the StatefulSpider). We have to find a way to properly persist it. The approach we were using before (enabling the job dir, https://doc.scrapy.org/en/latest/topics/jobs.html#keeping-persistent-state-between-batches) does not work for parallel runs on scrapyd: since the job dir is shared between the parallel runs of the spider, they get race conditions when reading/writing.
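For illustration only, a minimal sketch of the kind of per-harvest persistence being discussed; the directory, file names and helpers here are assumptions, not this PR's code:

import json
import os

LAST_RUNS_DIR = '/var/lib/hepcrawl/last_runs'  # assumed location, one file per harvest alias

def save_last_run(alias, date_string):
    # Each harvest alias gets its own file (alias must be filesystem-safe),
    # so parallel scrapyd runs of different harvests don't race on a shared job dir.
    if not os.path.exists(LAST_RUNS_DIR):
        os.makedirs(LAST_RUNS_DIR)
    with open(os.path.join(LAST_RUNS_DIR, alias + '.json'), 'w') as f:
        json.dump({'last_run': date_string}, f)

def load_last_run(alias):
    # Returns None on the very first run, when no file exists yet.
    path = os.path.join(LAST_RUNS_DIR, alias + '.json')
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)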

Contributor

Can't you adapt the crawl-once thing to work here?

Contributor

crawl-once uses a local db to store key-value pairs; maybe we would be able to use it for this, though I'm not sure if that would be abusing it too much.

name = 'OAI-PMH'
state = {}

def __init__(self, url, metadata_prefix='marcxml', set=None, alias=None, from_date=None, until_date=None, granularity='YYYY-MM-DD', record_class=Record, *args, **kwargs):
Contributor

set is a builtin, this should be set_

Contributor

maybe it should be sets, in the plural?

else:
raise RuntimeError("Invalid granularity: %s" % self.granularity)

def _make_alias(self):
Contributor

why is this needed? you can have tuples as keys of dicts.

'GetRecord': self.record_class,
})
try:
records = sickle.ListRecords(**{
Contributor

why not simply key=value instead of passing a dict?

Contributor

from is a reserved keyword, so it can't be passed as a regular keyword argument.
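To illustrate why the dict is needed (a sketch assuming the usual Sickle API; the from_date value is made up):

from sickle import Sickle

sickle = Sickle('http://cds.cern.ch/oai2d')

# sickle.ListRecords(metadataPrefix='marcxml', from='2017-11-01')
# would be a SyntaxError, because `from` is a Python keyword, so the OAI
# 'from' argument has to be passed via dict unpacking instead:
records = sickle.ListRecords(**{
    'metadataPrefix': 'marcxml',
    'set': 'forINSPIRE',
    'from': '2017-11-01',
})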

"""
return record.xml

def parse(self, response):
Contributor

response seems not to be used. Is it receiving no useful info from start_requests?

Contributor Author

This is the dummy response that is then ignored.

elif self.granularity == 'YYYY-MM-DDThh:mm:ssZ':
return datetime_object.strftime('%Y-%m-%dT%H:%M:%SZ')
else:
raise RuntimeError("Invalid granularity: %s" % self.granularity)
Contributor

I would say that this is a ValueError.

from scrapy.http import Response


class DummyDownloadHandler(object):
Contributor

this seems to be needed because for oai-pmh, you pass oaipmh+http(s) in the url. Can't you just pass the real URL starting with http(s) instead and skip this whole download handler business?

Contributor

The issue is that we're using Sickle to do the OAI stuff, and it does all of its downloading on its own, only exposing an iterator over the records. Since it doesn't expose any of its requests/responses, there is no nice way to integrate it with Scrapy, so we bypass it.

Contributor Author

IMHO nope, because scrapy would otherwise try to fetch the URL with its default methods thus not allowing me to pass the ball to sickle.

@michamos (Contributor), Dec 7, 2017

XMLFeedSpider seems to have no such problems... OK, I understand.

Contributor

so IIUC, this whole contraption is done so that in the end, start_requests calls parse. If that's all you want, can't you call parse from start_requests directly, given that you don't care about the response?

@szymonlopaciuk (Contributor), Dec 7, 2017

I was trying something like this yesterday and was having issues. I didn't investigate very deeply, but I think Scrapy needs a request to begin crawling (either through start_requests or url). It expects start_requests to return a Request and doesn't allow anything else.
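For reference, a minimal sketch of such a dummy handler and its registration; the module path in the settings snippet is hypothetical, not necessarily this PR's layout:

from scrapy.http import Response


class DummyDownloadHandler(object):
    def __init__(self, *args, **kwargs):
        pass

    def download_request(self, request, spider):
        # Never touches the network: it just hands back an empty Response so
        # that Scrapy schedules the spider callback, where Sickle does the
        # actual OAI-PMH fetching.
        return Response(request.url, request=request)


# in settings.py, routed by URL scheme (module path is an assumption):
DOWNLOAD_HANDLERS = {
    'oaipmh+http': 'hepcrawl.downloaders.DummyDownloadHandler',
    'oaipmh+https': 'hepcrawl.downloaders.DummyDownloadHandler',
}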

def parse_record(self, record):
response = XmlResponse(self.url, encoding='utf-8', body=record.raw)
selector = Selector(response, type='xml')
selector.remove_namespaces()
Contributor

these three lines should probably be part of the OAIPMHSpider.

Contributor Author

Not sure if in general one wants to always remove namespaces.

Contributor

OK, the first two lines then :)

def start_requests(self):
yield Request(self.source_file)
def __init__(self, from_date=None, set="forINSPIRE", *args, **kwargs):
super(CDSSpider, self).__init__(url='http://cds.cern.ch/oai2d', metadata_prefix='marcxml', set=set, from_date=from_date, **kwargs)
Contributor

this is not very nice. I think it's better to have the same kind of API as XMLFeedSpider, just assigning some variables.
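For instance, something in the spirit of XMLFeedSpider's declarative configuration (a sketch only; the attribute names are assumptions, and OAIPMHSpider would need to read them):

class CDSSpider(OAIPMHSpider):
    """Harvests CDS over OAI-PMH using class-level configuration."""
    name = 'CDS'

    # analogous to XMLFeedSpider's class attributes, instead of a super().__init__ call
    url = 'http://cds.cern.ch/oai2d'
    metadata_prefix = 'marcxml'
    oai_set = 'forINSPIRE'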

@@ -14,7 +14,7 @@
default = hepcrawl.settings

[deploy]
-url = http://scrapyd:6800/
+url = http://localhost:6800/
Contributor

why this change?

Contributor Author

to run it locally (not in docker)

Contributor Author

It just got caught up in my WIP PR.

@szymonlopaciuk force-pushed the cds2 branch 2 times, most recently from 1a529d1 to b500aa0 on December 7, 2017 16:19
@szymonlopaciuk changed the title WIP CDS: harvest directly from OAI-PMH Dec 8, 2017
@szymonlopaciuk force-pushed the cds2 branch 3 times, most recently from c6fbed0 to 8ffacc7 on December 12, 2017 16:11
validate(hep_record, 'hep')
spider.logger.debug('Validated item by Inspire Schemas.')
except Exception as err:
spider.logger.error('ERROR in validating {}: {}'.format(hep_record, err))
Contributor

Make sure that this is needed, and if it is, change the .format to error('message %s', message).
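That is, let the logging framework interpolate lazily:

# eager: the message is built even if this log level ends up filtered out
spider.logger.error('ERROR in validating {}: {}'.format(hep_record, err))

# lazy: interpolation happens only when the record is actually emitted
spider.logger.error('ERROR in validating %s: %s', hep_record, err)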

Contributor

This seemed safe to remove, as I checked and Scrapy does not stop unless a certain number of exceptions is raised (CLOSESPIDER_ERRORCOUNT), which defaults to infinity. However, I removed this and all of the records failed with ValidationError: u'_collections' is a required property. This seems odd, as the record generation is done by DoJSON and the code looks basically the same as the DESY spider's. I will investigate, but maybe you know something about this.

Contributor

Nothing pops up right now, will have to check :/

Contributor

are you using a version of doJSON that includes the CDS conversion? you didn't bump the version in setup.py, so maybe you are using an older version that does no preprocessing for CDS.

Contributor

@michamos I bumped and reinstalled the dependencies; I now have version 57.1.5, no change. Is there any special way I need to be using DoJSON in this case?

Contributor

After discussing IRL, it turns out the new API that includes the CDS conversion was not being used.

from ..utils import ParsedItem

logger = logging.getLogger(__name__)

Contributor

Two blank lines between the imports, the globals, and the defs/classes, and globals should be all caps.

with app.app_context():
json_record = hep.do(record)
base_uri = self.settings['SCHEMA_BASE_URI']
json_record['$schema'] = base_uri + 'hep.json'
Contributor

os.path.join

# -*- coding: utf-8 -*-
#
# This file is part of hepcrawl.
# Copyright (C) 2015, 2016, 2017 CERN.
Contributor

year here is just 2017

from scrapy.http import Request, XmlResponse
from scrapy.selector import Selector
from . import StatefulSpider

Contributor

Two blank lines and uppercase globals here as well.

next harvest.
"""
name = 'OAI-PMH'
granularity = _Granularity.DATE
Contributor

Let's drop the granularity stuff, we can always implement it later if needed.


def _make_alias(self):
return '{url}?metadataPrefix={metadata_prefix}&set={set}'.format(
url=self.url,
Contributor

don't use the url for the alias

self.metadata_prefix = metadata_prefix
self.set = oai_set
self.granularity = granularity
self.alias = alias or self._make_alias()
Contributor

This property is not needed

except Exception:
logger.exception("Error when parsing record")
return None
record = create_record(selector.xpath('.//record').extract()[0])
Contributor

extract()[0] is equivalent to extract_first() (the former raises if the list is empty, the latter returns None by default, which can be overridden).
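For example:

# raises IndexError when nothing matches
record_xml = selector.xpath('.//record').extract()[0]

# returns None (or the given default) when nothing matches
record_xml = selector.xpath('.//record').extract_first(default=None)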

self.alias = alias or self._make_alias()
self.from_date = from_date
self.until_date = until_date
self.record_class = record_class
Contributor

If this record_class is not needed we should remove it.

last_run = json.load(f)
logger.info('Last run file loaded: {}'.format(repr(last_run)))
return last_run
except IOError:
Contributor

Here, only catch the case where the file does not exist (first run); any other error should raise.

Contributor

And if it's not there, raise an exception (a subclass of Exception with a meaningful name), then try/except it in whatever code calls this.
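A sketch of that pattern; the exception name and the path helper below are placeholders, not necessarily what the code ends up with:

class NoLastRunToLoad(Exception):
    """Placeholder name: raised when no last-run file exists yet for this spider."""


# inside the spider (uses os, json and the module-level logger):
def _load_last_run(self):
    file_path = self._last_run_file_path()  # assumed helper returning the file path
    if not os.path.isfile(file_path):
        raise NoLastRunToLoad('no last run recorded yet for %s' % self.alias)
    with open(file_path) as f:
        last_run = json.load(f)
    logger.info('Last run file loaded: %s', repr(last_run))
    return last_run


# and wherever it is called:
try:
    last_run = self._load_last_run()
except NoLastRunToLoad:
    last_run = None  # first run: fall back to from_date / a full harvest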

Signed-off-by: Szymon Łopaciuk <[email protected]>
Harvests CDS through dojson directly: closes inspirehep#199.

Signed-off-by: Szymon Łopaciuk <[email protected]>
Signed-off-by: Szymon Łopaciuk <[email protected]>
Fixes a bug that would raise a TypeError when no last runs were present.

Signed-off-by: Szymon Łopaciuk <[email protected]>
According to the docs, the spider will only be closed after
CLOSESPIDER_ERRORCOUNT exceptions, which by default allows
infinitely many. Thus this is not needed, and it is better if we
know when there are validation issues.

Signed-off-by: Szymon Łopaciuk <[email protected]>
Signed-off-by: Szymon Łopaciuk <[email protected]>
Signed-off-by: Szymon Łopaciuk <[email protected]>
Signed-off-by: Szymon Łopaciuk <[email protected]>
Signed-off-by: Szymon Łopaciuk <[email protected]>
Create an exception for when a last runs file doesn't exist.

Signed-off-by: Szymon Łopaciuk <[email protected]>
Signed-off-by: Szymon Łopaciuk <[email protected]>
self.settings.getdict('MARC_TO_HEP_SETTINGS', {})
)
with app.app_context():
json_record = hep.do(record)
Contributor

this shouldn't use hep.do but the new marcxml2record API. Otherwise, the CDS conversion is not triggered.
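That is, something along these lines (assuming the top-level inspire-dojson helper is the API being referred to):

from inspire_dojson import marcxml2record

# one entry point that picks the right rule set, so the CDS-specific
# conversion is applied instead of calling hep.do() unconditionally
json_record = marcxml2record(record.raw)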

Contributor

I think I also added this in the last PR about arXiv, as it was failing the tests, but 👍 if we don't need it.

@szymonlopaciuk (Contributor)

Superseded by #216
