CDS: harvest directly from OAI-PMH #198
Conversation
@szymonlopaciuk here's the same code we talked about yesterday. It still requires changes though, like the state management and such.
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
now = datetime.utcnow()
request = Request('oaipmh+{}'.format(self.url), self.parse)
yield request
self.state[self.alias] = self._format_date(now)
```
@david-caro it does support state management, provided the spider is launched with the appropriate flag.
The state is not persisted; it only exists in this instance of the spider (it should also be using the `StatefulSpider`). We have to find a way to properly persist it. The one we were using before (enabling the job dir, https://doc.scrapy.org/en/latest/topics/jobs.html#keeping-persistent-state-between-batches) does not work for parallel runs on scrapyd: the job dir is shared between the parallel runs of the spider, and they hit race conditions when reading/writing.
cannot you adapt the crawl-once thing to work here?
crawl-once uses a local db to store key-values; maybe we would be able to use it for this, though I'm not sure whether that would be abusing it too much.
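For reference, one way out (roughly where the thread heads later, with the last-run file) is to persist a small JSON file per spider and OAI set instead of sharing Scrapy's jobdir. A sketch only; `_last_run_file_path` is a hypothetical helper deriving a unique path from the spider name, endpoint, and set:

```python
import json
import os

# Sketch: persist when the harvest started, so the next run can resume
# from that date. Each spider/endpoint/set combination gets its own file,
# so parallel scrapyd runs do not race on a shared jobdir.
def _save_run(self, started_at):
    last_run_info = {
        'spider': self.name,
        'url': self.url,
        'last_run_started_at': started_at.isoformat(),
    }
    file_path = self._last_run_file_path()  # hypothetical helper
    dir_path = os.path.dirname(file_path)
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)
    with open(file_path, 'w') as f:
        json.dump(last_run_info, f)
```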
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
name = 'OAI-PMH'
state = {}

def __init__(self, url, metadata_prefix='marcxml', set=None, alias=None, from_date=None, until_date=None, granularity='YYYY-MM-DD', record_class=Record, *args, **kwargs):
```
`set` is a builtin; this should be `set_`.
Maybe it should be `sets`, in the plural?
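Either way, the shadowing is easy to avoid by renaming only the keyword argument; a minimal sketch (`oai_set` is the name that appears later in this thread):

```python
def __init__(self, url, metadata_prefix='marcxml', oai_set=None,
             from_date=None, until_date=None, *args, **kwargs):
    # only the keyword argument is renamed; storing it as an instance
    # attribute named `set` does not shadow the builtin
    self.set = oai_set
```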
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
    else:
        raise RuntimeError("Invalid granularity: %s" % self.granularity)

def _make_alias(self):
```
Why is this needed? You can have `tuple`s as keys of `dict`s.
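For instance, instead of building a string alias, the state dict could be keyed directly on a tuple (a sketch, using names from this diff):

```python
# tuples are hashable, so they work as dict keys out of the box
state = {}
state[(url, metadata_prefix, oai_set)] = last_run_date
```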
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
    'GetRecord': self.record_class,
})
try:
    records = sickle.ListRecords(**{
```
Why not simply `key=value` instead of passing a `dict`?
`from` is a reserved keyword.
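That is, the dict unpacking is there because the OAI-PMH parameter is literally named `from`, which Python won't accept as a keyword argument. A minimal illustration:

```python
# sickle.ListRecords(from=self.from_date)  # SyntaxError: invalid syntax
records = sickle.ListRecords(**{
    'metadataPrefix': self.metadata_prefix,
    'set': self.set,
    'from': self.from_date,   # the reserved keyword forces the dict form
    'until': self.until_date,
})
```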
hepcrawl/spiders/oaipmh_spider.py
Outdated
""" | ||
return record.xml | ||
|
||
def parse(self, response): |
`response` seems not to be used. Is it receiving no useful info from `start_requests`?
This is the dummy response that is then ignored.
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
elif self.granularity == 'YYYY-MM-DDThh:mm:ssZ':
    return datetime_object.strftime('%Y-%m-%dT%H:%M:%SZ')
else:
    raise RuntimeError("Invalid granularity: %s" % self.granularity)
```
I would say that this is a `ValueError`.
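That is, since a bad granularity is an invalid argument value rather than a generic runtime failure:

```python
raise ValueError('Invalid granularity: %s' % self.granularity)
```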
```python
from scrapy.http import Response


class DummyDownloadHandler(object):
```
This seems to be needed because, for OAI-PMH, you pass `oaipmh+http(s)` in the URL. Cannot you just pass the real URL starting with `http(s)` instead and skip this whole download handler business?
The issue is that we're using Sickle to do the OAI stuff, and it does all of its downloading on its own, only exposing an iterator over the records. Since it doesn't expose any of its requests/responses, there is no nice way to integrate it with Scrapy, so we bypass it.
IMHO nope, because `scrapy` would otherwise try to fetch the URL with its default methods, thus not allowing me to pass the ball to `sickle`.
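For context, the handler itself can stay tiny; a sketch of the pattern (the settings module path here is an assumption):

```python
from scrapy.http import Response


class DummyDownloadHandler(object):
    """Registered for the oaipmh+http(s) schemes.

    Instead of fetching anything, it hands back an empty Response so the
    crawl proceeds to the callback, where Sickle does the real HTTP.
    """

    def __init__(self, *args, **kwargs):
        pass

    def download_request(self, request, spider):
        return Response(request.url, request=request)


# in settings.py (hypothetical module path):
DOWNLOAD_HANDLERS = {
    'oaipmh+http': 'hepcrawl.downloaders.DummyDownloadHandler',
    'oaipmh+https': 'hepcrawl.downloaders.DummyDownloadHandler',
}
```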
OK, I understand. `XMLFeedSpider` seems to have no such problems...
So IIUC, this whole contraption is done so that in the end, `start_requests` calls `parse`. If that's all you want, cannot you call `parse` from `start_requests` directly, given that you don't care about the response?
I was trying something like this yesterday and was having issues. I didn't investigate very deeply, but I think Scrapy needs a request to begin crawling (either through `start_requests` or `url`). It expects `start_requests` to return a `Request` and doesn't allow anything else.
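And that constraint is exactly what the dummy scheme satisfies: `start_requests` yields a single placeholder `Request` whose response is then ignored, as in the diff above:

```python
def start_requests(self):
    # the custom scheme routes this to DummyDownloadHandler, so no real
    # HTTP happens here; parse() then drives Sickle over the records
    yield Request('oaipmh+{}'.format(self.url), self.parse)
```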
hepcrawl/spiders/cds_spider.py
Outdated
```python
def parse_record(self, record):
    response = XmlResponse(self.url, encoding='utf-8', body=record.raw)
    selector = Selector(response, type='xml')
    selector.remove_namespaces()
```
These three lines should probably be part of the `OAIPMHSpider`.
Not sure if in general one wants to always remove namespaces.
OK, the first two lines then :)
hepcrawl/spiders/cds_spider.py
Outdated
```python
def start_requests(self):
    yield Request(self.source_file)

def __init__(self, from_date=None, set="forINSPIRE", *args, **kwargs):
    super(CDSSpider, self).__init__(url='http://cds.cern.ch/oai2d', metadata_prefix='marcxml', set=set, from_date=from_date, **kwargs)
```
This is not very nice. I think it's better to have the same kind of API as `XMLFeedSpider`, just assigning some variables.
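That is, subclasses would mostly declare class attributes rather than forward a long `__init__` call; a sketch of the suggested shape (the attribute names are assumptions):

```python
class CDSSpider(OAIPMHSpider):
    """Spider harvesting CDS via OAI-PMH, declared XMLFeedSpider-style."""

    name = 'CDS'
    url = 'http://cds.cern.ch/oai2d'
    metadata_prefix = 'marcxml'
    oai_set = 'forINSPIRE'
```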
hepcrawl/scrapy.cfg
Outdated
```diff
@@ -14,7 +14,7 @@
 default = hepcrawl.settings

 [deploy]
-url = http://scrapyd:6800/
+url = http://localhost:6800/
```
why this change?
to run it locally (not in docker)
It was just caught in my WIP-PR
Force-pushed from 1a529d1 to b500aa0.
Force-pushed from c6fbed0 to 8ffacc7.
hepcrawl/pipelines.py
Outdated
```python
    validate(hep_record, 'hep')
    spider.logger.debug('Validated item by Inspire Schemas.')
except Exception as err:
    spider.logger.error('ERROR in validating {}: {}'.format(hep_record, err))
```
Make sure that this is needed, and if it is, change the `.format` to the `error('message %s', message)` form.
This seemed safe to remove, as I checked and Scrapy does not stop unless a certain number of exceptions is raised (`CLOSESPIDER_ERRORCOUNT`), which by default allows infinitely many. However, I removed this and all of the records failed with `ValidationError: u'_collections' is a required property`. This seems odd, as the record generation is done by DoJSON and the code looks basically the same as the DESY spider. I will investigate, but maybe you know something about this.
Nothing pops up right now, will have to check :/
Are you using a version of DoJSON that includes the CDS conversion? You didn't bump the version in setup.py, so maybe you are using an older version that does no preprocessing for CDS.
@michamos I bumped and reinstalled the dependencies; I now have version 57.1.5, no change. Is there any special way I need to be using DoJSON in this case?
After discussing IRL, it turns out the new API that includes the CDS conversion was not used.
hepcrawl/spiders/cds_spider.py
Outdated
```python
from ..utils import ParsedItem

logger = logging.getLogger(__name__)
```
Two blank lines between imports, globals, and defs/classes; and globals should be all caps.
hepcrawl/spiders/cds_spider.py
Outdated
```python
with app.app_context():
    json_record = hep.do(record)
base_uri = self.settings['SCHEMA_BASE_URI']
json_record['$schema'] = base_uri + 'hep.json'
```
`os.path.join`.
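That is, instead of string concatenation, something like the sketch below; on POSIX, `os.path.join` produces a single slash between the parts whether or not `base_uri` has a trailing one:

```python
import os

json_record['$schema'] = os.path.join(base_uri, 'hep.json')
```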
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
# -*- coding: utf-8 -*-
#
# This file is part of hepcrawl.
# Copyright (C) 2015, 2016, 2017 CERN.
```
The year here is just 2017.
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
from scrapy.http import Request, XmlResponse
from scrapy.selector import Selector
from . import StatefulSpider
```
Two blank lines and uppercase globals here as well.
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
    next harvest.
    """
    name = 'OAI-PMH'
    granularity = _Granularity.DATE
```
Let's drop the granularity stuff, we can always implement it later if needed.
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
def _make_alias(self):
    return '{url}?metadataPrefix={metadata_prefix}&set={set}'.format(
        url=self.url,
```
Don't use the URL for the alias.
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
self.metadata_prefix = metadata_prefix
self.set = oai_set
self.granularity = granularity
self.alias = alias or self._make_alias()
```
This property is not needed
hepcrawl/spiders/cds_spider.py
Outdated
```python
    except Exception:
        logger.exception("Error when parsing record")
        return None
    record = create_record(selector.xpath('.//record').extract()[0])
```
`extract()[0]` is equivalent to `extract_first()`, except that the former raises if the list is empty, while the latter returns `None` (which can be overridden).
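Side by side:

```python
# raises IndexError when nothing matches:
record = create_record(selector.xpath('.//record').extract()[0])
# returns None instead, or a default of your choosing:
record = create_record(selector.xpath('.//record').extract_first())
record = create_record(selector.xpath('.//record').extract_first(default='<record/>'))
```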
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
self.alias = alias or self._make_alias()
self.from_date = from_date
self.until_date = until_date
self.record_class = record_class
```
If this `record_class` is not needed, we should remove it.
hepcrawl/spiders/oaipmh_spider.py
Outdated
```python
    last_run = json.load(f)
    logger.info('Last run file loaded: {}'.format(repr(last_run)))
    return last_run
except IOError:
```
Here, check only whether the file exists (first run); any other error should raise.
And if it's not there, raise an exception (a subclass of `Exception` with a meaningful name), then try-catch it in whatever calls this.
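A sketch of that shape (the exception name and the `_last_run_file_path` helper are assumptions; the commit below only says such an exception was created):

```python
import errno
import json


class NoLastRunToLoad(Exception):
    """Raised when the last-run file for this spider and set is missing."""


def _load_last_run(self):
    file_path = self._last_run_file_path()  # hypothetical helper
    try:
        with open(file_path) as f:
            return json.load(f)
    except IOError as e:
        if e.errno == errno.ENOENT:
            # first run: the file simply does not exist yet
            raise NoLastRunToLoad('No last run file at %s' % file_path)
        raise  # any other I/O problem should propagate


# ...and in the caller, a first run is handled explicitly:
try:
    last_run = self._load_last_run()
except NoLastRunToLoad:
    last_run = None
```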
Harvests CDS through dojson directly: closes inspirehep#199. Signed-off-by: Szymon Łopaciuk <[email protected]>
Fixes a bug that would raise a TypeError when no last run was present. Signed-off-by: Szymon Łopaciuk <[email protected]>
According to the docs, the spider will only be closed after CLOSESPIDER_ERRORCOUNT exceptions, which by default allows infinitely many. Thus this is not needed, and it's better if we know when there are validation issues. Signed-off-by: Szymon Łopaciuk <[email protected]>
Create an exception for when a last run file doesn't exist. Signed-off-by: Szymon Łopaciuk <[email protected]>
hepcrawl/spiders/cds_spider.py
Outdated
```python
    self.settings.getdict('MARC_TO_HEP_SETTINGS', {})
)
with app.app_context():
    json_record = hep.do(record)
```
This shouldn't use `hep.do` but the new `marcxml2record` API. Otherwise, the CDS conversion is not triggered.
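For reference, a sketch of the newer entry point (assuming a current inspire-dojson, which exposes `marcxml2record` at the top level):

```python
from inspire_dojson import marcxml2record

# unlike hep.do on an already-parsed record, marcxml2record takes the raw
# MARCXML string and applies the CDS preprocessing before the hep rules
json_record = marcxml2record(record.raw)
```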
I think I also added this in the last PR about arXiv, as it was failing the tests, but 👍 if we don't need it.
Superseded by #216.
This depends on #200.
Description
This implements a universal OAI-PMH spider to be used for harvesting records from OAI sources, in this instance from CDS. It drops the HarvestingKit dependency from the CDS spider, as inspirehep/inspire-dojson#165 added support to harvest CDS directly.
Issues
Closes #197 and #199.