CDS: harvest directly from OAI-PMH #198

Closed · wants to merge 21 commits

Changes from all commits (21 commits):
fad1b50
create an OAI-PMH spider to use in CDS spider
kaplun Oct 10, 2017
33c3ae5
refactor, test contents
szymonlopaciuk Dec 7, 2017
db2953f
parse_record takes the selector
szymonlopaciuk Dec 8, 2017
4890aa1
spiders: OAI-PMH: continue where left off
szymonlopaciuk Dec 8, 2017
80efc44
use celerymonitor in CDS tests
szymonlopaciuk Dec 12, 2017
adb1906
CDS spider: drop HarvestingKit (#199)
szymonlopaciuk Dec 12, 2017
fff7c95
remove unused import
szymonlopaciuk Dec 12, 2017
4895b07
fix failure on lack of last runs file
szymonlopaciuk Dec 13, 2017
b7c3fc4
remove ignoring the exception on item validation
szymonlopaciuk Dec 13, 2017
bb5c834
style fixes
szymonlopaciuk Dec 13, 2017
acf9125
bump inspire-dojson~=57.0,>=57.1
szymonlopaciuk Dec 14, 2017
9a4f285
remove record_class field, as Record is default
szymonlopaciuk Dec 14, 2017
077c1f1
use os.path.join in cds_spider
szymonlopaciuk Dec 14, 2017
054aa0b
remove url from the last_run file hash
szymonlopaciuk Dec 14, 2017
b3159f7
remove granularity, default to YYYY-MM-DD for now
szymonlopaciuk Dec 14, 2017
10804f7
refactor tests
szymonlopaciuk Dec 14, 2017
5851258
stricter error catching when loading last_runs
szymonlopaciuk Dec 14, 2017
23c3d90
leave only a few test records, remove the rest
szymonlopaciuk Dec 14, 2017
332071f
tests: naming and don't load directly from file
szymonlopaciuk Dec 14, 2017
a96f3c4
make parse_record abstract
szymonlopaciuk Dec 14, 2017
6b7d886
spiders: move Stateful and OAI to common module
szymonlopaciuk Jan 16, 2018
23 changes: 21 additions & 2 deletions docker-compose.test.yml
@@ -18,6 +18,7 @@ services:
       - APP_CRAWLER_HOST_URL=http://scrapyd:6800
       - APP_API_PIPELINE_TASK_ENDPOINT_DEFAULT=hepcrawl.testlib.tasks.submit_results
       - APP_FILES_STORE=/tmp/file_urls
+      - APP_LAST_RUNS_PATH=/code/.scrapy/last_runs
       - APP_CRAWL_ONCE_PATH=/code/.scrapy
       - COVERAGE_PROCESS_START=/code/.coveragerc
       - BASE_USER_UID=${BASE_USER_UID:-1000}
@@ -58,8 +59,11 @@ services:
   functional_cds:
     <<: *service_base
     command: py.test -vv tests/functional/cds
-    links:
-      - scrapyd
+    depends_on:
+      scrapyd:
+        condition: service_healthy
+      cds-http-server.local:
+        condition: service_healthy

   functional_pos:
     <<: *service_base
@@ -129,6 +133,21 @@ services:
         - "CMD-SHELL"
         - "curl https://localhost:443/"

+  cds-http-server.local:
+    image: nginx:stable-alpine
+    volumes:
+      - ${PWD}/tests/functional/cds/fixtures/http_server/conf/proxy.conf:/etc/nginx/conf.d/default.conf
+      - ${PWD}/tests/functional/cds/fixtures/http_server/records:/etc/nginx/html/
+    ports:
+      - 80:80
+    healthcheck:
+      timeout: 5s
+      interval: 5s
+      retries: 5
+      test:
+        - "CMD-SHELL"
+        - "curl http://localhost:80/"
+
   rabbitmq:
     image: rabbitmq
     healthcheck:
22 changes: 22 additions & 0 deletions hepcrawl/downloaders.py
@@ -0,0 +1,22 @@
# -*- coding: utf-8 -*-
#
# This file is part of hepcrawl.
# Copyright (C) 2016, 2017 CERN.
#
# hepcrawl is a free software; you can redistribute it and/or modify it
# under the terms of the Revised BSD License; see LICENSE file for
# more details.

"""Additional downloaders."""


from scrapy.http import Response


class DummyDownloadHandler(object):
    def __init__(self, *args, **kwargs):
        pass

    def download_request(self, request, spider):
        url = request.url
        return Response(url, request=request)

Review discussion attached to this handler:

Contributor: This seems to be needed because for OAI-PMH, you pass oaipmh+http(s) in the URL. Cannot you just pass the real URL starting with http(s) instead and skip this whole download handler business?

Contributor: The issue is that we're using Sickle to do the OAI-PMH work, and it does all of its downloading on its own, only exposing an iterator over the records. Since it doesn't expose any of its requests/responses, there is no nice way to integrate it with Scrapy, so we bypass it.

Contributor (PR author): IMHO nope, because Scrapy would otherwise try to fetch the URL with its default methods, thus not allowing me to pass the ball to Sickle.

michamos (Contributor, Dec 7, 2017): XMLFeedSpider seems to have no such problems... OK, I understand.

Contributor: So IIUC, this whole contraption is done so that in the end, start_requests calls parse. If that's all you want, cannot you call parse from start_requests directly, given that you don't care about the response?

szymonlopaciuk (Contributor, Dec 7, 2017): I was trying something like this yesterday and was having issues. I didn't investigate very deeply, but I think Scrapy needs a request to begin crawling (either through start_requests or a url). It expects start_requests to return a Request and doesn't allow anything else.
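To make the thread concrete, the following is a minimal sketch — an illustration under the assumptions discussed above, not the PR's actual OAIPMHSpider implementation — of how the fake oaipmh+http scheme lets Sickle own the real network traffic. The endpoint and set mirror the CDS spider later in this diff; the spider name and yielded dict are hypothetical.

# A minimal sketch, assuming the division of labour described in the
# thread: Scrapy gets one dummy Request so its engine starts, while
# Sickle performs the actual OAI-PMH HTTP requests.
from scrapy import Request, Spider
from sickle import Sickle


class SketchOAISpider(Spider):
    name = 'sketch-oai'  # hypothetical spider, for illustration only

    def start_requests(self):
        # The engine requires at least one Request; the custom scheme
        # routes it to DummyDownloadHandler, which fabricates an empty
        # Response without touching the network.
        yield Request('oaipmh+http://cds.cern.ch/oai2d',
                      callback=self.parse)

    def parse(self, response):
        # Sickle does the real harvesting, including resumption-token
        # handling, and only exposes an iterator over the records.
        sickle = Sickle(response.url.replace('oaipmh+', ''))
        for record in sickle.ListRecords(metadataPrefix='marcxml',
                                         set='forINSPIRE'):
            yield {'raw': record.raw}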
15 changes: 15 additions & 0 deletions hepcrawl/settings.py
@@ -19,6 +19,8 @@

 from __future__ import absolute_import, division, print_function

+from scrapy.settings import default_settings
+
 import os

@@ -40,6 +42,12 @@
     'http://localhost/schemas/records/'
 )

+# Location of last run information
+LAST_RUNS_PATH = os.environ.get(
+    'APP_LAST_RUNS_PATH',
+    '/var/lib/scrapy/last_runs/'
+)
+
 # Configure maximum concurrent requests performed by Scrapy (default: 16)
 # CONCURRENT_REQUESTS=32

@@ -71,6 +79,13 @@
     'hepcrawl.middlewares.HepcrawlCrawlOnceMiddleware': 100,
 }
+
+# Configure custom downloaders
+# See https://doc.scrapy.org/en/0.20/topics/settings.html#download-handlers
+DOWNLOAD_HANDLERS = {
+    'oaipmh+http': 'hepcrawl.downloaders.DummyDownloadHandler',
+    'oaipmh+https': 'hepcrawl.downloaders.DummyDownloadHandler',
+}

 # Enable or disable downloader middlewares
 # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
 DOWNLOADER_MIDDLEWARES = {
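The bookkeeping that actually uses LAST_RUNS_PATH lives in the new common OAIPMHSpider module, which this page does not include, but several commits (4895b07, 5851258, 054aa0b) describe its behaviour. The following is a hedged sketch of that idea only; the file layout, hash key, and field names are assumptions, not the spider's real on-disk format.

# A hedged sketch of the "continue where left off" bookkeeping under
# LAST_RUNS_PATH; names and layout here are illustrative assumptions.
import hashlib
import json
import os

LAST_RUNS_PATH = '/var/lib/scrapy/last_runs/'


def last_run_path(spider_name, oai_set, metadata_prefix):
    # Commit 054aa0b removed the URL from the hash, so only the OAI set
    # and metadata prefix identify a harvest in this sketch.
    key = '{}&{}'.format(oai_set, metadata_prefix).encode('utf-8')
    file_name = hashlib.sha1(key).hexdigest() + '.json'
    return os.path.join(LAST_RUNS_PATH, spider_name, file_name)


def load_last_run(path):
    # Commits 4895b07 and 5851258 concern exactly this: tolerate a
    # missing file, but catch only the expected errors, not everything.
    try:
        with open(path) as f:
            return json.load(f)
    except (IOError, ValueError):
        return None


def save_last_run(path, finished_at):
    # Persist when the harvest finished so the next run can resume.
    try:
        os.makedirs(os.path.dirname(path))
    except OSError:
        pass  # directory already exists
    with open(path, 'w') as f:
        json.dump({'last_run_finished_at': finished_at}, f)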
8 changes: 0 additions & 8 deletions hepcrawl/spiders/__init__.py
@@ -8,11 +8,3 @@
 # more details.

 from __future__ import absolute_import, division, print_function
-
-from scrapy import Spider
-
-
-class StatefulSpider(Spider):
-    def __init__(self, *args, **kwargs):
-        self.state = {}
-        return super(Spider, self).__init__(*args, **kwargs)
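The relocated hepcrawl/spiders/common/stateful_spider.py is not part of this page, so below is a hedged reconstruction based on the body removed above, not the file's verified contents. Note that the removed code called super(Spider, self).__init__, which skips Spider's own initializer; the sketch uses the conventional super(StatefulSpider, self) instead.

# A hedged reconstruction of the moved class; see
# hepcrawl/spiders/common/stateful_spider.py in the PR for the actual code.
from scrapy import Spider


class StatefulSpider(Spider):
    """A spider carrying a ``state`` dict, e.g. for resumable crawls."""

    def __init__(self, *args, **kwargs):
        self.state = {}
        super(StatefulSpider, self).__init__(*args, **kwargs)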
2 changes: 1 addition & 1 deletion hepcrawl/spiders/alpha_spider.py
@@ -18,7 +18,7 @@
 from scrapy import Request
 from scrapy.spiders import CrawlSpider

-from . import StatefulSpider
+from .common import StatefulSpider
 from ..items import HEPRecord
 from ..loaders import HEPLoader
 from ..utils import (
2 changes: 1 addition & 1 deletion hepcrawl/spiders/aps_spider.py
@@ -18,7 +18,7 @@

 from scrapy import Request

-from . import StatefulSpider
+from .common import StatefulSpider
 from ..items import HEPRecord
 from ..loaders import HEPLoader
 from ..utils import (
2 changes: 1 addition & 1 deletion hepcrawl/spiders/arxiv_spider.py
@@ -16,7 +16,7 @@
 from scrapy import Request, Selector
 from scrapy.spiders import XMLFeedSpider

-from . import StatefulSpider
+from .common import StatefulSpider
 from ..items import HEPRecord
 from ..loaders import HEPLoader
 from ..mappings import CONFERENCE_WORDS, THESIS_WORDS
2 changes: 1 addition & 1 deletion hepcrawl/spiders/base_spider.py
@@ -16,7 +16,7 @@
 from scrapy import Request
 from scrapy.spiders import XMLFeedSpider

-from . import StatefulSpider
+from .common import StatefulSpider
 from ..items import HEPRecord
 from ..loaders import HEPLoader
 from ..utils import (
2 changes: 1 addition & 1 deletion hepcrawl/spiders/brown_spider.py
@@ -19,7 +19,7 @@
 from scrapy import Request
 from scrapy.spiders import CrawlSpider

-from . import StatefulSpider
+from .common import StatefulSpider
 from ..items import HEPRecord
 from ..loaders import HEPLoader
 from ..utils import (
84 changes: 36 additions & 48 deletions hepcrawl/spiders/cds_spider.py
@@ -9,67 +9,55 @@

 """Spider for the CERN Document Server OAI-PMH interface"""

-from scrapy.spider import XMLFeedSpider
-from scrapy import Request
-from harvestingkit.inspire_cds_package.from_cds import CDS2Inspire
-from harvestingkit.bibrecord import (
-    create_record as create_bibrec,
-    record_xml_output,
-)
-from dojson.contrib.marc21.utils import create_record
-from inspire_dojson.hep import hep
+import logging
+from flask.app import Flask
+from inspire_dojson import marcxml2record
+from os.path import join as path_join

-from . import StatefulSpider
+from .common import OAIPMHSpider
 from ..utils import ParsedItem


-class CDSSpider(StatefulSpider, XMLFeedSpider):
+LOGGER = logging.getLogger(__name__)
+
+
+class CDSSpider(OAIPMHSpider):
     """Spider for crawling the CERN Document Server OAI-PMH XML files.

     Example:
         Using OAI-PMH XML files::

-            $ scrapy crawl \\
-                cds \\
-                -a "source_file=file://$PWD/tests/functional/cds/fixtures/oai_harvested/cds_smoke_records.xml"
+            $ scrapy crawl CDS \\
+                -a "oai_set=forINSPIRE" -a "from_date=2017-10-10"

-    It uses `HarvestingKit <https://pypi.python.org/pypi/HarvestingKit>`_ to
-    translate from CDS's MARCXML into INSPIRE Legacy's MARCXML flavor. It then
-    employs `inspire-dojson <https://pypi.python.org/pypi/inspire-dojson>`_ to
-    transform the legacy INSPIRE MARCXML into the new INSPIRE Schema.
+    It uses `inspire-dojson <https://pypi.python.org/pypi/inspire-dojson>`_ to
+    translate from CDS's MARCXML into the new INSPIRE Schema.
     """

     name = 'CDS'
-    iterator = 'xml'
-    itertag = 'OAI-PMH:record'
-    namespaces = [
-        ('OAI-PMH', 'http://www.openarchives.org/OAI/2.0/'),
-        ('marc', 'http://www.loc.gov/MARC21/slim'),
-    ]

-    def __init__(self, source_file=None, **kwargs):
-        super(CDSSpider, self).__init__(**kwargs)
-        self.source_file = source_file
-
-    def start_requests(self):
-        yield Request(self.source_file)
+    def __init__(self,
+                 oai_endpoint='http://cds.cern.ch/oai2d',
+                 from_date=None,
+                 oai_set="forINSPIRE",
+                 *args, **kwargs):
+        super(CDSSpider, self).__init__(
+            url=oai_endpoint,
+            metadata_prefix='marcxml',
+            oai_set=oai_set,
+            from_date=from_date,
+            **kwargs
+        )

-    def parse_node(self, response, node):
-        node.remove_namespaces()
-        cds_bibrec, ok, errs = create_bibrec(
-            node.xpath('.//record').extract()[0]
+    def parse_record(self, selector):
+        selector.remove_namespaces()
+        record = selector.xpath('.//record').extract_first()
+        app = Flask('hepcrawl')
+        app.config.update(
+            self.settings.getdict('MARC_TO_HEP_SETTINGS', {})
         )
-        if not ok:
-            raise RuntimeError("Cannot parse record %s: %s", node, errs)
-        self.logger.info("Here's the record: %s" % cds_bibrec)
-        inspire_bibrec = CDS2Inspire(cds_bibrec).get_record()
-        marcxml_record = record_xml_output(inspire_bibrec)
-        record = create_record(marcxml_record)
-        json_record = hep.do(record)
-        base_uri = self.settings['SCHEMA_BASE_URI']
-        json_record['$schema'] = base_uri + 'hep.json'
-        parsed_item = ParsedItem(
-            record=json_record,
-            record_format='hep',
-        )
-        return parsed_item
+        with app.app_context():
+            json_record = marcxml2record(record)
+            base_uri = self.settings['SCHEMA_BASE_URI']
+            json_record['$schema'] = path_join(base_uri, 'hep.json')
+            return ParsedItem(record=json_record, record_format='hep')
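The diff above shows the whole surface a subclass needs: the constructor arguments forwarded to OAIPMHSpider (url, metadata_prefix, oai_set, from_date) and the parse_record(selector) hook made abstract in commit a96f3c4. As a sketch only — the endpoint, set name, and XPath below are hypothetical — a new harvester on the same base class would look like this:

# A hypothetical harvester on the new OAIPMHSpider base class, placed
# under hepcrawl/spiders/; endpoint, set, and XPath are illustrative.
from .common import OAIPMHSpider
from ..utils import ParsedItem


class ExampleOAISpider(OAIPMHSpider):
    name = 'example'

    def __init__(self, from_date=None, *args, **kwargs):
        super(ExampleOAISpider, self).__init__(
            url='http://repository.example.org/oai2d',  # hypothetical
            metadata_prefix='marcxml',
            oai_set='exampleSet',  # hypothetical
            from_date=from_date,
            **kwargs
        )

    def parse_record(self, selector):
        # parse_record is abstract in OAIPMHSpider and receives a Scrapy
        # Selector wrapping a single OAI-PMH <record> element.
        selector.remove_namespaces()
        title = selector.xpath(
            './/datafield[@tag="245"]/subfield[@code="a"]/text()'
        ).extract_first()
        return ParsedItem(
            record={'titles': [{'title': title}]},
            record_format='hep',
        )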
13 changes: 13 additions & 0 deletions hepcrawl/spiders/common/__init__.py
@@ -0,0 +1,13 @@
# -*- coding: utf-8 -*-
#
# This file is part of hepcrawl.
# Copyright (C) 2015, 2016, 2017, 2018 CERN.
#
# hepcrawl is a free software; you can redistribute it and/or modify it
# under the terms of the Revised BSD License; see LICENSE file for
# more details.

from __future__ import absolute_import, division, print_function

from .oaipmh_spider import OAIPMHSpider
from .stateful_spider import StatefulSpider