Directly harvest oai sources #197

Closed
david-caro opened this issue Dec 6, 2017 · 3 comments


@david-caro
Contributor

Currently, to harvest OAI services we rely on an app like inspire running invenio-oaiharvester, saving the results as XML files on disk, and then harvesting those files. It would be really nice if we could instead harvest directly from the OAI services.

Expected Behavior

For example, for the arXiv spider, instead of harvesting from a file, we should be able to pass a server (like http://export.arxiv.org/oai2) and the parameters for the harvest (like the from and to dates and the sets to take into account).

As an example of similar functionality, you can check the inspirehep oaiharvester harvest command for its parameters. We can also use https://sickle.readthedocs.io/en/latest/index.html to do the OAI protocol handling.
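
To make the request concrete, here is a minimal sketch of what passing a server plus from/to dates and sets could look like with Sickle. The function names (list_records_params, harvest) and the oai_dc default prefix are assumptions for illustration, not hepcrawl's actual API. OAI-PMH calls the "to" date "until", and a ListRecords request takes a single set, so the sketch issues one request per set.

```python
from datetime import date


def list_records_params(metadata_prefix="oai_dc", from_date=None,
                        until_date=None, oai_set=None):
    """Build OAI-PMH ListRecords arguments, leaving out the unset ones."""
    params = {"metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date.isoformat()  # YYYY-MM-DD
    if until_date is not None:
        params["until"] = until_date.isoformat()
    if oai_set is not None:
        params["set"] = oai_set
    return params


def harvest(url, sets=(None,), **kwargs):
    """Yield raw XML records from `url`, one ListRecords request per set.

    Hypothetical helper: extra keyword arguments are forwarded to
    list_records_params. The default sets=(None,) means "no set filter".
    """
    # Third-party dependency, imported lazily so that the parameter
    # builder above can be used without Sickle installed.
    from sickle import Sickle

    client = Sickle(url)
    for oai_set in sets:
        params = list_records_params(oai_set=oai_set, **kwargs)
        # "from" is a Python keyword, so the arguments are passed as a dict.
        for record in client.ListRecords(**params):
            yield record.raw
```

A call such as harvest("http://export.arxiv.org/oai2", sets=["physics:hep-th"], from_date=date(2017, 12, 1)) would then iterate over the matching records as XML strings; the set name here is only an example value.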

Current Behavior

There's no support for harvesting directly from OAI-enabled services.

@michamos
Contributor

michamos commented Dec 7, 2017

#198 looks like @kaplun's attempt at solving this issue.

@kaplun
Contributor

kaplun commented Dec 7, 2017

Yeah, I was just writing this to @david-caro on gitter: for the OAI-PMH crawler see #198.
I never sent it for integration because I wasn't sure how to test it.
Basically, the OAI-PMH spider is embedded directly into our hepcrawl project (not assuming it will be integrated into scrapy; I noticed you were pointing Szymon to it).
The PR is working. It's just not complete with respect to tests.
It also shows how to use OAI-PMH (for CDS, but it would be valid for arXiv too).

szymonlopaciuk pushed a commit to kaplun/hepcrawl that referenced this issue Dec 12, 2017
Also remove old tests. Fixes inspirehep#197.

Co-authored-by: Samuele Kaplun <[email protected]>
Signed-off-by: Szymon Łopaciuk <[email protected]>
szymonlopaciuk pushed a commit to kaplun/hepcrawl that referenced this issue Dec 14, 2017
Also remove old tests. Fixes inspirehep#197.

Co-authored-by: Samuele Kaplun <[email protected]>
Signed-off-by: Szymon Łopaciuk <[email protected]>
szymonlopaciuk pushed a commit to kaplun/hepcrawl that referenced this issue Jan 16, 2018
Also remove old tests. Fixes inspirehep#197.

Co-authored-by: Samuele Kaplun <[email protected]>
Signed-off-by: Szymon Łopaciuk <[email protected]>
szymonlopaciuk pushed a commit to szymonlopaciuk/hepcrawl that referenced this issue Jan 17, 2018
Also remove old tests. Fixes inspirehep#197.

Co-authored-by: Samuele Kaplun <[email protected]>
Signed-off-by: Szymon Łopaciuk <[email protected]>
szymonlopaciuk pushed a commit to szymonlopaciuk/hepcrawl that referenced this issue Jan 24, 2018
Also remove old tests. Fixes inspirehep#197.

Signed-off-by: Szymon Łopaciuk <[email protected]>
szymonlopaciuk pushed a commit to szymonlopaciuk/hepcrawl that referenced this issue Jan 26, 2018
Also remove old tests. Fixes inspirehep#197.

Signed-off-by: Szymon Łopaciuk <[email protected]>
@szymonlopaciuk
Contributor

Closed by #203
