Broken OAI records for some datasets with UTF-8 characters #9910

Closed
landreev opened this issue Sep 13, 2023 · 12 comments · Fixed by #10012
Assignees
Labels
Feature: Harvesting pm.GREI-d-2.4.1B NIH AIM:4 YR:2 TASK:1B | 2.4.1B | (started yr1) Resolve OAI-PMH harvesting issues Size: 3 A percentage of a sprint. 2.1 hours.
Milestone

Comments

@landreev
Contributor

landreev commented Sep 13, 2023

(Edit: everything entered in this issue so far was a red herring; I'm deleting everything and rewriting the issue from scratch)

The issue was opened as a follow-up to an RT ticket from an outside institution. They were unable to harvest the full OAI set from IQSS, because the harvest was reliably failing on the same invalid record.

The problem was traced to something inside the OAI library; it is being tracked in gdcc/xoai#188 in the xoai repo. I have a rough idea of how to produce a quick fix.

@cmbz cmbz added the Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) label Sep 13, 2023
@cmbz cmbz added this to the 6.1 milestone Sep 13, 2023
@cmbz cmbz moved this to Release 6.1 Proposals in IQSS Dataverse Project Sep 13, 2023
@cmbz

cmbz commented Sep 13, 2023

2023/09/13: Added to the Global Backlog, with a very strong proposal for inclusion in release 6.1.

@landreev landreev changed the title from "Broken metadata exports, due to a problem with escaping of html tags." to "Broken OAI records for some datasets with UTF-8 characters" Sep 15, 2023
@cmbz cmbz moved this from Release 6.1 Proposals to SPRINT READY in IQSS Dataverse Project Sep 18, 2023
@cmbz

cmbz commented Sep 18, 2023

2023/09/18

  • Moving to sprint ready because work is already underway

@cmbz cmbz added the pm.GREI-d-2.4.1B NIH AIM:4 YR:2 TASK:1B | 2.4.1B | (started yr1) Resolve OAI-PMH harvesting issues label Sep 25, 2023
@cmbz

cmbz commented Sep 25, 2023

2023/09/25: Applied the NIH GREI 2.4.1B tag to reflect value to harvesting.

@scolapasta scolapasta added the Status: Needs Input Applied to issues in need of input from someone currently unavailable label Sep 27, 2023
@landreev
Contributor Author

Much of the discussion related to this bug happened on Slack between @poikilotherm, @jggautier, @scolapasta and me, so just to add a quick summary here:

  • Julian identified the records that were broken, and I put together a quick script that went and padded the metadata exports with the right number of spaces; the total number of affected records was 50-something out of almost 80K published records. This is, obviously, not a fix, but a bandaid (somebody may have already published another dataset with another similarly broken record produced as a result... :( ).
  • I confirmed that the quick xoai patch that I produced (commit in the linked issue) works. So I have a patched xoai jar that can be deployed with 6.0 (I'm planning to do that in the next few days).
  • Oliver had a cleaner/proper solution in mind.
  • Once a proper fix is made on the xoai side and the updated version is released, I'll make a PR here for a pom file update (trivial amount of work).
  • Dataverse harvesters process one record at a time (via ListIdentifiers -> GetRecord), so it only fails for the specific records affected by the bug, when Dataverses harvest from each other. But it is a more serious problem for non-Dataverse harvesters that rely on ListRecords, because the entire harvest fails once it reaches the page that contains an affected record.
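
The difference between the two harvesting strategies in the last bullet can be sketched with a small simulation. This is only an illustration, not the Dataverse or xoai code: the in-memory `RECORDS` table, the page size, and all function names are hypothetical, and no real OAI-PMH requests are made.

```python
class BrokenRecord(Exception):
    """Raised when a record cannot be parsed by the harvester."""


# Hypothetical repository: identifier -> whether its OAI record is valid.
RECORDS = {
    "doi:A": True,
    "doi:B": True,
    "doi:C": False,  # the one record affected by the bug
    "doi:D": True,
    "doi:E": True,
}
PAGE_SIZE = 2  # assumed number of records per ListRecords page


def get_record(identifier):
    """Stand-in for a GetRecord call on a single identifier."""
    if not RECORDS[identifier]:
        raise BrokenRecord(identifier)
    return f"<record>{identifier}</record>"


def list_identifiers():
    """Stand-in for ListIdentifiers: headers only, so never affected."""
    return list(RECORDS)


def list_records_page(offset):
    """Stand-in for one ListRecords page; one bad record breaks the page."""
    ids = list(RECORDS)[offset:offset + PAGE_SIZE]
    return [get_record(i) for i in ids]


def harvest_one_by_one():
    """ListIdentifiers -> GetRecord: a failure is isolated to one record."""
    harvested, failed = [], []
    for identifier in list_identifiers():
        try:
            harvested.append(get_record(identifier))
        except BrokenRecord:
            failed.append(identifier)
    return harvested, failed


def harvest_by_pages():
    """ListRecords: the harvest stops at the first page that fails."""
    harvested = []
    for offset in range(0, len(RECORDS), PAGE_SIZE):
        try:
            harvested.extend(list_records_page(offset))
        except BrokenRecord:
            return harvested, offset  # everything after this page is lost
    return harvested, None
```

In this toy setup the record-by-record harvest loses only `doi:C`, while the paged harvest stops at the second page and never reaches `doi:D` or `doi:E`, even though both are valid.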

@poikilotherm
Contributor

Just to make sure this doesn't get lost: there is an upstream library PR waiting for review, gdcc/xoai#192.

@pdurbin pdurbin added Size: 3 A percentage of a sprint. 2.1 hours. and removed Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) labels Oct 11, 2023
@landreev
Contributor Author

Thank you @poikilotherm, much appreciated! I was a little busy putting out some fires, but looking at the xoai pr now.

@poikilotherm
Contributor

XOAI 5.2.0 is on its way to Central; it might take a moment until it's retrievable.

@landreev
Contributor Author

OK, we'll move this along and update the Dataverse pom file. And then we'll just need to test/QA the result.

@landreev
Contributor Author

Quick question: I've been assuming that the xoai jars v. 5.1+ are no longer compatible with pre-6.0 versions of Dataverse, is that actually correct?
There is this sentence in the 5.1 README:

Switching to Java 17 for compilation and testing, but keeping compatibility with Java 11 for JARs

@poikilotherm
Contributor

Should be compatible.
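
One way to sanity-check the bytecode side of that claim is to read the class-file version of every class in a jar. The sketch below is an assumption about how one might check this, not something done in this thread: the function names are hypothetical, and it only inspects the bytecode level (major version 55 is Java 11, 61 is Java 17), not whether the code calls APIs missing from an older runtime.

```python
import struct
import zipfile

JAVA_11_MAJOR = 55  # class-file major version for Java 11 (Java 17 is 61)


def class_major_version(class_bytes):
    """Return the major version from the 8-byte class-file header."""
    magic, _minor, major = struct.unpack(">IHH", class_bytes[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return major


def max_major_version(jar_file):
    """Return the highest class-file major version found in a jar.

    Entries under META-INF/versions/ are skipped, since multi-release
    jars may ship newer bytecode there that older runtimes ignore.
    """
    highest = 0
    with zipfile.ZipFile(jar_file) as jar:
        for name in jar.namelist():
            if not name.endswith(".class"):
                continue
            if name.startswith("META-INF/versions/"):
                continue
            highest = max(highest, class_major_version(jar.read(name)))
    return highest
```

For a downloaded jar, `max_major_version("xoai-common-5.2.0.jar") <= JAVA_11_MAJOR` would indicate that the classes can at least be loaded by a Java 11 runtime.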

@landreev
Contributor Author

I'm going to make a quick PR incorporating xoai-5.2 into develop.
And I will also try building a patched 5.14, since there appears to be interest in that in the community.

@landreev landreev self-assigned this Oct 15, 2023
@github-project-automation github-project-automation bot moved this from SPRINT READY to Clear of the Backlog in IQSS Dataverse Project Oct 16, 2023
@pdurbin pdurbin removed the Status: Needs Input Applied to issues in need of input from someone currently unavailable label Oct 17, 2023
@landreev
Contributor Author

@DS-INRA (and everybody who may be interested) The linked PR #10012 has been merged, so the bug is now fixed for real in the develop branch.
I have also tested these newly released xoai-5.2 jars with Dataverse 5.14, and confirmed that they appear to be working, fixing this bug there as well.

To clarify, I have tested patching these jars in place, on an instance where the standard 5.14 release was deployed. I haven't tried building dataverse-5.14.war from sources with these libraries, but can't think of a reason why that wouldn't work either. If there is interest, I can produce a patched dataverse-5.14.war as well.

The library jars in question can be found on Maven Central at:
https://repo.maven.apache.org/maven2/io/gdcc/xoai-common/5.2.0/xoai-common-5.2.0.jar
https://repo.maven.apache.org/maven2/io/gdcc/xoai-data-provider/5.2.0/xoai-data-provider-5.2.0.jar
https://repo.maven.apache.org/maven2/io/gdcc/xoai-service-provider/5.2.0/xoai-service-provider-5.2.0.jar
https://repo.maven.apache.org/maven2/io/gdcc/xoai-xmlio/5.2.0/xoai-xmlio-5.2.0.jar
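
The four URLs above follow the standard Maven repository layout (`group/artifact/version/artifact-version.jar`), so they can be generated rather than copied by hand. A minimal sketch, where the helper names are hypothetical and the group, artifact names, and version are taken from the links above:

```python
BASE = "https://repo.maven.apache.org/maven2"
GROUP = "io.gdcc"
VERSION = "5.2.0"
ARTIFACTS = [
    "xoai-common",
    "xoai-data-provider",
    "xoai-service-provider",
    "xoai-xmlio",
]


def jar_url(group, artifact, version, base=BASE):
    """Build a jar URL using the Maven repository directory layout."""
    group_path = group.replace(".", "/")
    return f"{base}/{group_path}/{artifact}/{version}/{artifact}-{version}.jar"


def xoai_jar_urls(version=VERSION):
    """Return the URLs of the four xoai jars for the given version."""
    return [jar_url(GROUP, artifact, version) for artifact in ARTIFACTS]
```

Each URL returned by `xoai_jar_urls()` can then be fetched with, for example, `urllib.request.urlretrieve(url, filename)` from the Python standard library.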
