UTF8 coding problems OAI-PMH? #9987

Closed
lmaylein opened this issue Oct 10, 2023 · 5 comments

@lmaylein
Contributor

What steps does it take to reproduce the issue?

The OAI request
https://heidata.uni-heidelberg.de/oai?verb=ListRecords&resumptionToken=b2Zmc2V0OjozMHxwcmVmaXg6Om9haV9kZGk=
returns XML that is not well-formed. Apparently the apostrophe in the author name of https://heidata.uni-heidelberg.de/dataset.xhtml?persistentId=doi:10.11588/data/10034 (Siang, Ch’ng Kean) is not encoded correctly.

Firefox reports that the XML is not well-formed, and jhove reports an error as well (see the output below).

If I export the metadata directly to an XML format (for example https://heidata.uni-heidelberg.de/api/datasets/export?exporter=Datacite&persistentId=doi%3A10.11588/data/10034), the encoding is apparently correct.


jhove -m XML-hul oai.xml
Jhove (Rel. 1.20.0, 2019-01-19)
 Date: 2023-10-10 08:30:46 CEST
 RepresentationInformation: oai.xml
  ReportingModule: XML-hul, Rel. 1.4 (2007-01-08)
  LastModified: 2023-10-10 08:29:58 CEST
  Size: 75357
  Format: XML
  Status: Not well-formed
  SignatureMatches:
   XML-hul
  ErrorMessage: Invalid byte 1 of 1-byte UTF-8 sequence.: Line = 3, Column = 17702
  MIMEtype: text/xml
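
For anyone who wants to locate the offending byte without jhove, the following is a minimal Python sketch. The helper name `first_invalid_utf8` is made up for illustration, and the column it reports is counted in bytes, so it may not match the column jhove prints.

```python
def first_invalid_utf8(data: bytes):
    """Return (line, column, byte_offset) of the first invalid UTF-8 byte, or None.

    Line and column are 1-based; the column is counted in bytes, not characters.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        offset = err.start
        # Count the newlines before the bad byte to get its line number.
        line = data.count(b"\n", 0, offset) + 1
        # The byte after the previous newline is where the current line starts.
        line_start = data.rfind(b"\n", 0, offset) + 1
        return line, offset - line_start + 1, offset
    return None


# Usage, assuming the OAI response was saved as oai.xml:
#   from pathlib import Path
#   print(first_invalid_utf8(Path("oai.xml").read_bytes()))
```

A result of `None` means the whole file decodes cleanly as UTF-8.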
  • When does this issue occur?

On OAI-PMH requests. Possibly for all UTF-8 code points that are encoded with more than two bytes?

  • Which page(s) does it occur on?

https://heidata.uni-heidelberg.de/oai?verb=ListRecords&resumptionToken=b2Zmc2V0OjozMHxwcmVmaXg6Om9haV9kZGk=

  • What happens?

see above

  • To whom does it occur (all users, curators, superusers)?

All OAI-PMH users

  • What did you expect to happen?

The same encoding as in the frontend and the other exports.

Which version of Dataverse are you using?

5.13

@poikilotherm
Contributor

I suspect this is a duplicate of #9910

An upstream lib fix is in preparation (gdcc/xoai#192), but needs approval from @landreev or @pdurbin. Further feedback on that PR is very welcome!

@landreev
Contributor

landreev commented Oct 11, 2023

Yes, this is definitely a duplicate of #9910.
The diagnostic test is to check whether the multi-byte UTF-8 character in question straddles the 1024-byte offset in the relevant metadata export (oai_ddi in your case):

curl "https://heidata.uni-heidelberg.de/api/datasets/export?exporter=oai_ddi&persistentId=doi%3A10.11588/data/10034" | head -c1024

<codeBook xmlns="ddi:codebook:2_5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:codebook:2_5 https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" version="2.5"><docDscr><citation><titlStmt><titl>On the role of social wage comparisons in gift-exchange experiments [Dataset]</titl><IDNo agency="DOI">doi:10.11588/data/10034</IDNo></titlStmt><distStmt><distrbtr source="archive">heiDATA</distrbtr><distDate>2014-11-05</distDate></distStmt><verStmt source="archive"><version date="2017-04-06" type="RELEASED">2</version></verStmt><biblCit>Siang, Ch’ng Kean; Requate, Till; Waichman, Israel, 2014, "On the role of social wage comparisons in gift-exchange experiments [Dataset]", https://doi.org/10.11588/data/10034, heiDATA, V2</biblCit></citation></docDscr><stdyDscr><citation><titlStmt><titl>On the role of social wage comparisons in gift-exchange experiments [Dataset]</titl><IDNo agency="DOI">doi:10.11588/data/10034</IDNo></titlStmt><rspStmt><AuthEnty>Siang, Ch?
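
The same check can be sketched in Python. This is only an illustration: the function names are hypothetical, the sample bytes are synthetic rather than the real export, and the chunked decoder is an assumption about how such a split can corrupt output, not a copy of the xoai code.

```python
import codecs


def boundary_splits_character(data: bytes, offset: int = 1024) -> bool:
    """Return True if a cut at `offset` lands inside a multi-byte UTF-8 sequence."""
    if offset <= 0 or offset >= len(data):
        return False
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (data[offset] & 0xC0) == 0x80


def decode_in_chunks(data: bytes, size: int = 1024) -> str:
    """Decode every chunk on its own; a character that straddles a boundary is lost."""
    return "".join(
        data[i:i + size].decode("utf-8", errors="replace")
        for i in range(0, len(data), size)
    )


def decode_incrementally(data: bytes, size: int = 1024) -> str:
    """Decode chunk by chunk while carrying incomplete sequences over to the next chunk."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = [decoder.decode(data[i:i + size]) for i in range(0, len(data), size)]
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


# Synthetic stand-in for the export: the three-byte apostrophe starts at byte 1023,
# so its second byte sits exactly at offset 1024.
sample = b"x" * 1023 + "\u2019ng Kean".encode("utf-8")
print(boundary_splits_character(sample))
print("\ufffd" in decode_in_chunks(sample))
print(decode_incrementally(sample) == sample.decode("utf-8"))
```

Any multi-byte character can be hit this way, including two-byte ones; what matters is whether its bytes fall on both sides of the boundary.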

Yes, we have a fix for this in the xoai library - thanks to @poikilotherm! - and we should be able to incorporate it into Dataverse shortly. However, it will only become part of a Dataverse release as of 6.1, which is still a couple of months away.
One way to fix it on your instance before 6.1 becomes available would be to upgrade to 6.0 and then replace one of the xoai jar files that ship with it with the latest, fixed version. (I was considering announcing this problem and this workaround in the Google group once the xoai fix is officially released.)
Unfortunately, the current versions of xoai are not going to work with pre-6.0 versions of Dataverse.

@pdurbin
Member

pdurbin commented Oct 13, 2023

Well, those two guys would know! Closing as a duplicate of #9910.

@pdurbin pdurbin closed this as completed Oct 13, 2023
@landreev
Contributor

#9910 has been fixed and closed. Also, I'd like to point out that what I said earlier ("Unfortunately, the current versions of xoai are not going to work with pre-6.0 versions of Dataverse") was not true; see the linked comment below:

#9910 (comment)

@lmaylein
Contributor Author

Thanks
