XML declaration missing from OAI when using oai_ddi #10329

Closed
plesubc opened this issue Feb 16, 2024 · 6 comments
Labels
Feature: Harvesting Size: 3 A percentage of a sprint. 2.1 hours. Type: Bug a defect

Comments

@plesubc

plesubc commented Feb 16, 2024

XML declarations missing from metadataPrefix=oai_ddi records.

What steps does it take to reproduce the issue?

An OAI harvest on a record using oai_ddi. Generic example: https://[DV_URL]/oai?verb=GetRecord&metadataPrefix=oai_ddi&identifier=doi:10.5683/SUM/IDENT
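For anyone reproducing this, the request URL can be assembled programmatically. The following is a minimal sketch, not part of Dataverse: the function name `get_record_url` and the host `dataverse.example.org` are placeholders made up for illustration. It percent-encodes the identifier, since `:` and `/` are reserved characters in a query string.

```python
from urllib.parse import urlencode


def get_record_url(base_url: str, identifier: str, metadata_prefix: str = "oai_ddi") -> str:
    """Build an OAI-PMH GetRecord URL for a single record (hypothetical helper)."""
    query = urlencode({
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": identifier,  # urlencode percent-encodes ':' and '/'
    })
    return f"{base_url.rstrip('/')}/oai?{query}"


# Placeholder host and the generic identifier from the example above.
url = get_record_url("https://dataverse.example.org", "doi:10.5683/SUM/IDENT")
print(url)
```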

  • When does this issue occur?

On OAI harvest as above.

  • Which page(s) does it occur on?

On every occurrence.

  • What happens?

The record is missing the mandatory XML declaration required by section 3.2.1 of the OAI-PMH spec: https://www.openarchives.org/OAI/openarchivesprotocol.html

Because of this, records containing non-ASCII characters may fail with "XML Parsing Error: not well-formed", which causes problems with OAI harvest.

  • To whom does it occur (all users, curators, superusers)?

This would (presumably) affect all records which contain characters outside of ISO-8859-1

  • What did you expect to happen?

XML was expected to be generated without error (notably the DDI export found in the API and Dataverse GUI contains an XML declaration).

Which version of Dataverse are you using?

v5.13 (at https://borealisdata.ca)


As an example of this, here is the output of
https://borealisdata.ca/oai?verb=GetRecord&identifier=doi:10.5683/SP2/NEPRTA&metadataPrefix=oai_ddi (2024-02-16, probably repaired by the time you see it), original record at

https://borealisdata.ca/dataset.xhtml?persistentId=doi:10.5683/SP2/NEPRTA&version=1.0

XML Parsing Error: not well-formed
Location: https://borealisdata.ca/oai?verb=GetRecord&identifier=doi:10.5683/SP2/NEPRTA&metadataPrefix=oai_ddi
Line Number 1, Column 1622:

The character which causes the failure is the single typographic quote in the title: https://www.codetable.net/decimal/8217

Note that the content of the page is as follows, and is missing the XML declaration:

<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2024-02-16T19:15:58Z</responseDate><request verb="GetRecord" identifier="doi:10.5683/SP2/NEPRTA" metadataPrefix="oai_ddi">https://borealisdata.ca/oai</request><GetRecord><record><header><identifier>doi:10.5683/SP2/NEPRTA</identifier><datestamp>2023-02-02T07:00:48Z</datestamp><setSpec>SP</setSpec><setSpec>sp_dataverse</setSpec><setSpec>ubc_dataverse</setSpec></header><metadata><codeBook xmlns="ddi:codebook:2_5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:codebook:2_5 https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" version="2.5"><docDscr><citation><titlStmt><titl>Data from: The ‘filtering’ metaphor revisited: competition and environment jointly structure invasibility and coexistence</titl><IDNo agency="DOI">doi:10.5683/SP2/NEPRTA</IDNo></titlStmt><distStmt><distrbtr source="archive">Borealis</distrbtr><distDate>2021-05-19</distDate></distStmt><verStmt source="archive"><version date="2021-05-19" type="RELEASED">1</version></verStmt><biblCit>Germain, Rachel M.; Mayfield, Margaret M.; Gilbert, Benjamin, 2018, "Data from: The ‘filtering’ metaphor revisited: competition and environment jointly structure invasibility and coexistence", https://doi.org/10.5683/SP2/NEPRTA, Borealis, V1, UNF:6:kxNp8p/4jEx8g19DieTKdA== [fileUNF]</biblCit></citation></docDscr><stdyDscr><citation><titlStmt><titl>Data from: The ‘filtering��� metaphor revisited: competition and environment jointly structure invasibility and coexistence</titl><IDNo agency="DOI">doi:10.5683/SP2/NEPRTA</IDNo><IDNo agency="Dryad">doi:10.5061/dryad.41752p5</IDNo></titlStmt><rspStmt><AuthEnty affiliation="University of British Columbia">Germain, Rachel M.</AuthEnty><AuthEnty affiliation="University of Queensland">Mayfield, Margaret 
M.</AuthEnty><AuthEnty affiliation="University of Toronto">Gilbert, Benjamin</AuthEnty></rspStmt><prodStmt/><distStmt><distrbtr source="archive">Borealis</distrbtr><contact>UBC Library Research Data Team</contact><distDate>2018-07-30</distDate><depDate>2020-06-30</depDate></distStmt><holdings URI="https://doi.org/10.5683/SP2/NEPRTA"/></citation><stdyInfo><subject><keyword xml:lang="en">Other</keyword><keyword vocab="Dryad">annual plants</keyword><keyword vocab="Dryad">fitness differences</keyword><keyword vocab="Dryad">Holocene</keyword></subject><abstract date="2020-06-30">&lt;b>Abstract&lt;/b>&lt;br/>‘Filtering’, or the reduction in species diversity that occurs because not all species can persist in all locations, is thought to unfold hierarchically, controlled by the environment at large scales and competition at small scales. However, the ecological effects of competition and the environment are not independent, and observational approaches preclude investigation into their interplay. We use a demographic approach with 30 plant species to experimentally test (i) the effect of competition on species persistence in two soil moisture environments, and (ii) the effect of environmental conditions on mechanisms underlying competitive coexistence. We find that competitors cause differential species persistence across environments even when effects are lacking in the absence of competition, and that the traits that determine persistence depend on the competitive environment. If our study had been observational and trait-based, we would have erroneously concluded that the environment filters species with low biomass, shallow roots, and small seeds. Changing environmental conditions generated idiosyncratic effects on coexistence outcomes, increasing competitive exclusion of some species while promoting coexistence of others. 
Our results highlight the importance of considering environmental filtering in light of, rather than in isolation from, competition, and challenge community assembly models and approaches to projecting future species distributions.</abstract><abstract date="2020-06-30">&lt;b>Usage notes&lt;/b>&lt;br />&lt;div class="o-metadata__file-usage-entry">&lt;h4 class="o-heading__level3-file-title">Germain BL data&lt;/h4>&lt;div class="o-metadata__file-description">First worksheet includes the demographic data, second worksheet the trait data. Species codes are expanded in the supplementary materials.&lt;/div>&lt;div class="o-metadata__file-name">&lt;/div>&lt;/div></abstract><sumDscr><geogCover>California</geogCover></sumDscr><notes>&lt;p>&lt;b>Dryad version number:&lt;/b> 1&lt;/p>
&lt;p>&lt;b>Version status:&lt;/b> submitted&lt;/p>
&lt;p>&lt;b>Dryad curation status:&lt;/b> Published&lt;/p>
&lt;p>&lt;b>Sharing link:&lt;/b> https://datadryad.org/stash/share/bEgp01tBpt-ctVM-ZfFa0KdOQT1nXE5FT-DnIRgymho&lt;/p>
&lt;p>&lt;b>Storage size:&lt;/b> 45413&lt;/p>
&lt;p>&lt;b>Visibility:&lt;/b> public&lt;/p></notes></stdyInfo><method><dataColl><sources/></dataColl><anlyInfo/></method><dataAccs><notes type="DVN:TOU" level="dv">This dataset is made available under a Creative Commons CC0 license with the following additional/modified terms and conditions: CC0 Waiver</notes><setAvail/><useStmt/></dataAccs><othrStdyMat><relPubl><citation><biblCit>Article</biblCit></citation><ExtLink URI="https://doi.org/10.1098/rsbl.2018.0460"/></relPubl></othrStdyMat></stdyDscr><otherMat ID="f153762" URI="https://borealisdata.ca/api/access/datafile/153762" level="datafile"><labl>dryad_41752p5.json</labl><txt>Original JSON from Dryad</txt><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">text/plain;charset=UTF-8</notes></otherMat><otherMat ID="f153761" URI="https://borealisdata.ca/api/access/datafile/153761" level="datafile"><labl>Germain BL data.tab</labl><txt>First worksheet includes the demographic data, second worksheet the trait data. Species codes are expanded in the supplementary materials.</txt><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">text/tab-separated-values</notes></otherMat></codeBook></metadata></record></GetRecord></OAI-PMH>
@plesubc plesubc added the Type: Bug a defect label Feb 16, 2024
@cmbz cmbz moved this to SPRINT- NEEDS SIZING in IQSS Dataverse Project Mar 14, 2024
@cmbz cmbz added the Size: 3 A percentage of a sprint. 2.1 hours. label Mar 14, 2024
@cmbz cmbz moved this from SPRINT- NEEDS SIZING to SPRINT READY in IQSS Dataverse Project Mar 14, 2024
@landreev
Contributor

landreev commented Mar 14, 2024

I may be missing something obvious, but why do you think that the problem above is due to a missing xml declaration? Oh, I see what you mean now [edit: no, that's not what the OP meant] - the xml declaration that is present in the export, but absent in the OAI output. This is in fact "a feature, not a bug": these declarations are stripped on purpose when generating the OAI output. That xml header would in fact make the OAI xml invalid if left in place.

To me it looks like the "XML Parsing Error" in your example is due to the invalid UTF8 characters in the output (after the word "filtering"). I also suspect that it's a result of this bug:
#9910
which has since been fixed (in 6.1; can be patched in a previous Dataverse version by dropping the updated OAI library jar in place).

But please note that this is just a guess, we would need to confirm this.

probably repaired by the time you see it

Any chance you could point me to an OAI record that is still similarly broken?

@plesubc
Author

plesubc commented Mar 14, 2024

OAI output requires the declaration as cited above in 3.2.1 of the spec.

The first tag output is an XML declaration where the version is always 1.0 and the encoding is always UTF-8, eg: <?xml version="1.0" encoding="UTF-8" ?>

For example, this is missing the declaration:
https://borealisdata.ca/oai?verb=GetRecord&metadataPrefix=oai_ddi&identifier=doi:10.5683/SP2/NEPRTA

It is also the record that caused the chaos:
https://borealisdata.ca/dataset.xhtml?persistentId=doi:10.5683/SP2/NEPRTA
More specifically, Version 1 of the record contains the "right single quotation mark" which caused consternation.

I can't point you to a similar record that's broken, but you should be able to reproduce it by copying over the metadata from version 1 to wherever you test things: https://borealisdata.ca/dataset.xhtml?persistentId=doi:10.5683/SP2/NEPRTA&version=1.0

I'm not sure how you get "That xml header would in fact make the OAI xml invalid if left in place." Using the GetRecord verb and having the declaration in place should not invalidate the XML, unless I'm missing something.

"An example of a successful reply to the GetRecord request shown above is of the form:"

<?xml version="1.0" encoding="UTF-8" ?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" 
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
         http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
 <responseDate>2002-05-01T19:20:30Z</responseDate>
 <request verb="GetRecord" identifier="oai:arXiv.org:hep-th/9901001"
          metadataPrefix="oai_dc">http://an.oa.org/OAI-script</request> 
 <GetRecord>
  <record>
      ...
  </record>
 </GetRecord> 
</OAI-PMH>  

The GetRecord response from Dataverse does not conform to this model.

I didn't write whatever parsed the XML output from Borealisdata.ca. But I did have to find out what was causing the problem in the record, which I then traced to the offending UTF-8 character.

I realize that XML should explicitly be assumed to be UTF-8, but:

  • I didn't write the software the people in my library were using for parsing
  • They were looking for the encoding as required in the OAI spec
  • The declaration block was not present
  • For whatever reason the parser then choked on a UTF-8 character because the encoding was missing

As far as I can tell, inserting one line whenever the GetRecord OAI verb is used should be enough to make it conform to the spec. Oh and it's also missing from ListRecords, and possibly from other places although I haven't made an exhaustive search.
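Until the server output changes, a harvester could check for the declaration itself and add it before handing the response to a strict parser. This is a client-side sketch under my own assumptions, not Dataverse or xoai code; the helper names `has_xml_declaration` and `with_xml_declaration` are invented for illustration, and the sample body is a shortened stand-in for a real response.

```python
import codecs

# Declaration in the form shown in section 3.2.1 of the OAI-PMH spec.
XML_DECLARATION = b'<?xml version="1.0" encoding="UTF-8" ?>\n'


def _strip_bom(body: bytes) -> bytes:
    """Drop a leading UTF-8 byte order mark, if any."""
    if body.startswith(codecs.BOM_UTF8):
        return body[len(codecs.BOM_UTF8):]
    return body


def has_xml_declaration(body: bytes) -> bool:
    """True if the body starts with '<?xml' followed by whitespace."""
    body = _strip_bom(body)
    # The whitespace check avoids matching '<?xml-stylesheet ...?>'.
    return body.startswith(b"<?xml") and body[5:6].isspace()


def with_xml_declaration(body: bytes) -> bytes:
    """Prepend the declaration when it is absent; otherwise return the body unchanged."""
    if has_xml_declaration(body):
        return body
    return XML_DECLARATION + _strip_bom(body)


# Shortened stand-in for an OAI response that lacks the declaration.
sample = b'<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"></OAI-PMH>'
print(has_xml_declaration(sample))
print(with_xml_declaration(sample).decode("utf-8"))
```

Note that this only addresses conformance with the spec; it does nothing for a response whose bytes are not valid UTF-8 to begin with.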

@landreev
Contributor

OAI output requires the declaration as cited above in 3.2.1 of the spec.
The first tag output is an XML declaration where the version is always 1.0 and the encoding is always UTF-8, eg: <?xml version="1.0" encoding="UTF-8" ?>

The sentence quoted above refers to the xml declaration at the top of the full GetRecord XML output itself... So, we were talking about different things (I was referring to our code going to some trouble stripping these headers somewhere else). But, do note that I opened with an acknowledgment that it was possible I was missing something.

However, having taken another look, I can tell you 100% for sure that the xml error in your original example is most definitely the result of the bug I mentioned (#9910). I can send you more info about that bug; and I will otherwise look into this some more tomorrow.

My apologies for having missed this issue when you opened it last month.

@landreev
Contributor

Sorry for adding unnecessary confusion the other day. I can point you to the specific place where that xml declaration is in fact stripped from the output (inside the <metadata>...</metadata> blocks), but no, that's not relevant to the case at hand.

There are two separate things going on:

  1. You appear to be entirely correct about the OAI-PMH spec requiring the xml declaration. Somehow nobody has noticed this over the years in our OAI output. I kept saying "our code" but, strictly speaking, this output is generated in a third party library (xoai). But it is now maintained by a member of the Dataverse core team and I'll be talking to them about this. We can usually make any changes there and incorporate them into Dataverse fairly quickly.
  2. The absence of this header is NOT what's causing the problem presented in the opening comment. The borealisdata.ca Dataverse instance almost certainly has numerous other metadata fragments with UTF8 characters, including that specific non-ASCII quote (note that it occurs in multiple other places in the record in your example, displayed properly!), and the OAI records produced for most of them are perfectly fine, well-formed and parsable. What makes the specific record in your example not well-formed is the presence of junk bytes: binary characters not forming valid UTF8 sequences. In the quoted output they are turned into UTF8 "invalid character" symbols: ...filtering��� metaphor ... . I understand that it was reasonable to assume that it was the other way around, that the absence of the xml declaration turned a UTF8 character into invalid bytes, but no, that is not the case. The telltale sign of this problem being an instance of the peculiar bug I mentioned (#9910) is that the garbage sequence occurs at precisely the 1024 byte offset in the metadata fragment:
echo -n '<codeBook xmlns="ddi:codebook:2_5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:codebook:2_5 https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" version="2.5"><docDscr><citation><titlStmt><titl>Data from: The ‘filtering’ metaphor revisited: competition and environment jointly structure invasibility and coexistence</titl><IDNo agency="DOI">doi:10.5683/SP2/NEPRTA</IDNo></titlStmt><distStmt><distrbtr source="archive">Borealis</distrbtr><distDate>2021-05-19</distDate></distStmt><verStmt source="archive"><version date="2021-05-19" type="RELEASED">1</version></verStmt><biblCit>Germain, Rachel M.; Mayfield, Margaret M.; Gilbert, Benjamin, 2018, "Data from: The ‘filtering’ metaphor revisited: competition and environment jointly structure invasibility and coexistence", https://doi.org/10.5683/SP2/NEPRTA, Borealis, V1, UNF:6:kxNp8p/4jEx8g19DieTKdA== [fileUNF]</biblCit></citation></docDscr><stdyDscr><citation><titlStmt><titl>Data from: The ‘filtering�'  |  wc
       0      54    1026
echo -n '<codeBook xmlns="ddi:codebook:2_5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:codebook:2_5 https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" version="2.5"><docDscr><citation><titlStmt><titl>Data from: The ‘filtering’ metaphor revisited: competition and environment jointly structure invasibility and coexistence</titl><IDNo agency="DOI">doi:10.5683/SP2/NEPRTA</IDNo></titlStmt><distStmt><distrbtr source="archive">Borealis</distrbtr><distDate>2021-05-19</distDate></distStmt><verStmt source="archive"><version date="2021-05-19" type="RELEASED">1</version></verStmt><biblCit>Germain, Rachel M.; Mayfield, Margaret M.; Gilbert, Benjamin, 2018, "Data from: The ‘filtering’ metaphor revisited: competition and environment jointly structure invasibility and coexistence", https://doi.org/10.5683/SP2/NEPRTA, Borealis, V1, UNF:6:kxNp8p/4jEx8g19DieTKdA== [fileUNF]</biblCit></citation></docDscr><stdyDscr><citation><titlStmt><titl>Data from: The ‘filtering' | wc 
       0      54    1023

That was the nature of the weird bug: it manifested itself when a multi-byte UTF8 sequence happened to straddle the 1024-byte offset in the cached metadata record (and only that offset, not multiples of it).
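The general failure mode is easy to reproduce outside of Dataverse. The sketch below is an illustration only and is not the xoai code (see the linked issue for the real cause); the function names and the synthetic test string are made up. It decodes a byte stream in 1024-byte chunks, once naively and once with an incremental decoder, and also reports the byte offset of the first invalid UTF-8 sequence, which is the same check the `wc` commands above perform by hand. Unlike the real bug, the naive version here would break at every chunk boundary, not only the first.

```python
import codecs
from typing import Optional

CHUNK = 1024


def decode_chunks_naively(data: bytes, chunk_size: int = CHUNK) -> str:
    """Decode each chunk independently; a character split across chunks is lost."""
    chunks = (data[i:i + chunk_size] for i in range(0, len(data), chunk_size))
    return "".join(chunk.decode("utf-8", errors="replace") for chunk in chunks)


def decode_chunks_incrementally(data: bytes, chunk_size: int = CHUNK) -> str:
    """Decode chunk by chunk while carrying partial sequences over the boundary."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = [decoder.decode(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]
    parts.append(decoder.decode(b"", final=True))  # flush any trailing state
    return "".join(parts)


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the 0-based byte offset of the first invalid sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# Synthetic input: 1023 ASCII bytes, then U+2019 (3 bytes in UTF-8), so the
# quote straddles the 1024-byte boundary.
text = "a" * 1023 + "\u2019 metaphor"
data = text.encode("utf-8")

print("\ufffd" in decode_chunks_naively(data))     # replacement characters appear
print(decode_chunks_incrementally(data) == text)   # round-trips cleanly

# Simulate junk in the output: keep only the first byte of the quote.
damaged = data[:1024] + b" metaphor"
print(first_invalid_utf8_offset(damaged))
```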
If you are curious/have more time to kill, here's a long description of the bug in the xoai repo: gdcc/xoai#188
It was fixed there and the updated library was incorporated into Dataverse in the PR #10012.
Please see specifically this comment in issue 9910 on how to patch a pre-6.1 instance of Dataverse for this bug: #9910 (comment)

All the best,
-Leo

@landreev
Contributor

I opened an issue in the xoai repo (gdcc/xoai#225) for the missing declaration. It seems somewhat redundant, since the server already sends the Content-Type: text/xml;charset=UTF-8 header to the client. But the spec does say it's needed, so I trust the maintainer of the library to make the decision as to whether it's necessary.
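To illustrate both points with a standard parser: the charset is recoverable from the Content-Type header, and a parser that follows the XML 1.0 defaults reads declaration-less input as UTF-8, so only genuinely invalid bytes make it fail. This is a sketch using Python's standard library with shortened, made-up fragments; other parsers (such as whatever the harvesting software uses) may be stricter.

```python
import xml.etree.ElementTree as ET
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()  # lower-cased, or None if absent


def is_well_formed(body: bytes) -> bool:
    """True if the bytes parse as XML."""
    try:
        ET.fromstring(body)
    except ET.ParseError:
        return False
    return True


# No XML declaration, valid UTF-8 typographic quotes.
valid = "<titl>The \u2018filtering\u2019 metaphor</titl>".encode("utf-8")
# Same fragment with the closing quote cut short (illustrative junk bytes).
truncated = b"<titl>The \xe2\x80\x98filtering\xe2\x80 metaphor</titl>"

print(charset_from_content_type("text/xml;charset=UTF-8"))
print(is_well_formed(valid))
print(is_well_formed(truncated))
```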

Otherwise I'm going to close this issue.

Once again, I am really sorry we didn't get back to you sooner on this. We communicated directly with a couple of other Dataverse instances who reported the bug last fall and helped them patch their installations. But then once it was fixed in 6.1 we just moved on, assuming that everybody would just upgrade - I'm realizing now that was a mistaken assumption.

@poikilotherm
Contributor

poikilotherm commented Mar 25, 2024

I don't think it's the library's job to take care of the prolog. IMHO we'd need to make this change in Dataverse code.

See also my reply at gdcc/xoai#225 (comment)

@cmbz cmbz moved this from SPRINT READY to Done 🧹 in IQSS Dataverse Project Mar 27, 2024

5 participants