Expand Data Deposit API to support additional metadata #899

raprasad · 2014-09-02T14:27:46Z

Author Name: Eleni Castro (@posixeleni)
Original Redmine Issue: 3425, https://redmine.hmdc.harvard.edu/issues/3425
Original Date: 2014-01-22

So far, according to the OJS Dataverse plugin testers surveyed with results recorded at https://docs.google.com/spreadsheet/ccc?key=0AjeLxEN77UZodDJyd0pZdnlDZ3I5eWxnOHBmV1Q4dHc&usp=sharing the most commonly requested feature is the ability to customize which metadata fields are available as part of the data deposit form, which should be implemented in a future version. In order to support this, we will need to expand the API's metadata support beyond Dublin Core metadata. SWORD Protocol should be flexible enough for us to use other standards like DDI. At http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html#protocoloperations_creatingresource_entry the SWORDv2 spec says (emphasis added):

The client SHOULD add Dublin Core [DublinCore] terms to the Atom Entry as foreign markup (if appropriate); the terms MUST be embedded as direct children of the atom:entry element, if present.

The client MAY add any other metadata formats or foreign markup to the atom:entry element

We interpret this to mean that in addition to Dublin Core (dcterms, specifically), the SWORD spec is flexible enough to support wildly different metadata formats such as DataCite (https://www.datacite.org ), DDI (Data Documentation Initiative: http://www.ddialliance.org ), VO (Virtual Observatory: http://www.ivoa.net/documents/latest/RM.html ), ISA-Tab (Investigation, Study, and Assay in XML format: http://isatab.sourceforge.net/docs/Wiemann_SupplFile4.xml ), etc.

We're not sure if any other SWORD server implementation is going beyond dcterms, however, which is what the spec requires. We'll ask on the mailing list.

raprasad · 2014-09-02T14:27:46Z

Original Redmine Comment
Author Name: Eleni Castro (@posixeleni)
Original Date: 2014-01-23T21:12:50Z

This would line up appropriately with the metadata expansion that we are actively working on for Dataverse 4.0.

raprasad · 2014-09-02T14:27:46Z

Original Redmine Comment
Author Name: Eleni Castro (@posixeleni)
Original Date: 2014-02-10T17:56:14Z

Eleni Castro wrote:

So far, according to the OJS Dataverse plugin testers surveyed with results recorded at https://docs.google.com/spreadsheet/ccc?key=0AjeLxEN77UZodDJyd0pZdnlDZ3I5eWxnOHBmV1Q4dHc&usp=sharing the most commonly requested feature is the ability to customize which metadata fields are available as part of the data deposit form, which should be implemented in a future version. In order to support this, we will need to expand the API's metadata support beyond Dublin Core metadata. SWORD Protocol should be flexible enough for us to use other standards like DDI. At http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html#protocoloperations_creatingresource_entry the SWORDv2 spec says (emphasis added):

The client SHOULD add Dublin Core [DublinCore] terms to the Atom Entry as foreign markup (if appropriate); the terms MUST be embedded as direct children of the atom:entry element, if present.

The client MAY add any other metadata formats or foreign markup to the atom:entry element

We interpret this to mean that in addition to Dublin Core (dcterms, specifically), the SWORD spec is flexible enough to support wildly different metadata formats such as DataCite (https://www.datacite.org ), DDI (Data Documentation Initiative: http://www.ddialliance.org ), VO (Virtual Observatory: http://www.ivoa.net/documents/latest/RM.html ), ISA-Tab (Investigation, Study, and Assay in Tabular format: http://isatab.sourceforge.net/format.html ), etc.

We're not sure if any other SWORD server implementation is going beyond dcterms, however, which is what the spec requires. We'll ask on the mailing list.

raprasad · 2014-09-02T14:27:46Z

Original Redmine Comment
Author Name: Philip Durbin (@pdurbin)
Original Date: 2014-06-02T20:19:30Z

See also the discussion I kicked off here:

[sword-app-tech] client SHOULD add Dublin Core terms to the Atom Entry, MAY add any other metadata formats or foreign markup - http://www.mail-archive.com/[email protected]/msg00384.html

I'd rather focus effort on our "native" API in 4.0, however, for supporting more metadata. It's already working. Docs at https://github.com/IQSS/dataverse/tree/master/scripts/api

posixeleni · 2014-09-22T21:46:25Z

Will start with the Ubiquity Press Datasets as an example for what metadata fields we should extend support for in version 1.1 of the plugin. https://docs.google.com/document/d/1CRGw4nbOS0ccynJdq0Am-7I9uazt-dsxkO6dWMpn-ts/edit?usp=sharing

Will eventually extend support to other metadata schemas (DDI, DataCite, etc.), but this may be done through the native JSON API rather than the SWORD plugin.

cc/ @pdurbin

posixeleni · 2014-09-23T18:19:17Z

@axfelix @jwhitney @pdurbin Here is a sample atom-xml file that I put together based on this dataset:
http://dx.doi.org/10.7910/DVN/23791 There are a few attribute hacks I have added in under Creator (affiliation), and Contributor (Funder). ~~I also added a new field for date that clearly distinguishes between publication date (dcterms:available) and when the dataset was produced (dcterms:created)~~. Please let me know if we can support any of this for the plugin and API.

<?xml version="1.0" encoding="UTF-8"?>

<entry xmlns="http://www.w3.org/2005/Atom" xmlns:dcterms="http://purl.org/dc/terms/">
    <dcterms:title>“Changes in test-taking patterns over time” concerning the Flynn Effect in Estonia</dcterms:title>
    <!-- description of the data rather than the article -->
    <dcterms:description>The dataset from our previous Intelligence paper consists of data collected with the National Intelligence Tests (NIT, Estonian adaptation) in two historical time points: in 1934/36 (N=890) and 2006 (N=913) for students with an average age of 13. The data-file consists of information about cohort, age, and gender and test results at the item level for nine of the ten NIT subtests and subtest scores for the 10th subtest. Three answer types are separated: right answer, wrong answer and missing answer. Data can be used for psychometric research of cohort and sex differences at the scale and item level.</dcterms:description>
    <!-- Author and Affiliation are being used by this particular user. Affiliation is another attribute hack -->
    <dcterms:creator affiliation="Department of Psychology, University of Tartu, Estonia">Must, Olev</dcterms:creator>
    <dcterms:creator affiliation="Department of Psychology, University of Tartu, Estonia">Must, Aasa</dcterms:creator>
    <dcterms:contributor type="Funder">Estonian Scientific Foundation: grant no 2387 and 5856.   European Social Fund: a Primus  grant (#3-8.2/60) to Anu Realo.  Baylor University financial support for data quality control in archive.</dcterms:contributor>
    <!-- dataset producer rather than Journal Publisher: -->
    <dcterms:publisher>Insert Dataset publisher</dcterms:publisher>
    <dcterms:rights>Journal copyright, license or terms of use notice</dcterms:rights>
    <!-- production date in Dataverse: -->
    <!-- date, if published: -->
    <dcterms:date>2014-09-22</dcterms:date>
    <dcterms:available>2014-09-22</dcterms:available>
    <!-- URI attributes, if published: -->
    <dcterms:isReferencedBy agency="DOI" IDNo="" holdingsURI="http://dx.doi.org/">Must, O., &amp; Must, A. (2014). Sample submission. Journal Of Plugin Testing, 1(2).</dcterms:isReferencedBy>
    <!-- Discipline, subject classification, keywords & coverage, if the journal has enabled these fields in article metadata forms -->
    <dcterms:subject>Academic discipline</dcterms:subject>
    <dcterms:subject>Subject classification</dcterms:subject>
    <dcterms:subject>Article keywords</dcterms:subject>
    <dcterms:subject>Geographic coverage</dcterms:subject>
    <!-- Supplementary file: subject, type -->
    <dcterms:subject>Keyword 1, keyword 2, keyword 3</dcterms:subject>
    <dcterms:type>Data Set</dcterms:type>
</entry>
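
For anyone checking what a consumer would see in the sample entry above, here is a minimal sketch in Python that reads the two attribute hacks back out. It uses only the standard library; the trimmed XML string and the function name `read_people` are illustrative assumptions, not part of the plugin or of Dataverse.

```python
import xml.etree.ElementTree as ET

DCTERMS = "http://purl.org/dc/terms/"

# Trimmed copy of the sample entry: only the elements carrying the attribute hacks.
# The contributor text is shortened for readability.
ENTRY_XML = """<entry xmlns="http://www.w3.org/2005/Atom" xmlns:dcterms="http://purl.org/dc/terms/">
    <dcterms:creator affiliation="Department of Psychology, University of Tartu, Estonia">Must, Olev</dcterms:creator>
    <dcterms:creator affiliation="Department of Psychology, University of Tartu, Estonia">Must, Aasa</dcterms:creator>
    <dcterms:contributor type="Funder">Estonian Scientific Foundation</dcterms:contributor>
</entry>"""


def read_people(entry_xml):
    """Return (creators, contributors) as lists of dicts (hypothetical helper)."""
    root = ET.fromstring(entry_xml)
    creators = [
        # el.get() returns None when the optional attribute is absent.
        {"name": (el.text or "").strip(), "affiliation": el.get("affiliation")}
        for el in root.findall(f"{{{DCTERMS}}}creator")
    ]
    contributors = [
        {"name": (el.text or "").strip(), "type": el.get("type")}
        for el in root.findall(f"{{{DCTERMS}}}contributor")
    ]
    return creators, contributors


if __name__ == "__main__":
    creators, contributors = read_people(ENTRY_XML)
    for person in creators + contributors:
        print(person)
```

Because the attributes are optional, a reader written this way still accepts entries that only use plain dcterms elements.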

axfelix · 2014-09-23T21:41:57Z

This looks sensible enough -- the affiliation="" attribute of dcterms:creator is no worse than the hack we already made to isReferencedBy and I think it's fair to say that we're still not overloading it. Thanks!

jwhitney · 2014-09-23T22:10:07Z

The plugin could provide <dcterms:contributor type="funder"> through the article-level sponsoring agencies field.

Affiliation as you have it in the example (dept., org., country) is split across several optional fields. Block text would need to be reformatted into a phrase & HTML stripped, but it's possible to provide a reasonable value.

I'm not sure what's the best way to provide a description or date that describes the dataset rather than the article. Right now, the plugin maps article metadata to the dataset, then suppfile metadata is added to dataset fields that allow multiple values, like keyword. So if an OJS article has more than one suppfile in a dataset, suppfile-level keywords are combined & mapped up to the dataset, but that's not going to work well for fields that expect a single value, like ~~date~~ description or date available.

@axfelix, any thoughts? Re-thinking the mapping (e.g., create one dataset for each suppfile) means multiple data citations / article (which is maybe ok, although seems excessive) but would provide finer control over what shows up on the Dataverse side.

posixeleni · 2014-09-24T14:14:48Z

@jwhitney @axfelix Newbie question: For "Date Available" (when a dataset is published) is this information system generated when you send the command to Dataverse to Release the dataset?

About the one-dataset-per-suppfile suggestion: this would be problematic on our end, since we ultimately want all files that belong to a single dataset to be included together (one data citation). Is it possible to aggregate some of the metadata from the individual supp files being uploaded, or to have the author fill in metadata at the dataset level rather than file by file? We do not yet have a way to index or store file-level information in Dataverse. I imagine most people would fill in any relevant dataset-related information the first time they add a supp file? Not sure how this would work in your system, though.

jwhitney · 2014-09-24T17:18:29Z

If you're going to be fairly conservative in the number of dataset-level fields, we could look at handling the description the same way as external data citations. The field's presented in the suppfile form, but is stored in article metadata, so the same value's shared across suppfiles.

axfelix · 2014-09-24T17:19:45Z

I was afraid of suggesting that for the amount of additional logic it'd take to show the same field (and pre-populate it with whatever was already entered?) on repeat suppfile uploads, but if you want to take that route, it's fine with me!

posixeleni · 2014-09-24T18:40:34Z

@jwhitney how much time do you think you would need to implement these changes? Want to make sure our deadline (assigned to @pdurbin) for completing our part of this gives you enough time to do your part before the end of November.

cc/ @axfelix

jwhitney · 2014-09-24T21:02:58Z

That's lots of time: adding metadata is fairly straightforward.

Do you want to drop article abstract from dataset metadata altogether, in favour of the desc. field to be added to the suppfile form? Or use abstract as a default if dataset desc. not provided?

posixeleni · 2014-09-25T13:49:01Z

I prefer your second option:

use abstract as a default if dataset desc. not provided
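
To make the agreed behaviour concrete, here is a rough Python sketch of the mapping discussed above: multi-valued suppfile keywords are combined up to the dataset, while the single-valued description falls back to the article abstract. The dictionary keys, the function name `dataset_metadata`, and the de-duplication of keywords are assumptions for illustration; the actual plugin is not written this way.

```python
def dataset_metadata(article, suppfiles, dataset_description=None):
    """Combine article and suppfile metadata into one dataset record (illustrative only)."""
    # Multi-valued field: merge keywords from every suppfile, keeping first-seen order.
    # Dropping duplicates is an assumption; the thread only says they are "combined".
    keywords = []
    for suppfile in suppfiles:
        for keyword in suppfile.get("keywords", []):
            if keyword not in keywords:
                keywords.append(keyword)

    # Single-valued field: prefer the shared dataset description,
    # otherwise use the article abstract as the default.
    description = dataset_description or article.get("abstract", "")

    return {
        "title": article["title"],
        "description": description,
        "keywords": keywords,
    }


if __name__ == "__main__":
    article = {"title": "Sample submission", "abstract": "Abstract of the article."}
    suppfiles = [{"keywords": ["intelligence", "cohort"]}, {"keywords": ["cohort", "Estonia"]}]
    print(dataset_metadata(article, suppfiles))
    print(dataset_metadata(article, suppfiles, "Description of the data itself."))
```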

mercecrosas · 2014-09-25T14:08:37Z

Alex, I agree completely - we should not create one dataset per file.

On Sep 24, 2014, at 6:31 PM, axfelix wrote:

I'd rather not have to create one dataset for each suppfile -- that seems inconsistent with Dataverse attempting to provide a single Data Citation / DOI for a landing page that includes the entire study w/ multiple files, and probably creates more cruft than necessary.

We've talked a few times now about the possibility of providing dataset-level metadata that is a) common to all of the suppfiles being uploaded but b) necessarily different from article-level metadata, and we haven't figured out a desirable solution yet.

I'm not quite seeing what the problem is with date available, since that wouldn't be filled by an author anyhow -- that should be an internal value that's getting pulled from the DB at the time of publication, no? As for description, we could probably add as many as 1-2 fields on the OJS "add a suppfile / these are the suppfiles you've uploaded so far" page (i.e., not the "metadata per this individual suppfile page") for "any descriptive information that is common to all of your supplementary data but not to the article," though I imagine most authors would disregard it, and I'd want to be really conservative here.


axfelix · 2014-09-25T14:19:09Z

Yeah, I agree with Eleni -- use abstract if dataset description not provided.

pdurbin · 2014-10-17T13:45:21Z

There are a few attribute hacks I have added in under Creator (affiliation), and Contributor (Funder).

This actually seems nicely in line with "New affiliation attributes for Creator and Contributor" in the just-released DataCite Metadata Schema Version 3.1: http://www.datacite.org/node/141

It's like @posixeleni is psychic. :)

I wonder if there's anything else in there we should consider.

posixeleni · 2014-10-17T14:20:01Z

Spoke with @pdurbin and clarified that the final requirement for this ticket is that we modify the atom xml for two elements:

Under dcterms:creator add the attribute affiliation="foo" which would map to our authorAffiliation field
Under dcterms:contributor add the attribute type="Funder". Note that the value entered in type= needs to match our list of contributorType controlled vocabulary found in https://github.com/IQSS/dataverse/blob/master/scripts/api/data/metadatablocks/citation.tsv
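
From the client side, the two requirements above could be sketched roughly as follows in Python. The vocabulary set contains only the three contributor types named in this thread (Editor, Funder, Researcher) and is an assumption standing in for the full list in citation.tsv; `build_entry` is a hypothetical helper, not plugin code.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
DCTERMS = "http://purl.org/dc/terms/"

# ASSUMPTION: partial vocabulary; the authoritative list lives in citation.tsv.
CONTRIBUTOR_TYPES = {"Editor", "Funder", "Researcher"}

# Serialize Atom as the default namespace and Dublin Core terms as "dcterms:".
ET.register_namespace("", ATOM)
ET.register_namespace("dcterms", DCTERMS)


def build_entry(title, creators, contributors):
    """Build Atom entry XML.

    creators: list of (name, affiliation or None)
    contributors: list of (name, contributor_type)
    """
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{DCTERMS}}}title").text = title

    for name, affiliation in creators:
        creator = ET.SubElement(entry, f"{{{DCTERMS}}}creator")
        creator.text = name
        if affiliation:
            creator.set("affiliation", affiliation)

    for name, contributor_type in contributors:
        # Reject values outside the vocabulary before sending anything.
        if contributor_type not in CONTRIBUTOR_TYPES:
            raise ValueError(f"unknown contributor type: {contributor_type!r}")
        contributor = ET.SubElement(
            entry, f"{{{DCTERMS}}}contributor", {"type": contributor_type}
        )
        contributor.text = name

    return ET.tostring(entry, encoding="unicode")


if __name__ == "__main__":
    print(build_entry(
        "Sample submission",
        [("Must, Olev", "Department of Psychology, University of Tartu, Estonia")],
        [("Estonian Scientific Foundation", "Funder")],
    ))
```

Validating on the client is a design choice for this sketch; it simply avoids sending a type value the server's vocabulary would not recognise.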

pdurbin · 2014-10-17T14:26:19Z

Right, and to be clear, we're no longer making any changes to dates. We used some ~~strikethrough~~ above for dcterms:created and dcterms:available, the equivalents of which are both system generated. We won't be supporting use of these elements in SWORD entry XML.

axfelix · 2014-10-17T15:40:44Z

Nice to see they're being proactive about changes to the DataCite standard at this point. Nothing else seems immediately worth adopting, but still good.

pdurbin · 2014-10-17T19:32:06Z

As of 5257f0d we now support adding authorAffiliation, contributorName, and contributorType via SWORD, using elements like this:

+ <dcterms:creator affiliation="Department of Psychology, University of Tartu, Estonia">Must, Olev</dcterms:creator>
+ <dcterms:contributor type="Funder">Estonian Scientific Foundation: grant no 2387 and 5856. European Social Fund: a Primus grant (#3-8.2/60) to Anu Realo. Baylor University financial support for data quality control in archive.</dcterms:contributor>

@kcondon to test you'll first need to run these SQL statements...

INSERT INTO foreignmetadatafieldmapping (id, foreignfieldxpath, metadatablockname, datasetfieldname, isattribute, parentfieldmapping_id, foreignmetadataformatmapping_id) VALUES (15, 'affiliation', 'citation', 'authorAffiliation', TRUE, 3, 1 );
INSERT INTO foreignmetadatafieldmapping (id, foreignfieldxpath, metadatablockname, datasetfieldname, isattribute, parentfieldmapping_id, foreignmetadataformatmapping_id) VALUES (16, ':contributor', 'citation', 'contributorName', FALSE, NULL, 1 );
INSERT INTO foreignmetadatafieldmapping (id, foreignfieldxpath, metadatablockname, datasetfieldname, isattribute, parentfieldmapping_id, foreignmetadataformatmapping_id) VALUES (17, 'type', 'citation', 'contributorType', TRUE, 16, 1 );

(or drop the database and set up again)
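
One way to read the three mapping rows above is sketched below in Python: a non-attribute row matches a dcterms element, and an attribute row hangs off its parent row. This is only an interpretation for illustration, not the Dataverse implementation. In particular, row 3 is not shown in this thread, so treating it as ':creator' mapped to 'authorName' is an assumption, as is reading ':contributor' as a local element name in the dcterms namespace.

```python
import xml.etree.ElementTree as ET

DCTERMS = "http://purl.org/dc/terms/"

# (id, foreignfieldxpath, datasetfieldname, isattribute, parentfieldmapping_id)
MAPPINGS = [
    (3, ":creator", "authorName", False, None),  # ASSUMPTION: row 3 is not shown in the thread
    (15, "affiliation", "authorAffiliation", True, 3),
    (16, ":contributor", "contributorName", False, None),
    (17, "type", "contributorType", True, 16),
]


def apply_mappings(entry_xml, mappings=MAPPINGS):
    """Return one dict of dataset fields per matched element (illustrative only)."""
    root = ET.fromstring(entry_xml)
    attribute_rows = [row for row in mappings if row[3]]
    records = []
    for map_id, xpath, field, is_attribute, _parent in mappings:
        if is_attribute:
            continue  # attribute rows are handled through their parent element
        local_name = xpath.lstrip(":")
        for element in root.findall(f"{{{DCTERMS}}}{local_name}"):
            record = {field: (element.text or "").strip()}
            for _id, attr, attr_field, _flag, parent in attribute_rows:
                if parent == map_id and element.get(attr) is not None:
                    record[attr_field] = element.get(attr)
            records.append(record)
    return records


if __name__ == "__main__":
    entry = (
        '<entry xmlns="http://www.w3.org/2005/Atom" xmlns:dcterms="http://purl.org/dc/terms/">'
        '<dcterms:creator affiliation="University of Tartu">Must, Olev</dcterms:creator>'
        '<dcterms:contributor type="Funder">Estonian Scientific Foundation</dcterms:contributor>'
        "</entry>"
    )
    for record in apply_mappings(entry):
        print(record)
```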

... then, if you run https://github.com/IQSS/dataverse/blob/master/scripts/api/data-deposit/create-dataset-899-expansion from the root of the repo (operates on https://github.com/IQSS/dataverse/blob/master/scripts/api/data-deposit/data/atom-entry-study-899-expansion.xml ) you should end up with a dataset like the one below.

Unfortunately, it is expected that the controlled vocabulary of contributorType (Editor, Funder, Researcher, etc.) is not enforced, but we will fix this in #973.

@posixeleni I left a reminder in the SWORD backward compatibility doc that we need to document this new functionality: "FIXME: Document via example (in XML) how we now support authorAffiliation, contributorName, and contributorType as of #899."

pdurbin · 2014-10-24T18:11:59Z

@esotiri or @kcondon after you have tested this on dvn-build can you please do a build on https://apitest.dataverse.org so @jwhitney can test? This was mentioned by @posixeleni in [pkp-dataverse-integration] Email Your Updates for PKP-OJS project. Thanks!

esotiri · 2014-10-27T14:13:39Z

the latest code is in https://apitest.dataverse.org for @jwhitney to test.

jwhitney · 2014-10-27T14:57:26Z

Thanks! Testing now.


esotiri · 2014-10-29T15:25:14Z

dataset created with atom-entry-study-899-expansion.xml

esotiri · 2014-10-29T15:36:43Z

issue resolved

pdurbin · 2014-10-29T15:36:58Z

Unable to parse SWORD entry

This is the expected error if you try to use SWORD with a non-existent XML input file. Please see #893 (comment) for details.

raprasad added Type: Feature labels Sep 2, 2014

raprasad added this to the Dataverse 4.0: In Review milestone Sep 2, 2014

eaquigley modified the milestones: Dataverse 4.0: In Review, In Review - Dataverse 4.0, Beta 8 - Dataverse 4.0 Sep 2, 2014

kcondon added the Status: QA label Sep 9, 2014

posixeleni removed the Status: QA label Sep 22, 2014

posixeleni self-assigned this Sep 22, 2014

posixeleni added the Feature: API label Sep 22, 2014

posixeleni assigned pdurbin and unassigned posixeleni Sep 23, 2014

posixeleni added this to the Beta 9 - Dataverse 4.0 milestone Sep 23, 2014

posixeleni changed the title ~~Expand Data Deposit API support to additional metadata schemas (ex. DDI)~~ Expand Data Deposit API to support additional metadata Sep 23, 2014

pdurbin modified the milestones: Beta 8 - Dataverse 4.0, Beta 9 - Dataverse 4.0 Oct 14, 2014

pdurbin added a commit that referenced this issue Oct 17, 2014

SWORD: support author affiliation, contributor name/type #899

5257f0d

pdurbin removed their assignment Oct 17, 2014

pdurbin added Status: QA and removed Status: Design labels Oct 17, 2014

pdurbin mentioned this issue Oct 20, 2014

Data Deposit API: Contributor Type can be changed #973

Closed

kcondon assigned esotiri Oct 24, 2014

pdurbin added a commit that referenced this issue Oct 28, 2014

SWORD: updated example XML for metadata expansion #899

fb88c17

- dcterms:available never came to fruition, see #899 - also multiple dcterms:coverage works fine now, enabling

esotiri closed this as completed Oct 29, 2014

bencomp mentioned this issue Jul 15, 2015

Connect metadata field names and blocks to (de facto) standard ontologies #2357

Closed
