Expand Data Deposit API to support additional metadata #899

Closed
raprasad opened this issue Sep 2, 2014 · 28 comments

@raprasad
Contributor

raprasad commented Sep 2, 2014


Author Name: Eleni Castro (@posixeleni)
Original Redmine Issue: 3425, https://redmine.hmdc.harvard.edu/issues/3425
Original Date: 2014-01-22


So far, according to the OJS Dataverse plugin testers we surveyed (results recorded at https://docs.google.com/spreadsheet/ccc?key=0AjeLxEN77UZodDJyd0pZdnlDZ3I5eWxnOHBmV1Q4dHc&usp=sharing ), the most commonly requested feature is the ability to customize which metadata fields are available in the data deposit form. This should be implemented in a future version. To support it, we will need to expand the API's metadata support beyond Dublin Core. The SWORD protocol should be flexible enough for us to use other standards such as DDI. At http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html#protocoloperations_creatingresource_entry the SWORDv2 spec says (emphasis added):

  • The client SHOULD add Dublin Core [DublinCore] terms to the Atom Entry as foreign markup (if appropriate); the terms MUST be embedded as direct children of the atom:entry element, if present.
  • The client MAY add any other metadata formats or foreign markup to the atom:entry element

We interpret this to mean that in addition to Dublin Core (dcterms, specifically), the SWORD spec is flexible enough to support wildly different metadata formats such as DataCite (https://www.datacite.org ), DDI (Data Documentation Initiative: http://www.ddialliance.org ), VO (Virtual Observatory: http://www.ivoa.net/documents/latest/RM.html ), ISA-Tab (Investigation, Study, and Assay in XML format: http://isatab.sourceforge.net/docs/Wiemann_SupplFile4.xml ), etc.

We're not sure if any other SWORD server implementation is going beyond dcterms, however, which is what the spec requires. We'll ask on the mailing list.
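As a rough illustration of the spec wording quoted above, here is a minimal Python sketch that builds an Atom entry whose dcterms terms are direct children of atom:entry. The helper name `build_entry` and the sample values are assumptions made for illustration; this is not code from the plugin or from Dataverse.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
DCTERMS_NS = "http://purl.org/dc/terms/"

# Serialize Atom as the default namespace and Dublin Core terms as "dcterms:".
ET.register_namespace("", ATOM_NS)
ET.register_namespace("dcterms", DCTERMS_NS)


def build_entry(terms):
    """Return an atom:entry element with one dcterms child per (name, text) pair.

    Hypothetical helper: the terms are added as direct children of the entry,
    which is what the SWORDv2 profile asks for.
    """
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    for name, text in terms:
        child = ET.SubElement(entry, f"{{{DCTERMS_NS}}}{name}")
        child.text = text
    return entry


# Sample values only; a real deposit would carry the dataset's own metadata.
entry = build_entry([
    ("title", "Sample dataset"),
    ("creator", "Must, Olev"),
    ("type", "Data Set"),
])
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

Other metadata formats (DDI, DataCite, and so on) would be added to the same entry as additional foreign markup under their own namespaces.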

@raprasad
Contributor Author

raprasad commented Sep 2, 2014


Original Redmine Comment
Author Name: Eleni Castro (@posixeleni)
Original Date: 2014-01-23T21:12:50Z


This would line up appropriately with the metadata expansion that we are actively working on for Dataverse 4.0.

@raprasad
Contributor Author

raprasad commented Sep 2, 2014


Original Redmine Comment
Author Name: Eleni Castro (@posixeleni)
Original Date: 2014-02-10T17:56:14Z


Eleni Castro wrote:

So far, according to the OJS Dataverse plugin testers we surveyed (results recorded at https://docs.google.com/spreadsheet/ccc?key=0AjeLxEN77UZodDJyd0pZdnlDZ3I5eWxnOHBmV1Q4dHc&usp=sharing ), the most commonly requested feature is the ability to customize which metadata fields are available in the data deposit form. This should be implemented in a future version. To support it, we will need to expand the API's metadata support beyond Dublin Core. The SWORD protocol should be flexible enough for us to use other standards such as DDI. At http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html#protocoloperations_creatingresource_entry the SWORDv2 spec says (emphasis added):

  • The client SHOULD add Dublin Core [DublinCore] terms to the Atom Entry as foreign markup (if appropriate); the terms MUST be embedded as direct children of the atom:entry element, if present.
  • The client MAY add any other metadata formats or foreign markup to the atom:entry element

We interpret this to mean that in addition to Dublin Core (dcterms, specifically), the SWORD spec is flexible enough to support wildly different metadata formats such as DataCite (https://www.datacite.org ), DDI (Data Documentation Initiative: http://www.ddialliance.org ), VO (Virtual Observatory: http://www.ivoa.net/documents/latest/RM.html ), ISA-Tab (Investigation, Study, and Assay in Tabular format: http://isatab.sourceforge.net/format.html ), etc.

We're not sure if any other SWORD server implementation is going beyond dcterms, however, which is what the spec requires. We'll ask on the mailing list.

@raprasad
Contributor Author

raprasad commented Sep 2, 2014


Original Redmine Comment
Author Name: Philip Durbin (@pdurbin)
Original Date: 2014-06-02T20:19:30Z


See also the discussion I kicked off here:

[sword-app-tech] client SHOULD add Dublin Core terms to the Atom Entry, MAY add any other metadata formats or foreign markup - http://www.mail-archive.com/[email protected]/msg00384.html

I'd rather focus effort on our "native" API in 4.0, however, for supporting more metadata. It's already working. Docs at https://github.com/IQSS/dataverse/tree/master/scripts/api

@raprasad raprasad added this to the Dataverse 4.0: In Review milestone Sep 2, 2014
@eaquigley eaquigley modified the milestones: Dataverse 4.0: In Review, In Review - Dataverse 4.0, Beta 8 - Dataverse 4.0 Sep 2, 2014
@posixeleni posixeleni self-assigned this Sep 22, 2014
@posixeleni
Contributor

Will start with the Ubiquity Press Datasets as an example for what metadata fields we should extend support for in version 1.1 of the plugin. https://docs.google.com/document/d/1CRGw4nbOS0ccynJdq0Am-7I9uazt-dsxkO6dWMpn-ts/edit?usp=sharing

Will eventually extend support to other metadata schemas (DDI, DataCite, etc.), but this may be done through the native JSON API rather than the SWORD plugin.

cc/ @pdurbin

@posixeleni
Contributor

@axfelix @jwhitney @pdurbin Here is a sample atom-xml file that I put together based on this dataset: http://dx.doi.org/10.7910/DVN/23791
There are a few attribute hacks I have added in under Creator (affiliation) and Contributor (Funder). I also added a new field for date that clearly distinguishes between publication date (dcterms:available) and when the dataset was produced (dcterms:created). Please let me know if we can support any of this for the plugin and API.

<?xml version="1.0" encoding="UTF-8"?>

<entry xmlns="http://www.w3.org/2005/Atom" xmlns:dcterms="http://purl.org/dc/terms/">
    <dcterms:title>“Changes in test-taking patterns over time” concerning the Flynn Effect in Estonia</dcterms:title>
    <!-- description of the data rather than the article -->
    <dcterms:description>The dataset from our previous Intelligence paper consists of data collected with the National Intelligence Tests (NIT, Estonian adaptation) in two historical time points: in 1934/36 (N=890) and 2006 (N=913) for students with an average age of 13. The data-file consists of information about cohort, age, and gender and test results at the item level for nine of the ten NIT subtests and subtest scores for the 10th subtest. Three answer types are separated: right answer, wrong answer and missing answer. Data can be used for psychometric research of cohort and sex differences at the scale and item level.</dcterms:description>
    <!-- Author and Affiliation are being used by this particular user. Affiliation is another attribute hack -->
    <dcterms:creator affiliation="Department of Psychology, University of Tartu, Estonia">Must, Olev</dcterms:creator>
    <dcterms:creator affiliation="Department of Psychology, University of Tartu, Estonia">Must, Aasa</dcterms:creator>
    <dcterms:contributor type="Funder">Estonian Scientific Foundation: grant no 2387 and 5856.   European Social Fund: a Primus  grant (#3-8.2/60) to Anu Realo.  Baylor University financial support for data quality control in archive.</dcterms:contributor>
    <!-- dataset producer rather than Journal Publisher: -->
    <dcterms:publisher>Insert Dataset publisher</dcterms:publisher>
    <dcterms:rights>Journal copyright, license or terms of use notice</dcterms:rights>
    <!-- production date in Dataverse: -->
    <!-- date, if published: -->
    <dcterms:date>2014-09-22</dcterms:date>
    <dcterms:available>2014-09-22</dcterms:available>
    <!-- URI attributes, if published: -->
    <dcterms:isReferencedBy agency="DOI" IDNo="" holdingsURI="http://dx.doi.org/">Must, O., &amp; Must, A. (2014). Sample submission. Journal Of Plugin Testing, 1(2).</dcterms:isReferencedBy>
    <!-- Discipline, subject classification, keywords and coverage, if the journal has enabled these fields in article metadata forms -->
    <dcterms:subject>Academic discipline</dcterms:subject>
    <dcterms:subject>Subject classification</dcterms:subject>
    <dcterms:subject>Article keywords</dcterms:subject>
    <dcterms:subject>Geographic coverage</dcterms:subject>
    <!-- Supplementary file: subject, type -->
    <dcterms:subject>Keyword 1, keyword 2, keyword 3</dcterms:subject>
    <dcterms:type>Data Set</dcterms:type>
</entry>

@posixeleni posixeleni assigned pdurbin and unassigned posixeleni Sep 23, 2014
@posixeleni posixeleni added this to the Beta 9 - Dataverse 4.0 milestone Sep 23, 2014
@posixeleni posixeleni changed the title Expand Data Deposit API support to additional metadata schemas (ex. DDI) Expand Data Deposit API to support additional metadata Sep 23, 2014
@axfelix

axfelix commented Sep 23, 2014

This looks sensible enough -- the affiliation="" attribute of dcterms:creator is no worse than the hack we already made to isReferencedBy and I think it's fair to say that we're still not overloading it. Thanks!

@jwhitney

The plugin could provide <dcterms:contributor type="funder"> through the article-level sponsoring agencies field:

[Screenshot: sponsors]

Affiliation as you have it in the example (dept., org., country) is split across several optional fields. Block text would need to be reformatted into a phrase & HTML stripped, but it's possible to provide a reasonable value.

[Screenshot: affiliation]

I'm not sure what's the best way to provide a description or date that describes the dataset rather than the article. Right now, the plugin maps article metadata to the dataset, then suppfile metadata is added to dataset fields that allow multiple values, like keyword. So if an OJS article has more than one suppfile in a dataset, suppfile-level keywords are combined & mapped up to the dataset, but that's not going to work well for fields that expect a single value, like date description or date available.

@axfelix, any thoughts? Re-thinking the mapping (e.g., create one dataset for each suppfile) means multiple data citations per article (which is maybe ok, although it seems excessive) but would provide finer control over what shows up on the Dataverse side.
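The mapping described above (article metadata first, then suppfile values folded into multi-valued fields only) can be sketched in Python. The field names, the `multi_valued` parameter, and the function itself are assumptions for illustration, not the plugin's actual code.

```python
def merge_suppfile_metadata(article, suppfiles, multi_valued=("subject",)):
    """Combine article-level and suppfile-level metadata into one dataset record.

    Hypothetical sketch: every field maps to a list of values. Suppfile values
    are merged only into fields listed in `multi_valued`; anything else from a
    suppfile is skipped, because a single-valued dataset field cannot hold one
    value per suppfile.
    """
    dataset = {key: list(values) for key, values in article.items()}
    for suppfile in suppfiles:
        for key, values in suppfile.items():
            if key not in multi_valued:
                continue  # single-valued fields keep the article-level value
            merged = dataset.setdefault(key, [])
            for value in values:
                if value not in merged:  # avoid duplicate keywords
                    merged.append(value)
    return dataset


# Sample data only.
article = {"title": ["Sample submission"], "subject": ["psychology"]}
suppfiles = [
    {"subject": ["IQ", "psychology"], "description": ["File A"]},
    {"subject": ["Estonia"], "description": ["File B"]},
]
dataset = merge_suppfile_metadata(article, suppfiles)
print(dataset)
```

In this sketch the two suppfile descriptions are dropped, which is the gap under discussion: there is no obvious single dataset-level description when each suppfile carries its own.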

@posixeleni
Contributor

@jwhitney @axfelix Newbie question: For "Date Available" (when a dataset is published) is this information system generated when you send the command to Dataverse to Release the dataset?

About the one-dataset-per-suppfile suggestion: this would be problematic on our end, since we ultimately want all files that belong to a single dataset to be included together (one data citation). Is it possible to aggregate some of the metadata from the individual supp files being uploaded, or to have the author fill in metadata at the dataset level rather than file by file? We do not yet have a way to index or store file-level information in Dataverse. I imagine most people would fill in any relevant dataset-level information the first time they add a supp file? Not sure how this would work in your system though.

@jwhitney

If you're going to be fairly conservative in the number of dataset-level fields, we could look at handling the description the same way as external data citations. The field's presented in the suppfile form, but is stored in article metadata, so the same value's shared across suppfiles.

@axfelix

axfelix commented Sep 24, 2014

I was afraid of suggesting that for the amount of additional logic it'd take to show the same field (and pre-populate it with whatever was already entered?) on repeat suppfile uploads, but if you want to take that route, it's fine with me!

@posixeleni
Contributor

@jwhitney how much time do you think you would need to implement these changes? Want to make sure our deadline (assigned to @pdurbin) for completing our part of this gives you enough time to do your part before the end of November.

cc/ @axfelix

@jwhitney

That's lots of time: adding metadata is fairly straightforward.

Do you want to drop article abstract from dataset metadata altogether, in favour of the desc. field to be added to the suppfile form? Or use abstract as a default if dataset desc. not provided?

@posixeleni
Contributor

I prefer your second option:

use abstract as a default if dataset desc. not provided
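A minimal Python sketch of that fallback rule, assuming the description and abstract arrive as plain strings (the function name is hypothetical, and treating a whitespace-only description as "not provided" is an assumption):

```python
def dataset_description(dataset_desc, article_abstract):
    """Prefer the dataset-level description; fall back to the article abstract.

    Hypothetical helper: None, an empty string, or a whitespace-only string
    all count as "not provided".
    """
    desc = (dataset_desc or "").strip()
    return desc if desc else article_abstract


print(dataset_description("", "Abstract of the article."))
print(dataset_description("Item-level NIT test results.", "Abstract of the article."))
```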

@mercecrosas
Member

Alex, I agree completely - we should not create one dataset per file.

On Sep 24, 2014, at 6:31 PM, axfelix wrote:

I'd rather not have to create one dataset for each suppfile -- that seems inconsistent with Dataverse attempting to provide a single Data Citation / DOI for a landing page that includes the entire study w/ multiple files, and probably creates more cruft than necessary.

We've talked a few times now about the possibility of providing dataset-level metadata that is a) common to all of the suppfiles being uploaded but b) necessarily different from article-level metadata, and we haven't figured out a desirable solution yet.

I'm not quite seeing what the problem is with date available, since that wouldn't be filled by an author anyhow -- that should be an internal value that's getting pulled from the DB at the time of publication, no? As for description, we could probably add as many as 1-2 fields on the OJS "add a suppfile / these are the suppfiles you've uploaded so far" page (i.e., not the "metadata per this individual suppfile page") for "any descriptive information that is common to all of your supplementary data but not to the article," though I imagine most authors would disregard it, and I'd want to be really conservative here.



@axfelix

axfelix commented Sep 25, 2014

Yeah, I agree with Eleni -- use abstract if dataset description not provided.

@pdurbin
Member

pdurbin commented Oct 17, 2014

There are a few attribute hacks I have added in under Creator (affiliation), and Contributor (Funder).

This actually seems nicely in line with "New affiliation attributes for Creator and Contributor" in the just-released DataCite Metadata Schema Version 3.1: http://www.datacite.org/node/141

It's like @posixeleni is psychic. :)

I wonder if there's anything else in there we should consider.

@posixeleni
Contributor

Spoke with @pdurbin and clarified that the final requirement for this ticket is that we modify the atom xml for two elements:

@pdurbin
Member

pdurbin commented Oct 17, 2014

Right, and to be clear, we're no longer making any changes to dates. We used some strikethrough above for dcterms:created and dcterms:available, the equivalents of which are both system generated. We won't be supporting use of these elements in SWORD entry XML.

@axfelix

axfelix commented Oct 17, 2014

Nice to see they're being proactive about changes to the DataCite standard at this point. Nothing else seems immediately worth adopting, but still good.

@pdurbin
Member

pdurbin commented Oct 17, 2014

As of 5257f0d we now support adding authorAffiliation, contributorName, and contributorType via SWORD via elements like this:

+ <dcterms:creator affiliation="Department of Psychology, University of Tartu, Estonia">Must, Olev</dcterms:creator>
+ <dcterms:contributor type="Funder">Estonian Scientific Foundation: grant no 2387 and 5856. European Social Fund: a Primus grant (#3-8.2/60) to Anu Realo. Baylor University financial support for data quality control in archive.</dcterms:contributor>

@kcondon to test you'll first need to run these SQL statements...

INSERT INTO foreignmetadatafieldmapping (id, foreignfieldxpath, metadatablockname, datasetfieldname, isattribute, parentfieldmapping_id, foreignmetadataformatmapping_id) VALUES (15, 'affiliation', 'citation', 'authorAffiliation', TRUE, 3, 1 );
INSERT INTO foreignmetadatafieldmapping (id, foreignfieldxpath, metadatablockname, datasetfieldname, isattribute, parentfieldmapping_id, foreignmetadataformatmapping_id) VALUES (16, ':contributor', 'citation', 'contributorName', FALSE, NULL, 1 );
INSERT INTO foreignmetadatafieldmapping (id, foreignfieldxpath, metadatablockname, datasetfieldname, isattribute, parentfieldmapping_id, foreignmetadataformatmapping_id) VALUES (17, 'type', 'citation', 'contributorType', TRUE, 16, 1 );

(or drop the database and set up again)

... then, if you run https://github.com/IQSS/dataverse/blob/master/scripts/api/data-deposit/create-dataset-899-expansion from the root of the repo (operates on https://github.com/IQSS/dataverse/blob/master/scripts/api/data-deposit/data/atom-entry-study-899-expansion.xml ) you should end up with a dataset like the one below.
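One way to read the three mapping rows above is sketched below in Python: a non-attribute row names a dcterms element, and an attribute row names an attribute on the element of its parent row. This is only an interpretation for illustration, not how Dataverse's importer is implemented. Row 3 is not shown in this thread, so its contents here (`:creator` mapped to `authorName`) are an assumption, as is treating the leading `:` as "element in the dcterms namespace".

```python
import xml.etree.ElementTree as ET

DCTERMS = "http://purl.org/dc/terms/"

# (id, foreignfieldxpath, datasetfieldname, isattribute, parentfieldmapping_id)
MAPPINGS = [
    (3, ":creator", "authorName", False, None),  # ASSUMED: row 3 is not shown in the thread
    (15, "affiliation", "authorAffiliation", True, 3),
    (16, ":contributor", "contributorName", False, None),
    (17, "type", "contributorType", True, 16),
]


def apply_mappings(entry_xml):
    """Return one dict per mapped element, including its mapped attributes."""
    root = ET.fromstring(entry_xml)
    records = []
    for map_id, xpath, field, is_attr, _parent in MAPPINGS:
        if is_attr:
            continue  # attribute rows are handled through their parent row
        tag = f"{{{DCTERMS}}}{xpath.lstrip(':')}"
        attr_rows = [m for m in MAPPINGS if m[3] and m[4] == map_id]
        for element in root.findall(tag):
            record = {field: (element.text or "").strip()}
            for _id, attr_name, attr_field, _is_attr, _pid in attr_rows:
                if attr_name in element.attrib:
                    record[attr_field] = element.attrib[attr_name]
            records.append(record)
    return records


# Shortened sample entry; the values are placeholders.
ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom" xmlns:dcterms="http://purl.org/dc/terms/">
  <dcterms:creator affiliation="University of Tartu">Must, Olev</dcterms:creator>
  <dcterms:contributor type="Funder">Estonian Scientific Foundation</dcterms:contributor>
</entry>"""

records = apply_mappings(ENTRY)
print(records)
```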

Unfortunately, it is expected that the controlled vocabulary of contributorType (Editor, Funder, Researcher, etc.) is not enforced, but we will fix this in #973.

@posixeleni I left a reminder in the SWORD backward compatibility doc that we need to document this new functionality: "FIXME: Document via example (in XML) how we now support authorAffiliation, contributorName, and contributorType as of #899."

[Screenshot: 899]

@pdurbin
Member

pdurbin commented Oct 24, 2014

@esotiri or @kcondon after you have tested this on dvn-build can you please do a build on https://apitest.dataverse.org so @jwhitney can test? This was mentioned by @posixeleni at [pkp-dataverse-integration] Email Your Updates for PKP-OJS project. Thanks!

@esotiri
Contributor

esotiri commented Oct 27, 2014

the latest code is in https://apitest.dataverse.org for @jwhitney to test.

@jwhitney

Thanks! Testing now.

pdurbin added a commit that referenced this issue Oct 28, 2014
- dcterms:available never came to fruition, see #899
- also multiple dcterms:coverage works fine now, enabling
@esotiri
Contributor

esotiri commented Oct 29, 2014

dataset created with atom-entry-study-899-expansion.xml

@esotiri
Contributor

esotiri commented Oct 29, 2014

issue resolved

@esotiri esotiri closed this as completed Oct 29, 2014
@pdurbin
Member

pdurbin commented Oct 29, 2014

Unable to parse SWORD entry

This is the expected error if you try to use SWORD with a non-existent XML input file. Please see #893 (comment) for details.
