add paratext systemId elements to metadata 2.1.1 #17

Open
ericpyle opened this issue Nov 9, 2017 · 26 comments

Comments

@ericpyle
Contributor

ericpyle commented Nov 9, 2017

In metadata 2.1.1:

<systemId type="paratext">
    <projectType>Auxiliary</projectType>
    <basedOn>
        <name>engWEB14</name>
        <id>9879dbb7cfe39e4d5d6f7f9dbd0c6414691a036e</id>
    </basedOn>
</systemId>

Where optional <projectType> can be Standard|Daughter|StudyBible|StudyBibleAdditions|BackTranslation|Auxiliary|TransliterationManual|TransliterationWithEncoder|ConsultantNotes|GlobalConsultantNotes|GlobalAnthropologyNotes

Where optional <basedOn> has required <name> (lenGe2String) and required <id> (ptId)
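
As a rough illustration of the constraints above, here is a minimal Python sketch that checks a `<systemId type="paratext">` element. The function name `check_system_id` is hypothetical, and the two type rules are assumptions: `lenGe2String` is read as "a string of at least 2 characters" and `ptId` as a 40-character lowercase hex string like the one in the example. The real schema definitions may differ.

```python
import re
import xml.etree.ElementTree as ET

# Allowed values, copied from the list in this comment.
PROJECT_TYPES = {
    "Standard", "Daughter", "StudyBible", "StudyBibleAdditions",
    "BackTranslation", "Auxiliary", "TransliterationManual",
    "TransliterationWithEncoder", "ConsultantNotes",
    "GlobalConsultantNotes", "GlobalAnthropologyNotes",
}

# Assumption: a ptId is 40 lowercase hex characters, as in the example id.
PT_ID = re.compile(r"[0-9a-f]{40}")


def check_system_id(xml_text):
    """Return a list of problems found in a paratext systemId element."""
    elem = ET.fromstring(xml_text)
    problems = []

    # projectType is optional, but if present it must be a known value.
    project_type = elem.findtext("projectType")
    if project_type is not None and project_type.strip() not in PROJECT_TYPES:
        problems.append("unknown projectType: %r" % project_type.strip())

    # basedOn is optional, but if present both children are required.
    based_on = elem.find("basedOn")
    if based_on is not None:
        name = (based_on.findtext("name") or "").strip()
        pt_id = (based_on.findtext("id") or "").strip()
        if len(name) < 2:  # assumed meaning of lenGe2String
            problems.append("basedOn/name must have at least 2 characters")
        if not PT_ID.fullmatch(pt_id):
            problems.append("basedOn/id does not look like a ptId")

    return problems
```

An empty list means the element passed these checks; it says nothing about the rest of the metadata document.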

In text metadata 1.5:

<systemId type="paratext"
          projectType="Auxiliary"
          basedOnName="engWEB14"
          basedOnId="9879dbb7cfe39e4d5d6f7f9dbd0c6414691a036e"></systemId>
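
To show how the two shapes correspond, here is a small Python sketch that flattens the 2.1.1 element form into the 1.5 attribute form. The helper name `to_attribute_form` is hypothetical, and this is only one possible mapping under the naming proposed in this comment, not an agreed migration rule.

```python
import xml.etree.ElementTree as ET


def to_attribute_form(xml_text):
    """Build a 1.5-style systemId element from a 2.1.1-style one."""
    src = ET.fromstring(xml_text)
    out = ET.Element("systemId", {"type": src.get("type", "paratext")})

    # Optional projectType child becomes a projectType attribute.
    project_type = src.findtext("projectType")
    if project_type is not None:
        out.set("projectType", project_type.strip())

    # Optional basedOn child becomes basedOnName / basedOnId attributes.
    based_on = src.find("basedOn")
    if based_on is not None:
        out.set("basedOnName", (based_on.findtext("name") or "").strip())
        out.set("basedOnId", (based_on.findtext("id") or "").strip())

    return out
```

Calling `ET.tostring(to_attribute_form(xml_text), encoding="unicode")` would serialize the result back to XML.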
@ericpyle
Contributor Author

ericpyle commented Nov 9, 2017

see also #8 (review) (under ## add new elements to text metadata 2.0 under to capture alternative project types)

@ericpyle
Contributor Author

ericpyle commented Nov 13, 2017

@klassenjm how should we proceed on this to unblock Lois who is trying to upload to DBL a (transliteration) project that has already been uploaded? (https://thedigitalbiblelibrary.org/entry?id=19b20027201977c5)

  1. I can remove (at least partially) my restriction in Paratext from uploading projects with a basedOn text.
  2. We can prioritize fixing 1.5 and 2.1.1 to support basedOn elements to capture metadata for these types of uploads.

@klassenjm klassenjm added this to the Sprint 4 milestone Nov 13, 2017
@klassenjm
Contributor

klassenjm commented Nov 13, 2017

My current thoughts:

That we not support upload of the following as regular DBL entries (resource only uploads excepted):

  • StudyBibleAdditions
  • ConsultantNotes
  • GlobalConsultantNotes
  • GlobalAnthropologyNotes

It would be very nice if we could retain basedOn. What do we capture there? The PT project GUID? a name? If we could capture the GUID and see if that was already an entry in DBL - and then set up a relation in the DBL metadata -- would that be good? (I think so -- but I don't feel like I want to come across as though it's a requirement yet).

@mvahowe
Contributor

mvahowe commented Nov 13, 2017

@ericpyle @klassenjm Could we do this on the wishlist so we have some hope of finding the discussion when we come to make the decisions? https://github.com/ubsicap/dbl-archive-validation/blob/master/v2/2_1_1/wishlist.md

@klassenjm
Contributor

Yes

@ericpyle
Contributor Author

@klassenjm I know that Biblica and others were hoping we'd support Auxiliary, so they have a workflow that supports forking active projects before uploading them to DBL. I don't remember where that discussion landed given that PT Registry does not host those projects, except that perhaps the PT uploader could potentially borrow the needed metadata from basedOn (unless that just is not workable in all cases).

also fwiw, DBL already has GlobalAnthropologyNotes as resourceOnly. Certainly the PT uploader should have certain precautions about projects that typically get uploaded, but resourceOnly does allow non-typical projects. We aren't technically running the same schema validation for resourceOnly uploads, but at the same time we are using the same metadata structure for transporting metadata with resourceOnly uploads.

@ericpyle
Contributor Author

@mvahowe sorry, please move my comment to the appropriate place. I was expecting a pull-request type place to do the comment, but it doesn't look like that's what we can do on the document you linked to?

@klassenjm
Contributor

@ericpyle I changed my "current thoughts". Will look at updating the wishlist.

@ericpyle
Contributor Author

@mvahowe I'm a little confused about how to maintain a discussion in a wishlist? Please advise.

@klassenjm what are your "current thoughts"?

Re: basedOn.

> What do we capture there? The PT project GUID? a name?

Both, as you can see in the example snippet at the top:

    <basedOn>
        <name>engWEB14</name>
        <id>9879dbb7cfe39e4d5d6f7f9dbd0c6414691a036e</id>
    </basedOn>

> If we could capture the GUID and see if that was already an entry in DBL - and then set up a relation in the DBL metadata -- would that be good? (I think so -- but I don't feel like I want to come across as though it's a requirement yet).

I had asked @mvahowe or you about this in the past re: relationships element, and was advised against that (can't remember the reason). Perhaps it's in the old trello card. It would be good to support some kind of actionable linkage on the DBL webpage, at least.

@mvahowe
Contributor

mvahowe commented Nov 14, 2017

@ericpyle, add your comments under the others and make a PR for the changes, which I'll approve. (Adding stuff to the wishlist doesn't mean we'll do it, for now we're just collecting the ideas.)

@smorrison smorrison modified the milestones: Sprint 4, Sprint 5 Nov 27, 2017
@klassenjm klassenjm modified the milestones: Sprint 5, Sprint 6 Dec 11, 2017
@klassenjm klassenjm modified the milestones: Sprint 6, Sprint 7 Jan 8, 2018
@klassenjm klassenjm removed this from the Sprint 7 milestone Jan 22, 2018
@mvahowe
Contributor

mvahowe commented Jan 23, 2018

@ericpyle, insofar as I understand the problem we are trying to solve, this looks ok. My main question is how urgently this needs to happen. If it's blocking something, we need to ship 2.1.1, which means discussing the other possible changes, finalizing the spec, consulting with partners, ensuring we can migrate all existing entries, writing the scripts to do that migration and then maybe migrating everything. How urgently does that need to happen?

@ericpyle
Contributor Author

@mvahowe it's not urgent as far as 2.1.1 goes. I would just like confirmation of the data structure for projectType and basedOn so I can start using it for my resourceOnly uploads (which are not consumed by LCH partners). I suppose I could version my resourceOnly uploads metadata as "2.1.1" to be more consistent with the specification. Or maybe I could use "2.1.1r" to indicate it's not actually 2.1.1 but 2.1.1-like.

@mvahowe
Contributor

mvahowe commented Jan 23, 2018

@ericpyle But doesn't using them for resourceOnly uploads mean that the server needs to validate using 2.1.1? If so I think we need to fix the specification of 2.1.1 first. If we don't, what happens when we do fix the specification and 2.1.1 then means something else?

@ericpyle
Contributor Author

@mvahowe I'm planning to do a Python-based validation or just a Schematron validation for resourceOnly uploads. Otherwise we'd need to add resourceOnly as another expression of metadata, but I wouldn't want you to feel the need to add or refactor what you've already done for all the other schemas.

@mvahowe
Contributor

mvahowe commented Jan 23, 2018

I'm quite happy to add a resourceOnly variant to the schema set and I suspect it will create less confusion in the long term (because, eg, any operation we run across all entries will need all entries to be valid according to the same schema.)

@ericpyle
Contributor Author

ericpyle commented Jan 23, 2018

@mvahowe if we made a resourceOnly variant, would you propose that gets added to type/isParatextResourceOnly?

@mvahowe
Contributor

mvahowe commented Jan 23, 2018

@ericpyle I think we could find a pithier label but, yes, something like that.

@ericpyle
Contributor Author

@mvahowe can you suggest a pithier label? (Many regular uploads can become Paratext resources.)

@ericpyle
Contributor Author

ericpyle commented Jan 23, 2018

@mvahowe @smorrison one thing we need to think about is how to handle book lists for resourceOnly uploads. These will only have source/source.zip files. However, we do need some way for canons data to be able to communicate which books the archivist would like DBL to include in the pt resource download. Perhaps that's just a matter of every src pointing to "source/source.zip" and role being the book name?

@klassenjm
Contributor

When someone does a resourceOnly upload -- if I understand what happens today -- they need a Canon in Paratext and Paratext only includes in the upload the books mentioned in the Canon.

What's the simplest way in the metadata we can refer to that list of books? What do resource downloads do today to know which books should be downloaded (the total of books mentioned in all Canons, I expect)? Does it look at publications for that? If so, could we create one default publication for resourceOnly uploads?

@ericpyle
Contributor Author

ericpyle commented Jan 25, 2018

@klassenjm yes, a canon is needed, although by default in the absence of a project canon, an ad hoc in-memory canon will be created on the fly based on all books present. That currently populates the contents/bookList in metadata 1.5, which fills a bookList table. That table is used on download to know which books should be downloaded. For regular text uploads (not resourceOnly), we also append (on the fly) any peripheral books to what was listed in the bookList (since peripherals are not allowed in regular uploads).

In typical metadata 1.5 > 2.1 transformations/migrations, the old contents/bookList maps to publication/canonicalContent and the order is given in structure/content. However, unless we are OKAY with letting peripherals in resourceOnly uploads also be listed in canonicalContent, we'd need to provide either another table or at least another metadata source which is joined to provide the full bookList for resourceOnly uploads. Personally, I'm OKAY with canonicalContent being used this way, since it "just works" in the current system, but I understand why @mvahowe might think that's insane.
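
The book-list behaviour described in this comment can be sketched roughly as follows in Python. This is an illustration under stated assumptions, not the actual Paratext or DBL code: the function names are hypothetical, `PERIPHERALS` is only an illustrative subset of peripheral book codes, and the sketch assumes the caller knows which books are physically present in the upload.

```python
# Assumption: an illustrative subset of peripheral book codes.
PERIPHERALS = ("FRT", "BAK", "OTH", "INT", "CNC", "GLO", "TDX", "NDX")


def build_book_list(books_present, canon=None, resource_only=False):
    """Book list recorded in the metadata at upload time."""
    # With no project canon, fall back to an ad hoc canon of all books present.
    chosen = list(canon) if canon is not None else list(books_present)
    chosen = [book for book in chosen if book in books_present]
    # Regular uploads may not list peripherals; resourceOnly uploads may.
    if not resource_only:
        chosen = [book for book in chosen if book not in PERIPHERALS]
    return chosen


def books_for_download(book_list, books_present, resource_only=False):
    """Books handed out on download, derived from the stored book list."""
    result = list(book_list)
    # Regular uploads get any available peripherals appended on the fly.
    if not resource_only:
        result += [
            book for book in books_present
            if book in PERIPHERALS and book not in result
        ]
    return result
```

In this sketch a resourceOnly upload carries its peripherals in the stored list itself, which mirrors the option of letting canonicalContent hold them.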

@mvahowe
Contributor

mvahowe commented Feb 6, 2018

@ericpyle I believe we've now done this?

@ericpyle
Contributor Author

ericpyle commented Feb 6, 2018

@mvahowe we still need projectType for text medium

@ericpyle
Contributor Author

ericpyle commented Feb 6, 2018

@mvahowe so, I guess paratextZipResourceProjectType can be made more general. Sorry I didn't review that part of the spec that closely

@ericpyle
Contributor Author

ericpyle commented Feb 6, 2018

@mvahowe although to be more valid, text medium should NOT be any of the note types

      <value>ConsultantNotes</value>
      <value>GlobalConsultantNotes</value>
      <value>GlobalAnthropologyNotes</value>

But if that's too hard to pull off you can include them as well. Perhaps you could catch that in a Schematron check instead?
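
For what it's worth, the rule itself is small. A Python sketch of the check might look like this, where the function name and the `"text"` medium label are assumptions for illustration rather than anything taken from the spec:

```python
# The three note types that should not appear for the text medium.
NOTE_PROJECT_TYPES = frozenset({
    "ConsultantNotes",
    "GlobalConsultantNotes",
    "GlobalAnthropologyNotes",
})


def check_project_type_for_medium(project_type, medium):
    """Return an error message, or None if the combination is acceptable."""
    # Assumption: the text medium is identified by the string "text".
    if medium == "text" and project_type in NOTE_PROJECT_TYPES:
        return "projectType %r is a notes type, not allowed for text" % project_type
    return None
```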

@mvahowe
Contributor

mvahowe commented Feb 7, 2018

@ericpyle This seems to be one of the things that RelaxNG does make easy.
