Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

resourceOnly variant plus validating example document #30

Merged
merged 6 commits into from
Feb 6, 2018

Conversation

mvahowe
Copy link
Contributor

@mvahowe mvahowe commented Feb 2, 2018

The valid example document is in v2/2_1_1/test_docs. The gory detail is in the 2.1.1 schema. The easiest way to explore this is to edit the example document with a validating XML editor.

I've attacked the problem from several directions:

  • In some cases I've made stuff optional for resourceOnly.

  • In some cases I've modifed the schema for all variants.

Features of note include:

  • projectType and optional basedOn under type section, which means the current paratext systemId still works. (@ericpyle, can you give me the definitive list of projectTypes?)

  • Non-canonical book codes are now valid @ROLE values. (I don't know why I hadn't done this before since I had the list of non-canonical books in the schema. Maybe I was waiting for the end of the peripheral content discussion.)

  • New, optional srcPart attribute for content elements, which provides a path to find content within a manifest entry. Currently the options are

zip://.* for something inside a zip file

xpath://.* for something inside an XML file

#* for an XML/HTML fragment id

canonicalContent is missing because resourceOnly content is not canonical (in the sense of being Scripture), even if it happens to use the same codes.

@mvahowe mvahowe requested a review from ericpyle February 2, 2018 10:54
@ericpyle
Copy link
Contributor

ericpyle commented Feb 2, 2018

Thanks, Mark. See my comments below.

@mvahowe wrote:

The valid example document is in v2/2_1_1/test_docs. The gory detail is in the 2.1.1 schema. The easiest way to explore this is to edit the example document with a validating XML editor.

I've attacked the problem from several directions:

In some cases I've made stuff optional for resourceOnly.

Should we call this type medium paratextResource? Part of me feels like we need to becareful not to make it too generic, since the licensing, authentication, and authorization is completely different. I think we should avoid the impression that a generic uploader could try to use this format for other "resource" purposes. (On the other hand, I'd also try to discourage against trying to use it trying to make a paratext resource, since we assume more goes in the .zip than just books and non-canonical books.)

In some cases I've modifed the schema for all variants.

Features of note include:

projectType and optional basedOn under type section, which means the current paratext systemId still works.

I had mentioned elsewhere that ptSystemId can just be a "text" (and now a "resourceOnly") systemId. Do you know if/why ptSystemId and ptregSystemId are in the more general medium types as well? Is that because of possible relationships to a "text" with those properties?

(@ericpyle, can you give me the definitive list of projectTypes?)

Sure. it's the same list as in #17

Standard|Daughter|StudyBible|StudyBibleAdditions|BackTranslation|Auxiliary|TransliterationManual|TransliterationWithEncoder|ConsultantNotes|GlobalConsultantNotes|GlobalAnthropologyNotes

Non-canonical book codes are now valid @ROLE values. (I don't know why I hadn't done this before since I had the list of non-canonical books in the schema. Maybe I was waiting for the end of the peripheral content discussion.)

Perhaps we didn't want to allow those roles for regular uploads? Should be okay as long as usx validation will catch any attempts to upload peripherals for now.

New, optional srcPart attribute for content elements, which provides a path to find content within a manifest entry. Currently the options are

zip://.* for something inside a zip file

Okay. I'd still favor a new element under source for the purpose of listing books in the resource, since historically publications comes into play for type/isExpresssion=true and is more oriented toward LCH licensing, etc., whereas the source element is more opaque and related to knowledge of the archive. Also, since we dont' have plans to have multiple publications or divisions or need for names, we could make it a simple list (like canonicalContents). I don't think it's worth fighting over this, so if you really feel strongly about having it under pubications go for it.

@smorrison: have any opinions on this?

xpath://.* for something inside an XML file

#* for an XML/HTML fragment id

It's my understanding that @bryceallison advised against fragment ids. He'd rather we just break them into separate files.

canonicalContent is missing because resourceOnly content is not canonical (in the sense of being Scripture), even if it happens to use the same codes.

I don't have a strong feeling about keeping this or not. I'm certainly okay with populating canonicalContent with canonical books included in the source.zip to be consistent with how it's done in other entries.

<archivistName>Lois Gourley</archivistName>
<dateArchived>2017-07-19T21:36:30.284353</dateArchived>
<dateUpdated>2017-11-01T17:23:53.336625</dateUpdated>
<comments>Need a comment here</comments>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mvahowe please make "comments" optional.

</language>
<manifest>
<container uri="source">
<container uri="p8z-manifests">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mvahowe you can remove "p8z-manifests" and its contents. It's debris from web2py pt resource download behavior for debug purposes (since this keeps creeping into manifests, I'm guessing I should remove that behavior).

@mvahowe
Copy link
Contributor Author

mvahowe commented Feb 5, 2018

@ericpyle I've made another commit addressing some of your concerns.

I hear what you're saying about resourceOnly, but I just can't bring myself to create a fundamental "type" in DBL that is explicitly tied to one particular client technology. In practice I don't think anyone is going to try to "hack" resources into PT this way and, also, if that's really a concern, the issue we need to address is server-side security. ie, if Joe DBL User is not supposed to be able to create these things, DBL ought to enforce that restriction.

I left projectType in type where I think it belongs and added the enum. I put the 2 basedOn elements as optional ptId subelements (without a basedOn wrapper). In general, we haven't tried to work out which types of systemId can appear with which types of entry and I'm inclined not to start now.

Thank-you for humouring me re publications. I think this is the least strange way to handle this data. From my perspective (which has DBL at the centre and PT as one possible HTTP client), resourceOnly entries are, in a sense, published on the PT platform and, as I understand your explanations, the bookList controls which of the available resources are "published". I really don't want to create a precedent for DBL entries where we're acting as a glorified cloud backup solution for opaque, binary data. And I think the srcPart mechanism may be useful for other entry types in the future.

I totally agree with @bryceallison that, in an ideal world, we'd have one reference-able "thing" per file. That's why, last week, I was arguing that Paratext should give us the individual SFM files and not a zip. :-) But I understand why you don't want to do that, and I can imagine others making similar arguments about other media types in the future. I think the fragment identifier proposal came from me, but it was in reaction to an earlier proposal to use xpaths as ids. I think xpaths are awesome for finding things but utterly unsuitable as ids because there are literally an infinite number of xpaths to reach any particular node. So, if/when we need to label parts of an XML or HTML document, I think a fragment identifier is preferable to an xpath. But I'd be happy for us to never do either.

I can make comments optional if we really need to do this, but it seems like another departure from DBL practice that just tends to make things more opaque. I think I'd prefer a boilerplate comment that might one day turn into something informative. (Presumably these things do get updated, fixed, etc, from time to time?)

Finally, the schema doesn't require or forbid any manifest content so, if you don't want p8z-manifests to appear, you just need to stop uploading it. If you do upload it, it's going to appear in the manifest. Making a special case for that file would mean hacks right across DBL.

@ericpyle
Copy link
Contributor

ericpyle commented Feb 5, 2018

@mvahowe wrote:

@ericpyle I've made another commit addressing some of your concerns.

I hear what you're saying about resourceOnly, but I just can't bring myself to create a fundamental "type" in DBL that is explicitly tied to one particular client technology.

@mvahowe, you were the one who taught be about the need to tell clients the truth. :) (Btw, At some point we continued to call our metadata <DBLMetadata> although it has other "export this to Disk" clients/consumers as well, like Glyssen). Licensing, authentication, and authorization are all PT ecosystem oriented for this type, and there are no plans to change that, as far as I can see. On the other hand, there are "open-access" entries for which pt resources could theoretically be accessed and downloaded but that zip is password protected and so they couldn't even read the SFM if they wanted to. And there is no way for those public users to access the resourceOnly metadata.xml since they are confidential. Owners could get access to the metadata.xml and source.zip but we've established the likelihood of that use case is practically zero for using metadata.xml to publish/render SFM files from the source.zip.

In practice I don't think anyone is going to try to "hack" resources into PT this way and, also, if that's really a concern, the issue we need to address is server-side security. ie, if Joe DBL User is not supposed to be able to create these things, DBL ought to enforce that restriction.

I agree that it might make sense to add a server-side check, especially if that ever happens purposefully or accidentally.

I left projectType in type where I think it belongs and added the enum. I put the 2 basedOn elements as optional ptId subelements (without a basedOn wrapper). In general, we haven't tried to work out which types of systemId can appear with which types of entry and I'm inclined not to start now.

I was just observing that you have two duplicate enumerations of systemIds. one in textIdentificationElement and one in identificationElement. It seemed trivial (to me) and more "valid" to take the text oriented systemIds out of identificationElement but keep in them in textIdentificationElement. But if you'd rather be agnostic about systemIds, it's okay with me.

...I can make comments optional if we really need to do this, but it seems like another departure from DBL practice that just tends to make things more opaque. I think I'd prefer a boilerplate comment that might one day turn into something informative. (Presumably these things do get updated, fixed, etc, from time to time?)

Okay, you can keep comments required. If/when we migrate/auto-upgrade older resourceOnly data we'll need a boilerplate comment for those. I could show/enable the comment field in my pt8.1 uploader anticipating that uploads should fill in some comments. I think it would be good to have the comments for sure.

Finally, the schema doesn't require or forbid any manifest content so, if you don't want p8z-manifests to appear, you just need to stop uploading it. If you do upload it, it's going to appear in the manifest. Making a special case for that file would mean hacks right across DBL.

I think I had tried to explain that p8z-manifests fires are NEVER uploaded. It is a folder/files that get created whenever a user downloads ptResources. The auto-upgrader (and migration) scripts tends to assume they should be part of the manifest since they are in s3 (which I can't blame them).

@mvahowe
Copy link
Contributor Author

mvahowe commented Feb 5, 2018

@ericpyle The thing is, even in our last audio conversation, the position shifted from "Only PT will ever use these entries" to "There may well be other translation tools that eventually wish to use these entries but they'll have to do it the PT way." FWIW I think DBLMetadata works because DBL was, is and is likely to continue to be the curator of that schema. Apart from that, I accept that there are many things that are different about these entries, but I'm telling myself that it's a question of degree. (If I don't tell myself that, I find myself wanting to remove the entries and put the content somewhere else.)

You're right, we could make this systemId specific to resourceOnly media, but then, to be consistent, we ought to try to work out which media types can have a ptreg or dbp systemId. Ultimately we trust people to give us sensible systemIds, and even if we tighten the schema there will still be nothing to stop people making up systemIds by generating random strings, so I don't think complicating the schema gets us much extra value. But if you like I'll treat the paratext systemId as an exception.

My point about p8z-manifests is that it isn't a schema issue. There are no specific filenames in the schema. The manifests for migration and auto-upgrading are generated by Python code on the server, so it sounds to me like you need to report a bug against dbl-archive.

@ericpyle
Copy link
Contributor

ericpyle commented Feb 5, 2018

@mvahowe wrote:

@ericpyle The thing is, even in our last audio conversation, the position shifted from "Only PT will ever use these entries" to "There may well be other translation tools that eventually wish to use these entries but they'll have to do it the PT way."

You're right. It is possible in the future we'd want to allow owners to the additionally capability to deliver SFM content to non-paratext clients using a more DBL-native authentication/authorization scheme.

You're right, we could make this systemId specific to resourceOnly media, but then, to be consistent, we ought to try to work out which media types can have a ptreg or dbp systemId. Ultimately we trust people to give us sensible systemIds, and even if we tighten the schema there will still be nothing to stop people making up systemIds by generating random strings, so I don't think complicating the schema gets us much extra value. But if you like I'll treat the paratext systemId as an exception.

Not necessary, but I think you could feel confident about ptSystemId and ptregSystemId being text and resourceOnly systemIds if you did do it (the old schemas certainly grouped systemIds according to media type). gbcSystemId is also currently text only, but I know we have plans to broaden that.

My point about p8z-manifests is that it isn't a schema issue. There are no specific filenames in the schema. The manifests for migration and auto-upgrading are generated by Python code on the server, so it sounds to me like you need to report a bug against dbl-archive.

You're right that it's a DBL bug. https://github.com/ubsicap/dbl-archive/issues/293

@mvahowe
Copy link
Contributor Author

mvahowe commented Feb 6, 2018

@ericpyle We seemed to have reached the point where there was an argument for doing things either way so, since you're going to have to deal with this more often than me, I thought I'd try it your way.

Latest commit renames the type as paratextZipResource, which tells people the truth :-) and also leaves space for a more conventional paratextResource one day. I also limited the ptSystemId to this type. (Along the way I rediscovered that RelaxNG doesn't like complex logic around interleaved elements with the same name, ie we can't easily make ptSystemId required and other ids optional but, as I said yesterday, I don't think that's the biggest way for clients to mess up systemId if they really want to.)

Does this look fit for purpose now?

@ericpyle
Copy link
Contributor

ericpyle commented Feb 6, 2018

@mvahowe wrote:

@ericpyle We seemed to have reached the point where there was an argument for doing things either way so, since you're going to have to deal with this more often than me, I thought I'd try it your way.

haha. well, I had come to place where I was okay with your way, but if you're good with this, sounds okay with me. (I tend to err on the side of the YAGNI (you aint going to need it) principle of agile programming.)

Latest commit renames the type as paratextZipResource, which tells people the truth :-) and also leaves space for a more conventional paratextResource one day.

Sounds fine.

I also limited the ptSystemId to this type.

ptSystemId needs also to be in text type medium (that's the primary use case)

@mvahowe
Copy link
Contributor Author

mvahowe commented Feb 6, 2018

@ericpyle Better?

@@ -2,7 +2,7 @@
<!--
metadata.rnc v2.1.1
Generated by mark@notamacbook using makeRNC.py
Generated Tue Feb 6 15:47:19 2018
Generated Tue Feb 6 17:14:01 2018
-->
<grammar ns="" xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mvahowe as far as I can tell, you didn't actually add ptSystemId back into textIdentificationElement?

@mvahowe
Copy link
Contributor Author

mvahowe commented Feb 6, 2018

@ericpyle Indeed, and I found some other oddness such as the dbpId not being available for audio. Can you take another look, please?

@ericpyle
Copy link
Contributor

ericpyle commented Feb 6, 2018

@mvahowe wrote:

@ericpyle Indeed, and I found some other oddness such as the dbpId not being available for audio. Can you take another look, please?

Probably because at some point I confirmed that dbpId was no longer needed, and thus did not need to be in 2.1. But if it's there I suppose there's no harm in keeping it. I'm guessing that audio metadata migrations have dropped those.

@mvahowe
Copy link
Contributor Author

mvahowe commented Feb 6, 2018

@ericpyle I don't recall a conversation about dropping dbpId. I tried to put it back recently because we now have hundreds of DBP entries with the DBP ID in the comments section for want of a suitable systemId.

@ericpyle ericpyle merged commit a1ca974 into master Feb 6, 2018
@ericpyle ericpyle deleted the mvh/resource_only branch February 6, 2018 21:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants