Validate for empty or ill-formatted definitions and examples #151
Comments
Are you talking about warning during validation or during normal use of Wn? To me, the former seems acceptable but not the latter, as this is just bad formatting and not something that makes the data less correct or usable. In Wn, I try to store in the database an accurate representation of what was in the WN-LMF file, such that exporting the data would result in an equivalent WN-LMF file, so I don't think stripping the definitions is a good solution. However, it would be fine with me if OMW wanted to fix these things during the compilation of its wordnets.
Hi,
I was indeed thinking of warning during validation.
I think returning \n\t\t\n\t\t\n for the definition of a synset, rather
than None, is less correct and does make it less usable. However, as you
say, the ideal time to catch this is when the wordnet is made, not when we
load.
(In reply to Michael Wayne Goodman's comment of Sun, Nov 7, 2021, 12:27 PM, shown above.)
--
Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
Division of Linguistics and Multilingual Studies
Nanyang Technological University
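To make the difference between a whitespace-only definition and a missing one concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It only illustrates what a generic XML parser reports for the two cases; it is not Wn's loading code, and the bare `<Definition>` elements are simplified stand-ins for real WN-LMF content.

```python
import xml.etree.ElementTree as ET

# A definition that contains only newlines and tabs (simplified stand-in).
padded = ET.fromstring("<Definition>\n\t\t\n\t\t\n</Definition>")

# A definition element with no content at all.
empty = ET.fromstring("<Definition/>")

print(repr(padded.text))  # '\n\t\t\n\t\t\n'
print(repr(empty.text))   # None
```

The parser keeps the whitespace as text, so any tool that stores the text verbatim will return the padded string rather than `None`.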
Ok, good. It wasn't clear, so I changed the title. We could also check for similar whitespace issues in other elements like
Right. It's less correct for the language, but it's an accurate representation of what's in the data. I don't think Wn should be deciding what it thinks a language should look like. The data should do that.
My understanding is that, in XML, white space after the opening tag and before the closing tag should be ignored. If the author of a wordnet wants or needs white space preserved, they should use the xml:space attribute. Here's a quote from O'Reilly's XML Pocket Reference:

Otherwise, I believe leading and trailing white space should definitely be stripped. I also think (albeit less strongly :-) that Wn should translate white space characters other than the space (tab and newline) into a space character and consolidate runs of multiple spaces into a single space.
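The normalization proposed in this comment could be sketched roughly as follows. This is an illustration only, not Wn's behaviour: `normalized_text` is a hypothetical helper, and for simplicity it checks `xml:space` only on the element itself, although the attribute also applies to descendant elements.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix under this namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def normalized_text(elem):
    """Return elem's text with whitespace collapsed (hypothetical helper).

    Text is returned untouched when the element carries xml:space="preserve".
    """
    text = elem.text or ""
    if elem.get(XML_SPACE) == "preserve":
        return text
    # split() with no arguments splits on any run of whitespace and drops
    # leading/trailing whitespace, so joining with " " collapses everything.
    return " ".join(text.split())


padded = ET.fromstring("<Definition>\n\t a  small\tdog \n</Definition>")
kept = ET.fromstring('<Definition xml:space="preserve">  a  b </Definition>')

print(repr(normalized_text(padded)))  # 'a small dog'
print(repr(normalized_text(kept)))    # '  a  b '
```

Whether such a step belongs in the loader, in the exporter, or only in the wordnet build process is exactly the open question in this thread.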
Thanks, @francis-dion, that's a good point. I'd forgotten about
So when
The MCR wordnet candidate had some interesting issues with definitions, although they probably apply more broadly (definitely to examples). I don't think these are bugs, but possibly something we should add a warning for? I think there are two issues, neither of which is illegal XML.
Here we should warn something like 'Definition for synset {ID} is empty, better to omit'.
Maybe here we should warn something like 'Definition for synset {ID} contains unnecessary whitespace'.
@jmccrae should we add documentation to the https://github.com/globalwordnet/schemas saying the best practice is not to pad, and to omit empty definitions (and examples), or is this too obvious?
@goodmami should we strip the text of padding before adding it to the database?
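A validation pass producing the two warnings suggested above might look something like the sketch below. It uses only the standard library, and `check_text_elements` and the sample document are hypothetical; the sketch assumes `Definition` and `Example` appear as direct children of `Synset`, and it does not use or describe Wn's actual validation code.

```python
import xml.etree.ElementTree as ET


def check_text_elements(root):
    """Yield warnings for empty or padded Definition/Example elements.

    Hypothetical helper: only direct children of Synset are inspected.
    """
    for synset in root.iter("Synset"):
        synset_id = synset.get("id")
        for child in synset:
            if child.tag not in ("Definition", "Example"):
                continue
            text = child.text or ""
            if not text.strip():
                yield f"{child.tag} for synset {synset_id} is empty, better to omit"
            elif text != text.strip():
                yield f"{child.tag} for synset {synset_id} contains unnecessary whitespace"


# Made-up sample data, reduced to the elements the check looks at.
SAMPLE = """\
<LexicalResource>
  <Lexicon id="demo">
    <Synset id="demo-1"><Definition>
    </Definition></Synset>
    <Synset id="demo-2"><Definition>  a padded gloss </Definition></Synset>
    <Synset id="demo-3"><Definition>a clean gloss</Definition><Example> padded example</Example></Synset>
  </Lexicon>
</LexicalResource>
"""

warnings = list(check_text_elements(ET.fromstring(SAMPLE)))
for message in warnings:
    print(message)
```

Because the check only reports and never modifies the text, it fits the "warn during validation, store the data as written" position discussed in the comments.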