Validate for empty or ill-formatted definitions and examples #151

fcbond · 2021-11-06T07:24:04Z

The MCR wordnet candidate had some interesting issues with definitions, although they probably apply more broadly (definitely to examples). I don't think these are bugs, but possibly something we should add a warning for? I think there are two issues, neither of which is illegal XML.

Definition contains only whitespace or is empty:

		<Synset id="spa-30-80001224-n" ili="ili-30-80001224-n">
			<Definition>
				
			</Definition>
		</Synset>

Here we should warn something like 'Definition for synset {ID} is empty, better to omit'.

Definition has whitespace before and after:

		<Synset id="spa-30-80001223-n" ili="ili-30-80001223-n">
			<Definition>
				Pequeña malformación que causa la dilatación y fragilidad vascular del colon, dando como resultado una pérdida intermitente de sangre desde el tracto intestinal.
			</Definition>
		</Synset>

Maybe here we should warn something like 'Definition for synset {ID} contains unnecessary whitespace'.
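
As a rough illustration of the two checks, here is a minimal sketch using only the Python standard library. It is not Wn's validation code: the function name `check_definitions`, the simplified sample document, and its synset IDs are all made up for the example, and the messages just reuse the wording suggested above. The same check would apply to examples.

```python
import xml.etree.ElementTree as ET


def check_definitions(xml_text):
    """Return warning strings for empty or padded <Definition> text.

    Hypothetical helper for illustration; not part of Wn.
    """
    warnings = []
    root = ET.fromstring(xml_text)
    for synset in root.iter("Synset"):
        ssid = synset.get("id")
        for definition in synset.findall("Definition"):
            text = definition.text or ""
            if not text.strip():
                # Empty element or whitespace only
                warnings.append(
                    f"Definition for synset {ssid} is empty, better to omit"
                )
            elif text != text.strip():
                # Real content with leading or trailing padding
                warnings.append(
                    f"Definition for synset {ssid} contains unnecessary whitespace"
                )
    return warnings


# Simplified stand-in for a WN-LMF file (IDs are invented)
sample = """<Lexicon>
  <Synset id="s1"><Definition>
  </Definition></Synset>
  <Synset id="s2"><Definition>
    padded text
  </Definition></Synset>
  <Synset id="s3"><Definition>clean text</Definition></Synset>
</Lexicon>"""

for message in check_definitions(sample):
    print(message)
```

Only `s1` and `s2` are reported; the unpadded definition in `s3` passes silently.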

@jmccrae should we add documentation to the https://github.com/globalwordnet/schemas saying the best practice is not to pad, and to omit empty definitions (and examples), or is this too obvious?

@goodmami should we strip the text of padding before adding it to the database?


goodmami · 2021-11-07T04:27:04Z

Are you talking about warning during validation or during normal use of Wn? To me, the former seems acceptable but not the latter, as this is just bad formatting and not something that makes the data less correct or usable.

In Wn, I try to store in the database an accurate representation of what was in the WN-LMF file, such that exporting the data would result in an equivalent WN-LMF file, so I don't think stripping the definitions is a good solution. However, it would be fine with me if OMW wanted to fix these things during the compilation of its wordnets.

fcbond · 2021-11-07T05:16:44Z

Hi, I was indeed thinking of warning during validation. I think returning \n\t\t\n\t\t\n for the definition of a synset, rather than None, is less correct and does make it less usable. However, as you say, the ideal time to catch this is when the wordnet is made, not when we load.


-- Francis Bond <http://www3.ntu.edu.sg/home/fcbond/> Division of Linguistics and Multilingual Studies Nanyang Technological University

goodmami · 2021-11-07T16:38:53Z

> I was indeed thinking of warning during validation.

Ok, good. It wasn't clear, so I changed the title. We could also check for similar whitespace issues in other elements like <ILIDefinition>, <Count>, <Tag>, and <Pronunciation>, or in attribute values, like for writtenForm or subcategorizationFrame.
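
A sketch of how such a broader check might look, again with the standard library only. The element and attribute names come from the list above (plus `Definition` and `Example`); the function name `find_padding` and the sample fragment are hypothetical, and the fragment is flattened rather than a valid WN-LMF document.

```python
import xml.etree.ElementTree as ET

# Elements whose text content is checked, and attributes whose values are checked
TEXT_ELEMENTS = ("Definition", "Example", "ILIDefinition", "Count", "Tag", "Pronunciation")
ATTRIBUTES = ("writtenForm", "subcategorizationFrame")


def find_padding(xml_text):
    """Return (tag, location, value) tuples for padded or whitespace-only values.

    Hypothetical helper for illustration; not part of Wn.
    """
    problems = []
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.tag in TEXT_ELEMENTS:
            text = elem.text or ""
            if text != text.strip():
                problems.append((elem.tag, "text", text))
        for name in ATTRIBUTES:
            value = elem.get(name)
            if value is not None and value != value.strip():
                problems.append((elem.tag, name, value))
    return problems


# Flattened fragment for illustration only (not valid WN-LMF structure)
sample = """<Lexicon>
  <Lemma writtenForm=" dog " partOfSpeech="n"/>
  <Count> 3</Count>
  <Tag>ok</Tag>
</Lexicon>"""

for problem in find_padding(sample):
    print(problem)
```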

> I think returning \n\t\t\n\t\t\n for the definition of a synset, rather than None, is less correct and does make it less usable. However, as you say, the ideal time to catch this is when the wordnet is made, not when we load.

Right. It's less correct for the language, but it's an accurate representation of what's in the data. I don't think Wn should be deciding what it thinks a language should look like. The data should do that.

francis-dion · 2022-05-13T15:44:30Z

My understanding is that, in XML, white space after the opening tag and before the closing tag should be ignored.
I didn't trace the original specs, but found multiple references, including this one from Adobe:
> XML ignores the first sequence of white space immediately after the opening tag and the last sequence of white space immediately before the closing tag. XML translates non-space characters (tab and new-line) into a space character and consolidates all multiple space characters into a single space

If the author of a wordnet wants/needs white space preserved, they should use the xml:space attribute. Here's a quote from O'Reilly's XML Pocket Reference:
> When xml:space is used on an element with a value of preserve, the whitespace in that element's content must be preserved as is by the application that processes it. The whitespace is always passed on to the processing application, but xml:space provides the application with a hint regarding how to process it.

Otherwise, I believe leading/trailing white space should definitely be stripped. I also think (albeit less strongly :-) that wn should translate non-space characters (tab and new-line) into a space character and consolidate all multiple space characters into a single space.
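
For concreteness, the two levels of processing described here could be sketched as follows. The function names are invented for the example and nothing here is Wn's API; the character class is limited to the four characters XML treats as white space (space, tab, carriage return, line feed).

```python
import re

# Runs of XML white space: space, tab, carriage return, line feed
_XML_WS = re.compile(r"[ \t\r\n]+")


def strip_padding(text):
    """Remove leading and trailing XML white space only."""
    return text.strip(" \t\r\n")


def normalize_whitespace(text):
    """Collapse every run of XML white space to one space, then trim the ends."""
    return _XML_WS.sub(" ", text).strip(" ")


padded = "\n\t\tPequeña  malformación\n\t\t"
print(repr(strip_padding(padded)))         # interior double space is kept
print(repr(normalize_whitespace(padded)))  # interior double space is collapsed
print(repr(normalize_whitespace("\n\t\t\n\t\t\n")))  # whitespace-only input becomes ''
```

A whitespace-only definition normalizes to the empty string, so whether to report that as an empty string or as None would still be a separate decision.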

goodmami · 2022-05-24T04:57:39Z

Thanks, @francis-dion, that's a good point. I'd forgotten about xml:space. The W3 spec says this about the default value:

> The value "default" signals that applications' default white-space processing modes are acceptable for this element; the value "preserve" indicates the intent that applications preserve all the white space.

So when xml:space is not specified, it's not that the spacing should be stripped, but that the application should use its default whitespace processing. So, yes, Wn could strip (and normalize) whitespace if xml:space is not present. One issue is what to do when a wordnet author wishes to preserve whitespace. Obviously the answer is to use xml:space on the element, but the WN-LMF spec needs to declare the attribute for it to be used. From the same W3 spec:

> In valid documents, this attribute, like any other, MUST be declared if it is used.
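
A minimal sketch of what "strip unless xml:space is preserve" could look like with the standard library, which is a non-validating parser and so accepts the attribute whether or not a DTD declares it. The function name `definition_texts` and the sample document are hypothetical, this is not how Wn loads data, and for brevity the sketch only looks at the attribute on the `<Definition>` element itself and ignores values inherited from ancestor elements.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:space under the predefined XML namespace URI
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def definition_texts(xml_text):
    """Map synset id to definition text, stripping unless xml:space="preserve".

    Hypothetical helper for illustration; inheritance of xml:space from
    ancestor elements is deliberately not handled.
    """
    result = {}
    root = ET.fromstring(xml_text)
    for synset in root.iter("Synset"):
        definition = synset.find("Definition")
        if definition is None:
            continue
        text = definition.text or ""
        if definition.get(XML_SPACE) != "preserve":
            text = text.strip()  # default processing chosen by the application
        result[synset.get("id")] = text
    return result


# Simplified stand-in for a WN-LMF file (IDs are invented)
sample = """<Lexicon>
  <Synset id="s1"><Definition>
    stripped
  </Definition></Synset>
  <Synset id="s2"><Definition xml:space="preserve">  kept  </Definition></Synset>
</Lexicon>"""

print(definition_texts(sample))
```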

fcbond added the enhancement and good first issue labels Nov 7, 2021

goodmami changed the title ~~How to deal with whitespace padding?~~ Validate for empty or ill-formatted definitions and examples Nov 7, 2021

noe mentioned this issue Mar 8, 2022

Missing Spanish definitions #159

Closed

goodmami mentioned this issue May 7, 2023

Add xml:space attribute for WN-LMF format globalwordnet/schemas#70

Closed
