Validate for empty or ill-formatted definitions and examples #151

Open
fcbond opened this issue Nov 6, 2021 · 5 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@fcbond
Collaborator

fcbond commented Nov 6, 2021

The MCR wordnet candidate had some interesting issues with definitions, although they probably apply more broadly (definitely to examples). I don't think these are bugs, but possibly something we should add a warning for? I think there are two issues, neither of which is illegal XML.

  1. Definition contains only whitespace or is empty:
		<Synset id="spa-30-80001224-n" ili="ili-30-80001224-n">
			<Definition>
				
			</Definition>
		</Synset>

Here we should warn something like 'Definition for synset {ID} is empty, better to omit'.

  2. Definition has whitespace before and after:
		<Synset id="spa-30-80001223-n" ili="ili-30-80001223-n">
			<Definition>
				Pequeña malformación que causa la dilatación y fragilidad vascular del colon, dando como resultado una pérdida intermitente de sangre desde el tracto intestinal.
			</Definition>
		</Synset>

Maybe here we should warn something like 'Definition for synset {ID} contains unnecessary whitespace'.
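As a rough illustration of the two checks proposed above, here is a minimal sketch using only Python's standard library. It is not how Wn's validator is implemented; the function name `check_text_elements`, the shortened sample definition, and the third synset ID are hypothetical, and only the warning wording is taken from this issue.

```python
import xml.etree.ElementTree as ET

# Hypothetical WN-LMF fragment modeled on the two cases in this issue;
# the definition text is shortened and the last synset is invented.
SAMPLE = """<Lexicon>
  <Synset id="spa-30-80001224-n" ili="ili-30-80001224-n">
    <Definition>
    </Definition>
  </Synset>
  <Synset id="spa-30-80001223-n" ili="ili-30-80001223-n">
    <Definition>
      Malformación vascular del colon.
    </Definition>
  </Synset>
  <Synset id="example-ok-n">
    <Definition>clean text</Definition>
  </Synset>
</Lexicon>"""


def check_text_elements(root, tags=("Definition", "Example")):
    """Yield (synset_id, tag, message) for empty or padded text content."""
    for synset in root.iter("Synset"):
        sid = synset.get("id")
        for tag in tags:
            for elem in synset.findall(tag):
                text = elem.text or ""
                if not text.strip():
                    # Covers both <Definition/> and whitespace-only content.
                    yield (sid, tag, f"{tag} for synset {sid} is empty, better to omit")
                elif text != text.strip():
                    yield (sid, tag, f"{tag} for synset {sid} contains unnecessary whitespace")


warnings = list(check_text_elements(ET.fromstring(SAMPLE)))
for _, _, message in warnings:
    print(message)
```

Both checks only report; nothing in the parsed data is modified, so a validator built this way would not change what gets stored.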

@jmccrae should we add documentation to the https://github.com/globalwordnet/schemas saying the best practice is not to pad, and to omit empty definitions (and examples), or is this too obvious?

@goodmami should we strip the text of padding before adding it to the database?

fcbond added the enhancement and good first issue labels on Nov 7, 2021
@goodmami
Owner

goodmami commented Nov 7, 2021

Are you talking about warning during validation or during normal use of Wn? To me, the former seems acceptable but not the latter, as this is just bad formatting and not something that makes the data less correct or usable.

In Wn, I try to store in the database an accurate representation of what was in the WN-LMF file, such that exporting the data would result in an equivalent WN-LMF file, so I don't think stripping the definitions is a good solution. However, it would be fine with me if OMW wanted to fix these things during the compilation of its wordnets.

@fcbond
Collaborator Author

fcbond commented Nov 7, 2021 via email

goodmami changed the title from "How to deal with whitespace padding?" to "Validate for empty or ill-formatted definitions and examples" on Nov 7, 2021
@goodmami
Owner

goodmami commented Nov 7, 2021

> I was indeed thinking of warning during validation.

Ok, good. It wasn't clear, so I changed the title. We could also check for similar whitespace issues in other elements like <ILIDefinition>, <Count>, <Tag>, and <Pronunciation>, or in attribute values, like for writtenForm or subcategorizationFrame.
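To make the broader check concrete, here is a small sketch that looks for padding in the text of several elements and in selected attribute values. The element and attribute names come from the comment above; the function `whitespace_issues` and the sample entry are hypothetical and do not reflect Wn's actual validation code.

```python
import xml.etree.ElementTree as ET

# Names taken from the comment above; tuples keep the check order stable.
TEXT_ELEMENTS = ("Definition", "Example", "ILIDefinition", "Count", "Tag", "Pronunciation")
ATTRIBUTES = ("writtenForm", "subcategorizationFrame")

# Hypothetical entry with a padded attribute value and a padded element text.
SAMPLE = """<Lexicon>
  <LexicalEntry id="w1">
    <Lemma writtenForm=" perro " partOfSpeech="n">
      <Pronunciation> pero</Pronunciation>
    </Lemma>
    <Sense id="s1" synset="ss1">
      <Count>12</Count>
    </Sense>
  </LexicalEntry>
</Lexicon>"""


def whitespace_issues(root):
    """Return (tag, attribute_or_None, raw_value) for every padded value."""
    issues = []
    for elem in root.iter():  # document order
        for name in ATTRIBUTES:
            value = elem.get(name)
            if value is not None and value != value.strip():
                issues.append((elem.tag, name, value))
        if elem.tag in TEXT_ELEMENTS:
            text = elem.text or ""
            if text != text.strip():
                issues.append((elem.tag, None, text))
    return issues


issues = whitespace_issues(ET.fromstring(SAMPLE))
for tag, attribute, value in issues:
    where = f"{tag}/@{attribute}" if attribute else tag
    print(f"{where}: {value!r}")
```

One caveat: an XML parser already turns literal tabs and newlines inside attribute values into spaces before the application sees them, so for attributes this kind of check mostly catches leading and trailing spaces.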

> I think returning \n\t\t\n\t\t\n for the definition of a synset, rather than None, is less correct and does make it less usable. However, as you say, the ideal time to catch this is when the wordnet is made, not when we load.

Right. It's less correct for the language, but it's an accurate representation of what's in the data. I don't think Wn should be deciding what it thinks a language should look like. The data should do that.

@francis-dion

My understanding is that, in XML, white space after the opening tag and before the closing tag should be ignored.
I didn't trace the original specs, but found multiple references, including this one from Adobe:
> XML ignores the first sequence of white space immediately after the opening tag and the last sequence of white space immediately before the closing tag. XML translates non-space characters (tab and new-line) into a space character and consolidates all multiple space characters into a single space

If the author of a wordnet wants/needs white space preserved, they should use the xml:space attribute. Here's a quote from O'Reilly's XML Pocket Reference:
> When xml:space is used on an element with a value of preserve, the whitespace in that element's content must be preserved as is by the application that processes it. The whitespace is always passed on to the processing application, but xml:space provides the application with a hint regarding how to process it.

Otherwise, I believe leading/trailing white space should definitely be stripped. I also think (albeit less strongly :-) that wn should translate non-space whitespace characters (tab and new-line) into a space character and consolidate multiple space characters into a single space.
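For reference, the stripping and consolidation described here could look roughly like the sketch below. The helper name `normalize_whitespace` is hypothetical, and restricting the pattern to the four XML whitespace characters (space, tab, carriage return, line feed) is an assumption, chosen so that characters such as a no-break space are left alone.

```python
import re

# XML whitespace only: space, tab, carriage return, line feed.
_XML_WS = re.compile(r"[ \t\r\n]+")


def normalize_whitespace(text):
    """Collapse each run of XML whitespace to one space, then trim both ends."""
    return _XML_WS.sub(" ", text).strip(" ")


padded = "\n\t\t\tPequeña  malformación\n\tvascular\n\t\t"
print(repr(normalize_whitespace(padded)))           # 'Pequeña malformación vascular'
print(repr(normalize_whitespace("\n\t\t\n\t\t\n")))  # ''
```

With this approach a whitespace-only definition normalizes to the empty string, which a caller could then choose to treat the same as a missing definition.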

@goodmami
Owner

Thanks, @francis-dion, that's a good point. I'd forgotten about xml:space. The W3C spec says this about the default value:

> The value "default" signals that applications' default white-space processing modes are acceptable for this element; the value "preserve" indicates the intent that applications preserve all the white space.

So when xml:space is not specified, it's not that the spacing should be stripped, but that the application should use its default whitespace processing. So, yes, Wn could strip (and normalize) whitespace if xml:space is not present. One issue is if a wordnet author wishes to preserve whitespace. Obviously the answer is to use xml:space on the element, but the WN-LMF spec needs to declare the attribute for it to be used. From the same W3C spec:

> In valid documents, this attribute, like any other, MUST be declared if it is used.
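A sketch of what "normalize unless xml:space says preserve" might look like with Python's standard library follows. It assumes the attribute is allowed on `<Definition>`, which, as noted above, the WN-LMF DTD would first have to declare; ElementTree does not validate, so it accepts the attribute regardless. The function `collect_definitions` and the sample synsets are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree exposes xml:space under the predefined XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
_XML_WS = re.compile(r"[ \t\r\n]+")

# Hypothetical synsets: one with default handling, one asking to preserve.
SAMPLE = """<Lexicon>
  <Synset id="a-n">
    <Definition>
      padded text
    </Definition>
  </Synset>
  <Synset id="b-n">
    <Definition xml:space="preserve">  keep   me  </Definition>
  </Synset>
</Lexicon>"""


def collect_definitions(elem, preserve=False, synset_id=None, out=None):
    """Return [(synset_id, text)], normalizing unless xml:space="preserve" applies.

    The xml:space setting is inherited by descendants until overridden.
    """
    if out is None:
        out = []
    space = elem.get(XML_SPACE)
    if space == "preserve":
        preserve = True
    elif space == "default":
        preserve = False
    if elem.tag == "Synset":
        synset_id = elem.get("id")
    if elem.tag == "Definition":
        text = elem.text or ""
        if not preserve:
            text = _XML_WS.sub(" ", text).strip(" ")
        out.append((synset_id, text))
    for child in elem:
        collect_definitions(child, preserve, synset_id, out)
    return out


definitions = collect_definitions(ET.fromstring(SAMPLE))
for sid, text in definitions:
    print(sid, repr(text))
```

Here the application's "default" processing is taken to mean strip and collapse, which is only one possible choice; the spec leaves that decision to the application.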
