missing word exceptional forms #175

Closed
fredsonaguiar opened this issue Jun 27, 2021 · 14 comments

Comments

@fredsonaguiar

In https://github.com/bond-lab/omw-data/blob/9f2df85bbbab39370e265a2e2d90d95b6d015f04/wns/pwn30/wn30.xml.xz, one can find Form items describing irregular inflections of some words, such as rami for ramus.

Just reporting for now; this kind of information isn't present here and might be useful in the future.

@arademaker
Member

The *.exc files from the PWN 3.0 distribution were not used in our code to generate the RDF representation that we used (based on https://www.w3.org/TR/wordnet-rdf/). But I think we can easily add them to the OWN-EN RDF using an extra data property Word -> String.

@arademaker arademaker added this to the pre release 1.0 milestone Jun 28, 2021
@arademaker
Member

Documentation about those files:

  1. https://wordnet.princeton.edu/documentation/morphy7wn
  2. https://wordnet.princeton.edu/documentation/wndb5wn see Exception List File Format
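As a rough illustration of the exception list format described in the second link: each line holds an inflected form followed by one or more base forms, separated by spaces. The sketch below parses such lines. The function name parse_exc_lines is hypothetical and is not taken from the project's code.

```python
def parse_exc_lines(lines):
    """Parse lines of a WordNet .exc file into (inflected, [base forms]) pairs."""
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            # Skip blank or malformed lines (no base form given).
            continue
        inflected, bases = fields[0], fields[1:]
        entries.append((inflected, bases))
    return entries


# Sample lines in the .exc layout; the last one is blank on purpose.
sample = ["rami ramus", "beeves beef", "yclept clepe", ""]
entries = parse_exc_lines(sample)
for inflected, bases in entries:
    print(inflected, "->", ", ".join(bases))
```

Keeping the base forms as a list leaves room for lines that give more than one base form.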

@fredsonaguiar
Author

While doing so, I learned that not all exceptions described in the .exc files could be mapped. For instance, yclept clepe, upswollen upswell and underpropped underprop were not mapped, along with others, for a total of 1254 cases.

This happens because the target lemmas are not defined in the wordnet. In these examples, the forms clepe, upswell and underprop are not defined in OWN-EN.

@fredsonaguiar changed the title from "Missing Forms" to "missing word exceptional forms" on Jun 29, 2021
@fredsonaguiar
Author

We will be able to close this only after #177 is resolved.

@fredsonaguiar
Author

fredsonaguiar commented Jul 1, 2021

In 7e54978 we fix that by taking into account the new words with the property wn30:pos. We do so by running this script.

Command and output:

python3 pyownpt/cli/morpho_exceptions.py own-files/own-en-words.ttl WordNet-3.0/dict/ -o own-en-words.ttl -v
INFO:root:loading data from file 'openWordnet-PT/own-files/own-en-words.ttl'
INFO:root:loading data from file 'openWordnet-PT/WordNet-3.0/dict/adj.exc'
INFO:root:loading data from file 'openWordnet-PT/WordNet-3.0/dict/adv.exc'
INFO:root:loading data from file 'openWordnet-PT/WordNet-3.0/dict/noun.exc'
INFO:root:loading data from file 'openWordnet-PT/WordNet-3.0/dict/verb.exc'
INFO:ownpt:processing 1490 exceptions with pos 'a'
INFO:ownpt:processing 7 exceptions with pos 'r'
INFO:ownpt:processing 2054 exceptions with pos 'n'
INFO:ownpt:processing 2401 exceptions with pos 'v'
INFO:ownpt:action applied to 6053 cases
INFO:ownpt:action applied to 6053 cases
	total: 4464 triples added
	total: 4467 exceptions processed
	total: 1586 exceptions not processed
INFO:ownpt:after action, 4464 triples were added
INFO:root:serializing output to 'own-en-words.ttl'

@arademaker
Member

The number of exceptions in the output is different from the previous comment?

Can you list here one example of the result? I could not see the new file, but I am expecting that own-en and own-pt now have words like

word-dog-v
word-dog-n

Is that right? How were the exceptions added?

@arademaker
Member

arademaker commented Jul 1, 2021

So far we have used lexicalForm (https://github.com/own-pt/openWordnet-PT/blob/master/wn30.ttl#L493), but there are alternatives:

  1. https://github.com/globalwordnet/schemas/blob/master/example.ttl#L106. So a lexicalEntry has a canonicalForm and otherForm. Both as entities that have writtenRep.
  2. https://github.com/globalwordnet/schemas/blob/master/example.xml#L127. Here a lexicalEntry has a lemma and one or more Form. Both with the writtenForm attribute. But DTD 1.1 globalwordnet/schemas#52?

Considering the current one (the first below), I think we can't use lexicalForm anymore, because the exceptionalForm is a lexicalForm too. Besides that, exceptional means unusual or outstanding. It makes sense for the original PWN if all other regular inflections are considered the usual or normal ones: dogs is not exceptional and is produced automatically, while the *.exc files contain only the unusual forms. But in the RDF, canonical vs. other, or lemma vs. other, may be more informative as a property of a Word?

<https://w3id.org/own-pt/wn30-en/instances/word-beef-n> a wn30:Word ;
    wn30:exceptionalForm "beeves"@en ;
    wn30:lexicalForm "beef"@en ;
    wn30:pos "n" .

<https://w3id.org/own-pt/wn30-en/instances/word-beef-n> a wn30:Word ;
    wn30:otherForm "beeves"@en ;
    wn30:canonicalForm "beef"@en ;
    wn30:pos "n" .

<https://w3id.org/own-pt/wn30-en/instances/word-beef-n> a wn30:Word ;
    wn30:otherForm "beeves"@en ;
    wn30:lemma "beef"@en ;
    wn30:pos "n" .

@arademaker arademaker reopened this Jul 1, 2021
@arademaker
Member

Just to make sure I get your input and that we decide on the property names.

@fredsonaguiar
Author

Sure. It's important to have informative names for the properties. In https://wordnet.princeton.edu/documentation/wndb5wn, where the .exc files are described, they are first introduced as:

noun.exc, verb.exc. adj.exc adv.exc - morphology exception lists

I mean, although those exceptions are natural in the language, and should be understood simply as other forms, those cases are still exceptional in the morphological sense. This way of thinking may justify the description above.

@arademaker
Member

For now, let us use the LMF DTD as a reference: wn30:lemma, plus wn30:altForm (after https://www.w3.org/TR/swbp-skos-core-spec/#altLabel) for the exceptions. Does that work for you? If not, would wn30:otherForm work? I would be fine with either.

@arademaker
Member

Please data need to be fixed and the wn30.ttl vocabulary too.

@fredsonaguiar
Author

The number of exceptions in the output is different from the previous comment?

Yes. This was expected: once we consider parts of speech when adding the morphological exceptions (in 156d2e1), the number of exceptions not processed should be greater than, or at least the same as, before.

In the first case, not considering pos, we had 1254 cases not applied. After considering pos, we had 1586 cases not applied. The 332 new cases (1586 - 1254) arise because, even if there is a Word with the suitable lemma, it is not guaranteed to have the suitable pos too.

For instance, the exception wildcatting wildcat from verb.exc was applied before we had information about pos. After splitting words by pos it was not applied, because the word word-wildcat-v is not defined; only the word word-wildcat-n is.
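The effect can be reproduced on toy data. The sketch below is not the project's script: the word set is a made-up stand-in for OWN-EN, and the variable names are hypothetical. It compares which exceptions go unmatched under lemma-only matching and under lemma plus pos matching.

```python
# Toy stand-in for the words defined in OWN-EN: (lemma, pos) pairs.
words = {("wildcat", "n"), ("zip", "v"), ("beef", "n")}

# Toy exceptions: (pos of the .exc file, inflected form, base form).
exceptions = [
    ("v", "wildcatting", "wildcat"),
    ("v", "zipping", "zip"),
    ("n", "beeves", "beef"),
    ("v", "yclept", "clepe"),
]

lemmas = {lemma for lemma, _ in words}

# Lemma-only matching: the base form just has to exist with any pos.
missed_by_lemma = [e for e in exceptions if e[2] not in lemmas]

# Pos-aware matching: the base form must exist with the pos of the .exc file.
missed_by_pos = [e for e in exceptions if (e[2], e[0]) not in words]

print("lemma only:", missed_by_lemma)
print("lemma + pos:", missed_by_pos)
```

Every exception missed by lemma-only matching is also missed by pos-aware matching, so the second count can never be smaller than the first.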

@fredsonaguiar
Author

Can you list here one example of the result?

Sure. In #177, we discuss expanding words with a new property, wn30:pos. Please take a look at #177 (comment).

After that comes #175 (comment): we use the property wn30:pos to decide which word a new exception applies to.

For instance, for the exception zipping zip from the file verb.exc, we search for a word with wn30:lexicalForm "zip" and wn30:pos "v". Once it's found, we add a triple:

<https://w3id.org/own-pt/wn30-en/instances/word-zip-v> wn30:exceptionalForm "zipping"@en

Another example: for the exception wildcatting wildcat from the file verb.exc, we search for a word with wn30:lexicalForm "wildcat" and wn30:pos "v". If none is found, we log a warning:

WARNING:ownpt:could not process exception:v: wildcatting wildcat
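The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code of morpho_exceptions.py: the function exception_triple is hypothetical, the set of (lemma, pos) pairs stands in for a real query over the graph, and the word-{lemma}-{pos} URI pattern is assumed from the examples in this thread.

```python
import logging

logger = logging.getLogger("ownpt")
BASE = "https://w3id.org/own-pt/wn30-en/instances/"


def exception_triple(words, pos, inflected, lemma):
    """Return a Turtle triple for the exception, or None if no word matches."""
    if (lemma, pos) not in words:
        # No word with this lemma and pos: report it and add nothing.
        logger.warning("could not process exception:%s: %s %s", pos, inflected, lemma)
        return None
    return f'<{BASE}word-{lemma}-{pos}> wn30:exceptionalForm "{inflected}"@en .'


# Toy word set (assumption): only zip as a verb and wildcat as a noun exist.
words = {("zip", "v"), ("wildcat", "n")}
found = exception_triple(words, "v", "zipping", "zip")
missing = exception_triple(words, "v", "wildcatting", "wildcat")
print(found)
print(missing)
```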

@fredsonaguiar
Author

fredsonaguiar commented Jul 1, 2021

Please data need to be fixed and the wn30.ttl vocabulary too.

We use sed, changing wn30:lexicalForm -> wn30:lemma and wn30:exceptionalForm -> wn30:otherForm:

sed "s/wn30:lexicalForm/wn30:lemma/g" -i wn30.ttl own-files/*
sed "s/wn30:exceptionalForm/wn30:otherForm/g" -i wn30.ttl own-files/*

The changes are in 7c9cd93 and 1a7e215.
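To see the effect of the two substitutions on a single entry, here is a small sketch that applies the same renames to a Turtle snippet held in a string. The helper rename_properties is hypothetical; the actual change was made with the sed commands above.

```python
RENAMES = {
    "wn30:lexicalForm": "wn30:lemma",
    "wn30:exceptionalForm": "wn30:otherForm",
}


def rename_properties(turtle_text):
    """Apply the property renames to a Turtle document given as a string."""
    for old, new in RENAMES.items():
        turtle_text = turtle_text.replace(old, new)
    return turtle_text


before = (
    "<https://w3id.org/own-pt/wn30-en/instances/word-beef-n> a wn30:Word ;\n"
    '    wn30:exceptionalForm "beeves"@en ;\n'
    '    wn30:lexicalForm "beef"@en ;\n'
    '    wn30:pos "n" .\n'
)
after = rename_properties(before)
print(after)
```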


2 participants