Merged synsets are lost in translation #179
Comments
Thanks, I think I see the problem, but let me make sure I got it right: there is a gap in ILI-based translation coverage when the target synset (and thus its ILI) has been merged into another. In this case, PWN 3.0 (and OMW lexicons expanded from it) have two synsets, but in PWN 3.1 and OEWN they are merged into a single synset.
There seems to be a mistaken assumption here. Wn does not use the ILI mappings that you are referring to. The only resource from https://github.com/globalwordnet/cili/ that it uses (and only if you've downloaded it) is the released CILI inventory which includes the ILI identifiers and definitions. Inter-lexicon relationships via shared ILIs are identified only by the
Therefore, I disagree that there is something here incorrect in Wn, but I do recognize how things could be improved. A satisfactory solution to this issue is thus not so much a bug fix as a new feature: to store (or identify) and subsequently use changes to synset-ILI mappings across versions. This sounds appealing but I also feel like it will be hard to do correctly in a transparent fashion (e.g., when calling
You mean to look for senses with the same sense keys across lexicons? That might work to build the merge-mapping yourself, but it wouldn't be a solution in general because senses link synsets to words and therefore non-English lexicons should have different sense keys (but more likely they do not have them at all). Here's how you could build the mapping:
>>> import wn
>>> en30 = wn.Wordnet('omw-en')
>>> en31 = wn.Wordnet('omw-en31')
>>> en31_sensekey_ili_map = {
...     s.metadata()['identifier']: s.synset().ili
...     for s in en31.senses()
... }
>>> en30_31_ilis = {ss.ili.id: set() for ss in en30.synsets()}
>>> for s in en30.senses():
...     ili = en31_sensekey_ili_map.get(s.metadata()['identifier'])
...     if ili:
...         en30_31_ilis[s.synset().ili.id].add(ili.id)
...
>>> en30_31_ilis['i37881']
{'i37882'}
>>> en30_31_ilis['i37882']
{'i37882'}
This mapping is unidirectional, PWN 3.0 to PWN 3.1, but maybe it is useful nonetheless. |
Thanks @goodmami, I have corrected the formulation, since I don't want to imply that something is wrong with Wn. On the other hand, there is a problem in Wn, due to the way that the CILI mappings are applied, but I realize that this happens in OMW-data, when building the LMF databases. |
@ekaf I think the CILI as a resource is better thought of as the inventory of identifiers and their definitions than as a collection of mappings. The mappings to synsets should be maintained by the respective wordnet projects, although in practice we keep some mapping files in the CILI repository. Those mapping files are used when creating the WN-LMF exports of the PWN.
Let me try to describe ILI support in Wn. WN-LMF lexicons can link synsets to individual ILIs like this (example from OEWN 2021):
<Synset id="oewn-15307914-n" ili="i117563" members="oewn-speed-n oewn-velocity-n" partOfSpeech="n" dc:subject="noun.time">
                             ~~~~~~~~~~~~~
These ILIs are stored in Wn's database linked to the synsets. When a second lexicon is loaded containing synsets with the same ILIs, such as this (from OMW 1.4's Spanish wordnet):
<Synset id="omw-es-15282696-n" ili="i117563" partOfSpeech="n" members="omw-es-velocidad-15282696-n" />
                               ~~~~~~~~~~~~~
... then Wn is able to use the shared ILI to link the synsets across lexicons for translation or expanded relation traversal.
Another thing we see is synsets with the special ILI
<Synset id="oewn-90002921-n" ili="in" members="oewn-snow_day-n" partOfSpeech="n" dc:subject="noun.time" dc:source="Colloquial WordNet">
                             ~~~~~~~~
These proposed ILIs are not used for translation or expanded relation traversals.
In Wn, the ILIs are represented by a class with an id, a status, and a definition. For example (here, the
>>> import wn
>>> oewn = wn.Wordnet('oewn')
>>> oewn.synsets('velocity')[0].ili.id # an explicit ID
'i117563'
>>> oewn.synsets('velocity')[0].ili.status
'presupposed'
>>> oewn.synsets('velocity')[0].ili.definition()
>>> oewn.synsets('snow day')[0].ili.id # ili="in" is special and the ID is None in Wn
>>> oewn.synsets('snow day')[0].ili.status
'proposed'
>>> oewn.synsets('snow day')[0].ili.definition()
'a day on which school or other events are cancelled due to snow'
Note:
When the
>>> wn.download('cili')
...
>>> oewn.synsets('velocity')[0].ili.status
'active'
>>> oewn.synsets('velocity')[0].ili.definition()
'distance travelled per unit time'
The
Does that help? |
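The shared-ILI linking described in the comment above can be illustrated with a small standalone sketch. It does not use Wn itself: the function names `build_ili_index` and `translate` are hypothetical, and the data is limited to the three synsets quoted in the XML examples. It only shows the idea that synsets are linked across lexicons when they share an explicit ILI, while the proposed `ili="in"` value is never used for linking.

```python
from collections import defaultdict

# Toy data: (lexicon, synset_id, ili) triples taken from the XML examples
# quoted in the thread. This is an illustration, not Wn's database schema.
SYNSETS = [
    ("oewn", "oewn-15307914-n", "i117563"),
    ("omw-es", "omw-es-15282696-n", "i117563"),
    ("oewn", "oewn-90002921-n", "in"),  # proposed ILI, no real identifier
]


def build_ili_index(synsets):
    """Group (lexicon, synset_id) pairs by ILI, skipping proposed or missing ILIs."""
    index = defaultdict(list)
    for lexicon, synset_id, ili in synsets:
        if ili and ili != "in":
            index[ili].append((lexicon, synset_id))
    return index


def translate(synset_id, target_lexicon, synsets):
    """Return synset IDs in target_lexicon that share an explicit ILI with synset_id."""
    ili_of = {sid: ili for _, sid, ili in synsets}
    index = build_ili_index(synsets)
    shared = index.get(ili_of.get(synset_id), [])
    return [sid for lexicon, sid in shared if lexicon == target_lexicon]


spanish = translate("oewn-15307914-n", "omw-es", SYNSETS)
snow_day = translate("oewn-90002921-n", "omw-es", SYNSETS)
```

Under this scheme a synset whose ILI is absent from the target lexicon simply gets an empty result, which is the situation the issue describes for merged synsets.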
Thanks @goodmami, yes your explanations help a lot indeed.
Now, the mapping can provide a translation for this Finnish synset, which has none using Wn's translate() function.
So in Wn at present, we have to go through sense-key mappings in order to avoid this problem. I suppose there could be a more direct way to use the CILI mappings, without necessarily losing synsets in the translation, since CILI contains information about the merged synsets. But even then, it remains to be seen whether ILI mappings can match the performance of sense-key mappings. |
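As a rough sketch of the fallback described in the comment above, the sense-key-derived mapping can be consulted when the source ILI has no synset in the target lexicon. The function name `translate_ili` and the toy data are assumptions for illustration; the mapping has the same shape as `en30_31_ilis` from the earlier comment (old ILI to a set of newer ILIs), and nothing here is part of Wn's API.

```python
def translate_ili(source_ili, ilimap, target_ilis):
    """Return the ILIs in the target lexicon that correspond to source_ili.

    ilimap:      dict mapping an older ILI to the set of ILIs it maps to
                 in the newer wordnet version.
    target_ilis: set of ILIs that have a synset in the target lexicon.
    """
    # Direct match: the ILI is shared, so no mapping is needed.
    if source_ili in target_ilis:
        return {source_ili}
    # Fallback: follow the merge mapping and keep only ILIs the target has.
    return ilimap.get(source_ili, set()) & target_ilis


# Toy data based on the two ILIs discussed in this issue.
ilimap = {"i37881": {"i37882"}, "i37882": {"i37882"}}
target_ilis = {"i37882"}  # the merged synset is all the newer lexicon has

merged_result = translate_ili("i37881", ilimap, target_ilis)
direct_result = translate_ili("i37882", ilimap, target_ilis)
```

With this fallback, the merged source ILI resolves to the surviving target ILI instead of yielding an empty translation.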
As @goodmami wrote:
Yes, the inverse problem is that currently, when translating in the opposite direction, Wn only returns one of the merged synsets:
In that case, the complete translation would be the union of the senses belonging to all the synsets obtained by reversing the ilimap from above:
|
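The reverse direction mentioned in the comment above might be sketched as follows. This is an assumption-laden illustration rather than the original snippet: `invert_ilimap` and `merged_translation` are hypothetical helpers, `senses_by_ili` stands in for a lookup of senses per ILI in the older lexicon, and the sense listed under i37882 is a made-up placeholder.

```python
from collections import defaultdict


def invert_ilimap(ilimap):
    """Turn {old_ili: {new_ili, ...}} into {new_ili: {old_ili, ...}}."""
    inverse = defaultdict(set)
    for old_ili, new_ilis in ilimap.items():
        for new_ili in new_ilis:
            inverse[new_ili].add(old_ili)
    return dict(inverse)


def merged_translation(target_ili, ilimap, senses_by_ili):
    """Union of the senses of every older synset merged into target_ili."""
    senses = set()
    for old_ili in invert_ilimap(ilimap).get(target_ili, ()):
        senses.update(senses_by_ili.get(old_ili, ()))
    return senses


ilimap = {"i37881": {"i37882"}, "i37882": {"i37882"}}
senses_by_ili = {
    # Sense IDs reported for i37881 in this issue.
    "i37881": ["omw-fi-baseball-00471613-n", "omw-fi-baseball--peli-00471613-n"],
    # Placeholder sense ID, invented for the example.
    "i37882": ["placeholder-sense-for-i37882"],
}

all_senses = merged_translation("i37882", ilimap, senses_by_ili)
```

The result collects senses from both merged source synsets rather than from only one of them.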
@ekaf Can you remind me what is the expected fix here? Currently I'm leaning toward saying this is a data challenge (best solved with documentation) and not a bug or missing feature in the code, but maybe you have something in mind that would be appropriate for this library. |
There are relatively few (around 30) merged synsets between each English Wordnet version, so losing 30 synsets in translation may not seem a huge problem. However, it is not solved with documentation alone, and a solution in the library appears more helpful. Since version 3.6.6 (see nltk/#2889), NLTK's wordnet.py library produces a sense-key based mapping "on the fly", at load time, preventing this problem from ever occurring. A similar approach can work in Wn, using code like in the comment above. An alternative could be if the ILI project also produces lists of merged synsets, with one (or more) synset(s) deprecated and linked to a target synset. This approach is less versatile, because each future English Wordnet needs a separate list of deprecations: you would have to wait for such lists to be produced, then rely on their adequacy, and still need additional code to interpret the deprecations in Wn. |
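To check how many synsets are affected between two versions, one could scan a sense-key-derived mapping for targets that have more than one source. The sketch below assumes a mapping shaped like `en30_31_ilis` from the earlier comment; `find_merges` is a hypothetical helper and the data is a toy excerpt, so it says nothing about the real count.

```python
from collections import defaultdict


def find_merges(ilimap):
    """Return {new_ili: sorted old ILIs} for every target with 2+ distinct sources."""
    sources = defaultdict(set)
    for old_ili, new_ilis in ilimap.items():
        for new_ili in new_ilis:
            sources[new_ili].add(old_ili)
    return {
        new_ili: sorted(old_ilis)
        for new_ili, old_ilis in sources.items()
        if len(old_ilis) > 1
    }


# Toy excerpt: two ILIs merged into one, plus one unchanged ILI.
ilimap = {
    "i37881": {"i37882"},
    "i37882": {"i37882"},
    "i117563": {"i117563"},
}

merges = find_merges(ilimap)
```

A list produced this way could also serve as the kind of deprecation list discussed above, although it would still depend on sense keys being available in both lexicons.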
@ekaf thank you for explaining. I'm not entirely sold on this solution because it encodes lexicon-specific information (the sense keys and where they are stored), which are really only relevant for the English wordnets, and I strive as much as possible for Wn to not favor any particular wordnet or language (with the exception of the included Morphy lemmatizer). That said, so many wordnets are based on the English structure that it might make sense for practicality to beat purity here. The ILI solution would be more "pure", but, as you describe, that approach has other issues. @fcbond, I'd like to get your perspective. Should Wn codify English-specific workarounds for merged synsets across wordnet versions? Or maybe the problem is rare enough that some documentation of the problem with a recipe for getting around it would suffice? |
Describe the bug
Wn loses some merged synsets in translation, even though the original CILI mappings correctly link the merged source synsets to the same target synset.
To Reproduce
For example, consider these two synsets in the ili-map-pwn31.tab mapping, which map to the same PWN 3.1 target:
With Wn, the first synset (i37881) has no translation in OEWN, although it should have one if i37881 were mapped to i37882:
i37881, [Sense('omw-fi-baseball-00471613-n'), Sense('omw-fi-baseball--peli-00471613-n')], []
So the translation above is just the empty list ([]).
By contrast, the other merged synset translates correctly:
The same problem occurs with any other merged synsets.
Expected behavior
The first synset (i37881) would have a translation in OEWN if the CILI mapping were used as intended.
Environment
Python 3.9.2
Wn 0.9.2
oewn 2021 [en] Open English WordNet
omw-en 1.4 [en] OMW English Wordnet based on WordNet 3.0
omw-cmn 1.4 [cmn-Hans] Chinese Open Wordnet
omw-es 1.4 [es] Multilingual Central Repository (Spanish)
omw-lt 1.4 [lt] Lithuanian WordNet
omw-pt 1.4 [pt] OpenWN-PT
omw-id 1.4 [id] Wordnet Bahasa (Indonesian)
omw-he 1.4 [he] Hebrew Wordnet
omw-eu 1.4 [eu] Multilingual Central Repository (Basque)
omw-sq 1.4 [sq] Albanet
omw-zsm 1.4 [zsm] Wordnet Bahasa (Malaysian)
omw-arb 1.4 [arb] Arabic WordNet (AWN v2)
omw-ca 1.4 [ca] Multilingual Central Repository (Catalan)
omw-fi 1.4 [fi] FinnWordNet
omw-sv 1.4 [sv] WordNet-SALDO
omw-gl 1.4 [gl] Multilingual Central Repository (Galician)
omw-el 1.4 [el] Greek Wordnet
omw-pl 1.4 [pl] plWordNet
omw-iwn 1.4 [it] ItalWordNet
omw-ro 1.4 [ro] Romanian Wordnet
omw-nl 1.4 [nl] Open Dutch WordNet
omw-ja 1.4 [ja] Japanese Wordnet
omw-fr 1.4 [fr] WOLF (Wordnet Libre du Français)
omw-sk 1.4 [sk] Slovak WordNet
omw-is 1.4 [is] IceWordNet
omw-it 1.4 [it] MultiWordNet (Italian)
omw-hr 1.4 [hr] Croatian Wordnet
omw-th 1.4 [th] Thai Wordnet
omw-bg 1.4 [bg] BulTreeBank Wordnet (BTB-WN)
omw-nb 1.4 [nb] Norwegian Wordnet (Bokmål)
omw-da 1.4 [da] DanNet
omw-nn 1.4 [nn] Norwegian Wordnet (Nynorsk)
omw-sl 1.4 [sl] sloWNet
Additional Context
At this moment, using the PWN sense keys for translation seems to be the only way to bypass the problem in Wn. However, this is not easy, since a rather big detour is necessary to obtain the sense keys in the 'oewn' lexicon.