WN-MCR-Transform

This script transforms the Multilingual Central Repository (MCR) 3.0 database so that it can be loaded using the NLTK WordNet reader.

Changes from Upstream

This lingeringsocket fork includes the following changes:

Transformation tweaks to make the result compatible with extjwnl
Regenerated data based on the 2016 release of MCR 3.0 (which now includes Portuguese as well)
Added a mapping file (ili.csv) from synset 0-based line numbers to Princeton WordNet 3.0 offsets

Transforming the MCR 3.0 corpus

The transformed data is available in the compressed files in this repository, one per language available in MCR, so you can download and use them directly. If you want to generate them yourself, do the following.

Download the needed files:

The MCR 3.0 corpus
The NLTK toolkit
The WordNet corpus for NLTK, which can be downloaded using nltk.download() after NLTK is installed.

Then:

Find your WordNet 3.0 database files. They are typically found in ~/nltk_data/corpora/wordnet, but the location may vary in your installation. We will refer to the WordNet 3.0 database folder as WORDNET_EN_ROOT, and to the folder that holds the transformed files as RESULT_ROOT.
Extract the MCR 3.0 files to a folder. This will create a series of folders containing WordNet versions for different languages. We will refer to this folder as MCR_ROOT.
Run the command:

$ ./generate_all.sh <path to MCR_ROOT> <path to WORDNET_EN_ROOT>

Note: this script calls the transform.py Python script, which supports both Python 2.7.x and Python 3.x.

After the command completes, the data.* and index.* files contained in RESULT_ROOT will have been replaced with new versions containing the MCR 3.0 information for the desired language. The languages available so far in MCR and their codes are:

Catalan - cat
English - eng
Euskara - eus
Galician - glg
Portuguese - por
Spanish - spa
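
As a quick sanity check before or after the transformation, a folder can be inspected for the database files the reader expects. The following is a minimal sketch, assuming the folder follows the standard WordNet 3.0 layout (one data.* and one index.* file for each of noun, verb, adj and adv). The function name missing_database_files is illustrative and is not part of this repository.

```python
import os

# Assumption: the standard WordNet 3.0 layout, with one data.* and one
# index.* file per part of speech.
PARTS_OF_SPEECH = ("noun", "verb", "adj", "adv")


def missing_database_files(root):
    """Return the expected data.* and index.* file names absent from root."""
    expected = [
        "{}.{}".format(prefix, pos)
        for prefix in ("data", "index")
        for pos in PARTS_OF_SPEECH
    ]
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(root, name))
    ]


if __name__ == "__main__":
    # Typical NLTK location; adjust to your WORDNET_EN_ROOT or RESULT_ROOT.
    root = os.path.expanduser("~/nltk_data/corpora/wordnet")
    print(missing_database_files(root))
```

An empty list means every expected file is present in the folder.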

Using the transformed MCR 3.0 corpus

With Python:

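from nltk.corpus.reader.wordnet import WordNetCorpusReader
# The second argument is the Open Multilingual WordNet reader, which is not needed here.
wncr = WordNetCorpusReader("<path to RESULT_ROOT>", None)
# In NLTK 3.x, definition is a method and must be called.
print(wncr.synset("entidad.n.01").definition())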

Exporting and importing the glosses

MCR is a work in progress and not all of its contents have been fully translated. This is especially true of glosses: for example, only around 15% of the glosses in the Spanish version have been translated. For some applications this might be an issue. However, if you have another source of glosses in your language (for example, a machine translation process), you can import that data so that it is merged with the MCR 3.0 data during the transformation.

In a Python shell, import the transform module
Execute the following Python command:

transform.export_glosses(<path to WORDNET_EN_ROOT>, <path to EN_GLOSSES_FILE>)

This creates the file EN_GLOSSES_FILE, which will contain the English glosses for all synsets. The file format is straightforward. Each line contains the gloss for one synset in this format: <synset id> | <gloss>, where <synset id> is a concatenation of the offset in the WordNet 3.0 database file and the part of speech of the synset. For example, the synset corresponding to "entity" in English has this line: 00001740n | that which is perceived...

Translate the glosses using any means you can. As long as the format is honored and the identifiers are kept, the process will be able to get the translated glosses and merge them with the rest of the data. We will assume that you have created a new file TRANSLATED_GLOSSES_FILE containing the translated glosses.
Execute the following shell command:

./transform.py <path to MCR_ROOT> <path to WORDNET_EN_ROOT> <LANGUAGE> <path to RESULT_ROOT> <path to TRANSLATED_GLOSSES_FILE>
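
The translation step can be scripted while keeping the identifiers intact. The following is a minimal sketch, assuming each line has exactly the form described above (identifier, a "|" separator, then the gloss). The function names and the translate callable are hypothetical placeholders for whatever translation process you use; they are not part of transform.py.

```python
def parse_gloss_line(line):
    """Split '<offset><pos> | <gloss>' into (offset, pos, gloss)."""
    synset_id, gloss = line.rstrip("\n").split("|", 1)
    synset_id = synset_id.strip()
    # The identifier is the offset followed by a one-letter part of speech.
    return synset_id[:-1], synset_id[-1], gloss.strip()


def format_gloss_line(offset, pos, gloss):
    """Rebuild a line in the same '<offset><pos> | <gloss>' format."""
    return "{}{} | {}".format(offset, pos, gloss)


def translate_gloss_lines(lines, translate):
    """Yield lines with each gloss replaced by translate(gloss).

    The identifiers are copied unchanged so that the glosses can still be
    matched to their synsets.
    """
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        offset, pos, gloss = parse_gloss_line(line)
        yield format_gloss_line(offset, pos, translate(gloss))


if __name__ == "__main__":
    sample = ["00001740n | that which is perceived...\n"]
    # str.upper stands in for a real translation function.
    for out in translate_gloss_lines(sample, str.upper):
        print(out)
```

Writing the yielded lines to a new file, one per line, would produce a TRANSLATED_GLOSSES_FILE in the expected format.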

Description of the process and limitations

The transformation process is straightforward. The synsets, variants and relations files are loaded from MCR. That information is then used to create data and index files that respect the constraints of the WordNet database files.

In particular, we must respect the constraint that the synset with numeric id XX must start at byte offset XX of the data file. This is mandatory in order to use the NLTK WordNet reader, which relies on this fact to speed up queries. However, this has an unfortunate consequence: the numeric ids in the English and the transformed MCR files do not match. For example, the synset for "dog.n.01" has the offset 02084071 in English, but the synset for "perro.n.01" has the offset 01295142 in the Spanish MCR. This also means that it is no longer possible to match "perro.n.01" back to its MCR identifier spa-30-02084071-n, which points to the English file offset.
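
The offset constraint can be illustrated with a simplified sketch. It assumes each record is a single line that begins with an 8-digit, zero-padded offset; real data file records carry more fields (including pointers to other offsets), and this is not the code used in transform.py. Because the offset field has a fixed width, the length of each line does not depend on the offset value, so offsets can be assigned in a single pass. The names layout_records and read_record are illustrative.

```python
import io

OFFSET_WIDTH = 8  # offsets are written as 8-digit, zero-padded numbers


def layout_records(bodies, start=0):
    """Prefix each record body with the byte offset at which it will start.

    start is the number of bytes that precede the first record (for example
    a header); it defaults to 0 in this sketch.
    """
    records = []
    offset = start
    for body in bodies:
        line = "{:0{w}d} {}\n".format(offset, body, w=OFFSET_WIDTH)
        encoded = line.encode("utf-8")
        records.append((offset, encoded))
        offset += len(encoded)  # the next record starts right after this one
    return records


def read_record(stream, offset):
    """Jump straight to a record using its id as the byte offset."""
    stream.seek(offset)
    return stream.readline()


if __name__ == "__main__":
    records = layout_records(["entidad", "perro can"])
    data = b"".join(line for _, line in records)
    for offset, _ in records:
        print(read_record(io.BytesIO(data), offset))
```

Since the ids are derived from where each record lands in the new file, they necessarily differ from the offsets in the English files whenever the record contents differ.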

The definitions of relations in MCR and WordNet differ. MCR defines more relations, and it is not always easy to know which one corresponds to which. However, the most common relations (such as hyponym/hypernym and meronym/holonym) have been correctly mapped. The mapping from MCR relations to WordNet relations follows (note that some of the mappings could be wrong and may change in the future):

MCR Id	MCR Name LR	MCR Name RL	WN LR	WN Name LR	WN RL	WN Name RL
1	be_in_state	state_of	=	Attribute	=	Attribute
2	causes	is_caused_by	>	cause
4	has_derived	is_derived_from	\	derived from adjective
6	has_holo_madeof	has_mero_madeof	#s	substance holonym	%s	substance meronym
7	has_holo_member	has_mero_member	#m	member holonym	%m	member meronym
8	has_holo_part	has_mero_part	#p	part holonym	%p	part meronym
12	has_hyponym	has_hyperonym	~	hyponym	@	hypernym
19	has_subevent	is_subevent_of	*	entailment
33	near_antonym		!	antonym	!	antonym
34	near_synonym		&	similar	&	similar
49	see_also_wn15		^	also see
52	verb_group		$	verb group	$	verb group
63	category_term	category	-c	member - topic	;c	domain - topic
64	related_to		+	deriv. related form	+	deriv. related form
66	region_term	region	-r	member - region	;r	domain - region
68	usage_term	usage	-u	member - usage	;u	domain - usage
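
For readers who want the table in machine-readable form, the mapping could be written as a Python dictionary. This is only a sketch transcribed from the table above; the names RELATION_POINTERS and wordnet_pointers are illustrative and are not taken from transform.py.

```python
# MCR relation id -> (WordNet pointer symbol LR, WordNet pointer symbol RL).
# None means the table lists no symbol for that direction.
RELATION_POINTERS = {
    1: ("=", "="),      # be_in_state / state_of
    2: (">", None),     # causes / is_caused_by
    4: ("\\", None),    # has_derived / is_derived_from
    6: ("#s", "%s"),    # has_holo_madeof / has_mero_madeof
    7: ("#m", "%m"),    # has_holo_member / has_mero_member
    8: ("#p", "%p"),    # has_holo_part / has_mero_part
    12: ("~", "@"),     # has_hyponym / has_hyperonym
    19: ("*", None),    # has_subevent / is_subevent_of
    33: ("!", "!"),     # near_antonym
    34: ("&", "&"),     # near_synonym
    49: ("^", None),    # see_also_wn15
    52: ("$", "$"),     # verb_group
    63: ("-c", ";c"),   # category_term / category
    64: ("+", "+"),     # related_to
    66: ("-r", ";r"),   # region_term / region
    68: ("-u", ";u"),   # usage_term / usage
}


def wordnet_pointers(mcr_relation_id):
    """Return the (LR, RL) pointer symbols, or None if the id is not mapped."""
    return RELATION_POINTERS.get(mcr_relation_id)


if __name__ == "__main__":
    print(wordnet_pointers(12))
    print(wordnet_pointers(3))
```

MCR relation ids that do not appear in the table return None, which mirrors the fact that only some MCR relations are transformed.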

There is a difference between the way hypernyms are defined in MCR and WordNet. Also, in the original WordNet the antonym relation holds between two lemmas (the NLTK corpus reader browses the antonyms this way), while in MCR the relation is between synsets. Because of this, we consider that an antonym relation between synsets S1 and S2 in MCR will correspond, in the transformed version, to a set of antonym relations between lemmas L1 and L2, for all L1 in S1 and all L2 in S2.
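
The antonym expansion described above amounts to a Cartesian product of the two lemma sets. The following is a minimal sketch; the function name expand_antonyms and the lemma names in the example are placeholders, not MCR data or code from transform.py.

```python
from itertools import product


def expand_antonyms(lemmas_s1, lemmas_s2):
    """Return every (L1, L2) pair with L1 from synset S1 and L2 from synset S2."""
    return list(product(lemmas_s1, lemmas_s2))


if __name__ == "__main__":
    # Placeholder lemma names: S1 has two lemmas, S2 has two lemmas.
    for pair in expand_antonyms(["a1", "a2"], ["b1", "b2"]):
        print(pair)
```

A synset-level relation between synsets with m and n lemmas therefore becomes m * n lemma-level antonym relations in the transformed data.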
