Datasets & results used when writing "Leveraging RDF Graphs for Crossing Multiple Bilingual Dictionaries" (paper at LREC-2016).
This directory contains the cycle computation for all English nouns in the Apertium RDF data:
EN-nouns.txt: 15,630 English nouns taken from the Apertium RDF data (multiword entries removed).
EN-dict.txt: 15,630 English nouns + their context (set of translation pairs).
Targets-EN.txt: 24,356 Potential Targets generated by cycle computation, plus some figures for each one (see the column list below).
getData.py: Python script used to get data from the Apertium RDF SPARQL server (http://linguistic.linkeddata.es/sparql).
calculateCycles.py: Python script used for cycle calculation (used to generate Targets-EN.txt)
ApertiumRDF-GraphContexts.ipynb: ipynb notebook analysing the context graphs.
ApertiumRDF-PotentialTargtes.ipynb: ipynb notebook analysing the generated Potential Targets.
Targets-EN.txt is a CSV file with the following columns:
Word: the source English word.
Cycles: the number of cycles containing source & target words.
Uniq Cycles: number of 'unique' cycles with source & target words (abcda = acdba).
Nodes: number of nodes in the Word's graph (the local context for Word).
Edges: number of edges in the Word's graph (the local context for Word).
Known Targets: number of already known targets for Word in the Apertium data.
Potential Targets (pT): number of potential targets for Word (nodes in cycles not linked to Word).
Graph Density: graph density (density of the context).
Potential Target: the potential target.
Lan: indicates whether there is another Target word in the same language.
Score: the cycle's density.
InC: the number of cycles the Target word occurs in with the same score.
length: the length of the cycle.
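The Graph Density and Score columns both refer to a density value. As a rough illustration only, the sketch below uses the standard density of a simple undirected graph, 2*E / (N*(N-1)). This formula is an assumption: calculateCycles.py may compute density differently, and the function name graph_density is hypothetical.

```python
def graph_density(nodes, edges):
    """Density of a simple undirected graph: 2*E / (N*(N-1)).

    Assumption: this is the textbook definition, not necessarily
    the one used by calculateCycles.py.
    """
    if nodes < 2:
        # Density is undefined for fewer than two nodes; return 0.0 by convention.
        return 0.0
    return 2.0 * edges / (nodes * (nodes - 1))


# A bare 4-node cycle has 4 edges; a fully connected 4-node graph has 6.
print(graph_density(4, 4))
print(graph_density(4, 6))
```

Under this definition, a cycle whose nodes are also linked by extra translation pairs gets a higher value than a bare cycle of the same length.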
The data files were generated with:
$ python getData.py en EN-nouns.txt > EN-dict.txt (*)
$python calculateCycles.py EN-dict.txt en v > Targets-EN.txt
(*) getData.py generates one dict for each input word. You need to join all of these dicts into a single one before running the calculateCycles.py script.
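The join step could look roughly like the sketch below. It assumes each per-word result has already been loaded as a Python dict mapping a word to a set of translation pairs; the actual output format of getData.py is not described here, so the function name join_dicts and the sample data are hypothetical.

```python
def join_dicts(dicts):
    """Merge several word -> set-of-pairs dicts into a single dict.

    Assumption: values are sets of translation pairs. If the same word
    appears in more than one dict, its sets are unioned.
    """
    merged = {}
    for d in dicts:
        for word, pairs in d.items():
            merged.setdefault(word, set()).update(pairs)
    return merged


# Hypothetical per-word dicts, one per input noun.
per_word = [
    {"bank": {("bank", "banco")}},
    {"house": {("house", "casa")}},
    {"bank": {("bank", "banc")}},
]
joined = join_dicts(per_word)
print(sorted(joined))
```

Unioning the sets means a word that shows up in more than one dict keeps all of its translation pairs.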