ApertiumRDF

Datasets and results used when writing "Leveraging RDF Graphs for Crossing Multiple Bilingual Dictionaries" (paper at LREC-2016).

New-data

This directory contains the cycle computation data and scripts for all English nouns in the Apertium RDF data:

EN-nouns.txt: 15,630 English nouns taken from the Apertium RDF data (multiword entries removed).

EN-dict.txt: 15,630 English nouns + their context (set of translation pairs).

Targets-EN.txt: 24,356 potential targets generated by cycle computation, plus some figures (see below).

getData.py: Python script used to get data from the Apertium RDF SPARQL server (http://linguistic.linkeddata.es/sparql)

calculateCycles.py: Python script used for cycle calculation (used to generate Targets-EN.txt)

ApertiumRDF-GraphContexts.ipynb: Jupyter notebook analysing the context graphs

ApertiumRDF-PotentialTargtes.ipynb: Jupyter notebook analysing the generated potential targets

Targets-EN.txt format

Targets-EN.txt is a CSV file with the following columns:

Word: the source English word.
Cycles: the number of cycles containing source & target words.
Uniq Cycles: number of 'unique' cycles with source & target words (abcda = acdba).
Nodes: number of nodes in the Word's graph (the local context for Word).
Edges: number of edges in the Word's graph (the local context for Word).
Known Targets: number of already known targets for Word in the Apertium data.
Potential Targets (pT): number of potential targets for Word (nodes in cycles not linked to Word).
Graph Density: graph density (density of the context).
Potential Target: the potential target.
Lan: indicates whether there is another Target word in the same language.
Score: the cycle's density.
InC: the number of cycles the Target word occurs in with the same score.
length: the length of the cycle.
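
As a rough illustration of the Nodes, Edges, Graph Density and Uniq Cycles columns, here is a minimal sketch. It assumes the usual density formula for a simple undirected graph, 2E / (N(N - 1)), and assumes that two cycles are the same 'unique' cycle when they visit the same set of nodes (which is how abcda = acdba reads). The function names and the toy graph are hypothetical and are not taken from calculateCycles.py.

```python
def graph_density(num_nodes, num_edges):
    """Density of a simple undirected graph: 2E / (N * (N - 1)).

    Assumption: this is the standard formula; the script may differ.
    """
    if num_nodes < 2:
        return 0.0
    return 2.0 * num_edges / (num_nodes * (num_nodes - 1))


def unique_cycles(cycles):
    """Collapse cycles that visit the same set of nodes (abcda = acdba)."""
    return {frozenset(cycle) for cycle in cycles}


# Hypothetical context graph: undirected translation edges between 4 words.
edges = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c"), ("b", "d")}
nodes = {node for edge in edges for node in edge}

# Hypothetical cycles, each written as the sequence of nodes it visits.
cycles = ["abcda", "acdba", "abca"]

density = graph_density(len(nodes), len(edges))
uniq = unique_cycles(cycles)

print("Nodes:", len(nodes), "Edges:", len(edges), "Density:", density)
print("Cycles:", len(cycles), "Uniq Cycles:", len(uniq))
```

In this toy graph every pair of nodes is linked, so the density is 1.0, and the three cycles collapse to two unique ones.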

How to generate the data

$ python getData.py en EN-nouns.txt > EN-dict.txt (*)

$ python calculateCycles.py EN-dict.txt en v > Targets-EN.txt

(*) getData.py generates one dict for each input word; you need to join all the dicts into a single one before running the calculateCycles.py script.
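
The join step could look roughly like the sketch below. The exact layout of the getData.py output is not described here, so this only shows the merge itself on in-memory Python dicts, assuming each per-word dict maps a word to its context (a set of translation pairs). The function name join_dicts and the sample entries are hypothetical.

```python
def join_dicts(per_word_dicts):
    """Merge per-word dicts into one, unioning the contexts of repeated words."""
    joined = {}
    for word_dict in per_word_dicts:
        for word, pairs in word_dict.items():
            # setdefault keeps pairs already collected for this word.
            joined.setdefault(word, set()).update(pairs)
    return joined


# Hypothetical per-word dicts; the real getData.py output may be laid out differently.
per_word = [
    {"house": {("house-en", "casa-es")}},
    {"dog": {("dog-en", "perro-es"), ("perro-es", "gos-ca")}},
    {"house": {("casa-es", "maison-fr")}},
]

joined = join_dicts(per_word)
print(sorted(joined))
print(len(joined["house"]), len(joined["dog"]))
```

Unioning the sets rather than overwriting means a word that appears in more than one dict keeps all of its translation pairs.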