GSOC2013_Progress_Hady Elsahar

Integrating Wikidata in DBpedia

proposal

full project proposal

Students

Hady Elsahar

Mentors

Sebastian Hellmann
Dimitris Kontokostas

# Project Progress:

week 1 :

creating a public clone of the Extraction framework
preparing the development environment
compiling the Extraction framework
getting to know the main class structures of the DBpedia Extraction framework

readings

papers #1, #2 and #4 in the DBpedia publications list
http://wiki.dbpedia.org/Documentation

important discussions :

Warm-up task - URI schema to use for Wikidata - whether or not to use the Wikidata RDF format

week 2 [17-6-2013] :

exploring the PubSubHubbub Protocol
installing a local hub and subscribing to an RSS feed

Overview of the PubSubHubbub protocol
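As a rough illustration of what subscribing to a feed through a hub involves, the sketch below builds the form-encoded body of a subscription request. It only uses the three core parameters of the protocol (hub.mode, hub.topic, hub.callback); the function name and the example URLs are hypothetical, and a real hub may require further parameters depending on the spec version it implements.

```python
from urllib.parse import urlencode


def build_subscription_body(topic_url, callback_url, mode="subscribe"):
    """Return the form-encoded body of a PubSubHubbub (un)subscribe request.

    The body is meant to be POSTed to the hub with the content type
    application/x-www-form-urlencoded. Nothing is sent here.
    """
    if mode not in ("subscribe", "unsubscribe"):
        raise ValueError("mode must be 'subscribe' or 'unsubscribe'")
    params = [
        ("hub.mode", mode),             # what the subscriber asks for
        ("hub.topic", topic_url),       # the feed to follow
        ("hub.callback", callback_url), # where the hub pushes updates
    ]
    return urlencode(params)


if __name__ == "__main__":
    # Hypothetical feed and callback URLs, for illustration only.
    print(build_subscription_body("http://example.org/feed.rss",
                                  "http://localhost:8080/callback"))
```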

readings

PubSubHubbub home page : https://code.google.com/p/pubsubhubbub/

important discussions :

week 3 [24-6-2013] :

Create an RDF dump out of 1-2K Wikidata entities
Work on the language links from the API:
1. process the Wikidata info and generate a master interlanguage (IL) links file.
2. produce language-specific same_as files from the master IL links file.
Create a few mappings in the mappings wiki (as owl:equivalentProperty), starting with the most common properties in the dumps
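To make step 1 more concrete, here is a minimal sketch of turning the sitelinks of one Wikidata entity into master-file triples. Everything about the layout is an assumption made for illustration: the simplified `sitelinks` dictionary (site id to page title), the use of `http://www.wikidata.org/entity/` as subject namespace, and the naive title-to-URI conversion, which ignores the percent-encoding rules the Extraction framework applies.

```python
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
WIKIDATA_NS = "http://www.wikidata.org/entity/"  # assumed subject namespace


def dbpedia_uri(site, title):
    """Map a site id such as 'dewiki' plus a page title to a DBpedia URI.

    Simplification: spaces become underscores, no further encoding.
    """
    lang = site[: -len("wiki")]
    host = "dbpedia.org" if lang == "en" else lang + ".dbpedia.org"
    return "http://{}/resource/{}".format(host, title.replace(" ", "_"))


def master_triples(entity_id, sitelinks):
    """Return N-Triples lines linking one entity to its language editions."""
    lines = []
    for site in sorted(sitelinks):        # sorted for a stable output order
        if not site.endswith("wiki"):     # skip anything that is not '<lang>wiki'
            continue
        lines.append("<{}{}> <{}> <{}> .".format(
            WIKIDATA_NS, entity_id, SAME_AS,
            dbpedia_uri(site, sitelinks[site])))
    return lines


if __name__ == "__main__":
    for line in master_triples("Q64", {"enwiki": "Berlin", "dewiki": "Berlin"}):
        print(line)
```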

important discussions :

Different Language links Files

Weeks 4,5,6,7 Language Links Extraction [1-7-2013] -> [1-8-2013] :

Step 1: Creating the master LLinks file (replacing the old bash commands with Scala code)
Step 2: Creating the language-specific LLinks files in folders (after several code iterations we agreed that we can rely on the links arriving in blocks); implemented the algorithm
Updating the code to use existing Extraction framework utilities instead of rewriting them
Code reviews 1, 2, 3
More code reviews, some code conflicts
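The block assumption in step 2 can be sketched as follows: if all triples of one Wikidata entity are adjacent in the master file, the file can be streamed once and each block expanded into pairwise owl:sameAs links per language, without keeping the whole file in memory. This is a Python illustration under assumed input conventions (one N-Triples line per link, DBpedia resource URIs as objects), not the Scala code of the project; all function names are hypothetical.

```python
from collections import defaultdict
from itertools import groupby
from urllib.parse import urlparse

SAME_AS = "<http://www.w3.org/2002/07/owl#sameAs>"


def parse_triple(line):
    """Split an N-Triples line into (subject, object), brackets kept."""
    subject, _predicate, rest = line.strip().split(None, 2)
    return subject, rest.rstrip(" .")


def language_of(resource):
    """Derive the language code from a DBpedia resource URI."""
    host = urlparse(resource.strip("<>")).hostname
    return "en" if host == "dbpedia.org" else host.split(".")[0]


def specific_links(master_lines):
    """Group consecutive lines by subject and emit pairwise links per language."""
    per_language = defaultdict(list)
    triples = (parse_triple(l) for l in master_lines if l.strip())
    # groupby only merges adjacent items, which is exactly the block assumption.
    for _entity, block in groupby(triples, key=lambda t: t[0]):
        resources = [obj for _subj, obj in block]
        for source in resources:
            for target in resources:
                if source != target:
                    per_language[language_of(source)].append(
                        "{} {} {} .".format(source, SAME_AS, target))
    return dict(per_language)
```

In a real run each language's list would be written to its own file instead of being collected in a dictionary.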

important links/Discussions :

--- off to Leipzig 2-8 > 6-8

week 8 [5-8-2013] - [11-8-2013] :

updating pom.xml (adding Scala launchers for the LL Scala scripts)
setting up the lgd.aksw server (cloning repos, resolving conflicted files, running maven install)
running the wda-export-data.py script on the lgd server

important discussions/Links :

WikiData RDF export available

week 9 [12-8-2013] - [18-8-2013] :

Language links extraction process:

1. Running the wda script with the option 'turtle-links'
2. Unzipping the extracts and converting them to N-Triples format using rapper: rapper -i turtle turtle-20130808-links.ttl
3. Generating the master LL file using the command: sudo mvn scala:run -Dlauncher=GenerateLLMasterFile
4. Generating the language-specific LL files: sudo mvn scala:run -Dlauncher=GenerateLLSpecificFiles

PS: for steps 3 and 4, update the arguments of each script (the locations of the input and output dumps) in the pom.xml file inside the scripts folder
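Steps 2 to 4 can be chained from a small driver script. The sketch below only assembles the commands quoted above and, by default, returns them without executing anything. The function names, the `scripts_dir` argument and the explicit `-o ntriples` output option for rapper are choices made for this illustration; `sudo` is left out, and step 1 is omitted because the exact invocation of the wda script is not given here.

```python
import subprocess


def rapper_command(turtle_file):
    """Command converting a Turtle file to N-Triples on standard output."""
    return ["rapper", "-i", "turtle", "-o", "ntriples", turtle_file]


def launcher_command(launcher):
    """Command running one of the Scala launchers defined in pom.xml."""
    return ["mvn", "scala:run", "-Dlauncher=" + launcher]


def run_pipeline(turtle_file, ntriples_file, scripts_dir, dry_run=True):
    """Build steps 2-4; execute them only when dry_run is False."""
    steps = [
        # (command, working directory, file receiving stdout)
        (rapper_command(turtle_file), None, ntriples_file),
        (launcher_command("GenerateLLMasterFile"), scripts_dir, None),
        (launcher_command("GenerateLLSpecificFiles"), scripts_dir, None),
    ]
    if not dry_run:
        for command, cwd, output in steps:
            if output is not None:
                with open(output, "wb") as out:
                    subprocess.run(command, cwd=cwd, stdout=out, check=True)
            else:
                subprocess.run(command, cwd=cwd, check=True)
    return [command for command, _cwd, _output in steps]


if __name__ == "__main__":
    for command in run_pipeline("turtle-20130808-links.ttl",
                                "turtle-20130808-links.nt", "scripts"):
        print(" ".join(command))
```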

what's done so far :

7M triples passed through the rapper phase without encountering a bug
Ran the master LL file extraction (the output dump is in /root/hady_wikidata_extraction/Datasets/languagelinks/MasterLLfile.nt)
Ran the language-specific LL files extraction (the output is now in /root/hady_wikidata_extraction/Datasets/languagelinks/LLfiles/)

Benchmark (for the 7 Million triples on the lgd server):

Generating the master LL file: 28 seconds
Generating the language-specific files: 3 minutes, 10 seconds
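For comparison with later runs, the two timings can be converted into an approximate throughput. The calculation below treats the input as exactly 7,000,000 triples, which is a rounding assumption; the helper name is hypothetical.

```python
def triples_per_second(triples, seconds):
    """Average number of triples processed per second."""
    return triples / seconds


TRIPLES = 7_000_000  # rounded size of the input, taken as exact here

master_rate = triples_per_second(TRIPLES, 28)             # 28 seconds
specific_rate = triples_per_second(TRIPLES, 3 * 60 + 10)  # 3 min 10 s = 190 s

if __name__ == "__main__":
    print("master file:    {:.0f} triples/s".format(master_rate))
    print("specific files: {:.0f} triples/s".format(specific_rate))
```

Under that assumption the master file step handles about 250,000 triples per second and the language-specific step about 36,800.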

Updates #2:

Running the new version of the wda Python script
Running rapper on the resulting dump (/Datasets/turtle-20130811-links.ttl)
[Bugs Found] only 7.5M triples were extracted (500K more than before) in (/Datasets/turtle-20130811-links.nt)
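A quick way to check how many triples survived the rapper phase is to count the statement lines of the resulting N-Triples file, since the format holds one triple per line. The helper names below are hypothetical, and the path in the usage example is the one mentioned above.

```python
def count_triples(lines):
    """Count N-Triples statements, skipping blank lines and '#' comments."""
    return sum(
        1
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    )


def count_triples_in_file(path):
    """Stream a file line by line so large dumps do not need to fit in memory."""
    with open(path, encoding="utf-8") as handle:
        return count_triples(handle)


if __name__ == "__main__":
    import sys

    # Usage: python count_triples.py /Datasets/turtle-20130811-links.nt
    print(count_triples_in_file(sys.argv[1]))
```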

important links :

wda script Dump Bug report
