
GSOC2013_Progress_Hady Elsahar


Integrating Wikidata into DBpedia

Proposal

Full project proposal

Students

  • Hady Elsahar

Mentors

  • Sebastian Hellmann
  • Dimitris Kontokostas

# Project Progress

Week 1:

  • Public clone of the Extraction Framework
  • Preparing the development environment
  • Compiling the Extraction Framework
  • Getting to know the main class structure of the DBpedia Extraction Framework

Readings:

Important discussions:


Week 2 [17-6-2013]:

  • Exploring the PubSubHubbub protocol
  • Installing a local hub and subscribing to an RSS feed

Overview of the PubSubHubbub protocol
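To make the subscriber side of the protocol concrete, the sketch below sends a subscription request to a hub: the subscriber POSTs hub.mode=subscribe, the feed's topic URL and its own callback URL, and the hub then verifies the callback and starts pushing new feed entries to it. This is a minimal sketch only; the hub, topic and callback URLs are placeholders, not the ones used during this week's experiments.

```scala
import java.net.{HttpURLConnection, URL, URLEncoder}

// Minimal sketch of a PubSubHubbub subscription request.
// The hub, topic and callback URLs below are hypothetical placeholders.
object SubscribeToFeed {
  def main(args: Array[String]): Unit = {
    val params = Map(
      "hub.mode"     -> "subscribe",
      "hub.topic"    -> "http://example.org/some-feed.rss",  // the feed we want pushed to us
      "hub.callback" -> "http://localhost:9090/callback",    // where the hub delivers new entries
      "hub.verify"   -> "sync"
    ).map { case (k, v) =>
      URLEncoder.encode(k, "UTF-8") + "=" + URLEncoder.encode(v, "UTF-8")
    }.mkString("&")

    // the local hub's subscription endpoint (placeholder)
    val hubEndpoint = new URL("http://localhost:8080/subscribe")
    val conn = hubEndpoint.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setDoOutput(true)
    conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded")
    conn.getOutputStream.write(params.getBytes("UTF-8"))

    // 204 (or 202 when the hub verifies asynchronously) means the subscription was accepted
    println(s"Hub responded with HTTP ${conn.getResponseCode}")
  }
}
```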

Readings:

Important discussions:


Week 3 [24-6-2013]:

  • Create an RDF dump out of 1-2K Wikidata entities
  • Work on the language links from the API:
    1. Process the Wikidata info and generate a master interlanguage (IL) links file.
    2. Produce language-specific same_as files from the master IL links file (a small sketch of this step is given after this list).
  • Create a few mappings in the mappings wiki (as owl:equivalentProperty), starting with the most common properties in the dumps
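The sketch below illustrates what step 2 could produce, assuming one record of the master IL links file holds a Wikidata item's Wikipedia titles per language: for a chosen language edition, each other edition's title becomes an owl:sameAs link between the corresponding DBpedia resources. The URI patterns and helper names are illustrative assumptions, not the project's actual code.

```scala
// Illustrative sketch (not the project's actual code) of producing language-specific
// same_as lines from one master IL-links record: a Wikidata item with its per-language titles.
object SameAsSketch {
  private val sameAs = "<http://www.w3.org/2002/07/owl#sameAs>"

  // Assumed URI pattern: English on dbpedia.org, other languages on <lang>.dbpedia.org
  private def dbpediaUri(lang: String, title: String): String = {
    val local = title.replace(' ', '_')
    if (lang == "en") s"<http://dbpedia.org/resource/$local>"
    else s"<http://$lang.dbpedia.org/resource/$local>"
  }

  // For the target language, emit one sameAs triple per other language edition
  def sameAsLines(titlesByLang: Map[String, String], targetLang: String): Seq[String] =
    titlesByLang.get(targetLang).toSeq.flatMap { targetTitle =>
      (titlesByLang - targetLang).map { case (lang, title) =>
        s"${dbpediaUri(targetLang, targetTitle)} $sameAs ${dbpediaUri(lang, title)} ."
      }
    }

  def main(args: Array[String]): Unit = {
    // One record of the master file: the item's Wikipedia titles per language
    val berlin = Map("en" -> "Berlin", "de" -> "Berlin", "fr" -> "Berlin")
    sameAsLines(berlin, "de").foreach(println)  // two N-Triples lines: de -> en and de -> fr
  }
}
```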

Important discussions:


Weeks 4-7: Language Links Extraction [1-7-2013] -> [1-8-2013]:

  • Step 1: Creating the master language links (LL) file (replacing the old bash commands with Scala code)
  • Step 2: Creating the language-specific LL extraction in folders (after several code iterations we agreed that we can rely on the links coming in blocks); implemented the algorithm (a sketch is given after this list)
  • Updating the code to use existing Extraction Framework utilities instead of rewriting them
  • Code reviews 1, 2, 3
  • More code reviews, some code conflicts
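The block assumption in step 2 is what makes a streaming implementation possible: because all links of one subject appear consecutively in the master LL file, the file can be read once and handled one block at a time instead of being loaded into memory. Below is a rough sketch of that grouping step, assuming the master file is subject-sorted N-Triples; the names and exact line layout are illustrative, not the code that was actually merged.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

// Rough sketch of the grouping that step 2 relies on: all triples of one subject
// arrive consecutively ("in blocks") in the master LL file, so the file can be
// streamed once and processed block by block. The layout and names are assumptions.
object GroupLinkBlocks {
  case class Link(subject: String, target: String)

  // "<subject> <predicate> <object> ." -> Link(subject, object)
  def parseLine(line: String): Link = {
    val parts = line.split("\\s+", 3)
    Link(parts(0), parts(2).stripSuffix(" .").trim)
  }

  // Lazily group consecutive parsed lines that share the same subject
  def blocks(lines: Iterator[String]): Iterator[Seq[Link]] = new Iterator[Seq[Link]] {
    private val links = lines.filter(_.trim.nonEmpty).map(parseLine).buffered
    def hasNext: Boolean = links.hasNext
    def next(): Seq[Link] = {
      val subject = links.head.subject
      val block = ArrayBuffer[Link]()
      while (links.hasNext && links.head.subject == subject)
        block += links.next()
      block.toSeq
    }
  }

  def main(args: Array[String]): Unit = {
    val source = Source.fromFile(args(0))  // path to the master LL file
    try blocks(source.getLines()).foreach(b => println(s"${b.head.subject}: ${b.size} links"))
    finally source.close()
  }
}
```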

Important links/discussions:


--- Off to Leipzig, 2-8 to 6-8


Week 8 [5-8-2013] - [11-8-2013]:

  • Updating pom.xml (adding a Scala launcher for the LL Scala scripts)
  • Setting up the lgd.aksw server (cloning repos, managing conflicted files, running maven install)
  • Running the wda-export-data.py script on the lgd server

Important discussions/links:


Week 9 [12-8-2013] - [18-8-2013]:

Language links extraction process:

  • Running the wda script with the option 'turtle-links'
  • Unzipping the extract and converting it to N-Triples format using rapper: `rapper -i turtle turtle-20130808-links.ttl`
  • Generating the master LL file: `sudo mvn scala:run -Dlauncher=GenerateLLMasterFile`
  • Generating the specific language links files: `sudo mvn scala:run -Dlauncher=GenerateLLSpecificFiles`

PS: for steps 3 and 4, update the arguments of each script (the locations of the input/output dumps) in the pom.xml file inside the scripts folder.
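To make that note concrete: each launcher defined in that pom.xml passes its argument list to the corresponding Scala script, so changing the dump locations means editing that launcher's arguments. Below is a minimal, hypothetical sketch of how such a script's entry point might read those two arguments; the real scripts may read more or different ones.

```scala
import java.io.File

// Hypothetical sketch of a launcher entry point (not the actual GenerateLLMasterFile code):
// the input and output dump locations arrive as the program arguments configured
// for the launcher in pom.xml.
object GenerateLLMasterFileSketch {
  def main(args: Array[String]): Unit = {
    require(args.length >= 2, "expected arguments: <input links .nt file> <output master LL file>")

    val input  = new File(args(0))  // e.g. the rapper output (turtle-YYYYMMDD-links.nt)
    val output = new File(args(1))  // e.g. Datasets/languagelinks/MasterLLfile.nt

    require(input.exists(), s"input dump not found: ${input.getAbsolutePath}")
    Option(output.getParentFile).foreach(_.mkdirs())  // make sure the output folder exists

    println(s"reading links from $input, writing master LL file to $output")
    // ... the actual master-file generation would go here ...
  }
}
```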

What's done so far:

  • 7M triples passed through the rapper phase without encountering a bug
  • Running the master LL file extraction (the output dump is in /root/hady_wikidata_extraction/Datasets/languagelinks/MasterLLfile.nt)
  • Running the specific LL files extraction (the output is now in /root/hady_wikidata_extraction/Datasets/languagelinks/LLfiles/)

Benchmark (for the 7 million triples on the lgd server):

  • Generating the master LL file: 28 seconds
  • Generating the specific files: 3 minutes, 10 seconds

Updates #2:

  • Running the new version of the wda Python script
  • Running rapper on the resulting dump (/Datasets/turtle-20130811-links.ttl)
  • [Bugs found] only 7.5M triples extracted (500K more) in /Datasets/turtle-20130811-links.nt