# GSOC2013_Progress_Hady Elsahar
hady elsahar edited this page Sep 4, 2013 · 36 revisions

Mentors:
- Sebastian Hellmann
- Dimitris Kontokostas
# Project Progress:
- Public clone of the extraction framework
- Preparing the development environment
- Compiling the extraction framework
- Getting to know the main class structure of the DBpedia extraction framework

Readings:
- Papers #1, #2, and #4 in the DBpedia publications
- http://wiki.dbpedia.org/Documentation

Important discussions:
- Exploring the PubSubHubbub protocol
- Installing a local hub and subscribing to some RSS feeds
- Overview of the PubSubHubbub protocol

Readings:
- PubSubHubbub home page: https://code.google.com/p/pubsubhubbub/
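The subscription flow explored above boils down to an HTTP POST from the subscriber to the hub. A minimal sketch of building the form-encoded subscription request per the PubSubHubbub 0.3 spec (the callback and topic URLs are placeholders, not the actual feeds used):

```scala
import java.net.URLEncoder

// Build the form-encoded body of a PubSubHubbub 0.3 subscription request.
// The hub later verifies the subscription by calling the callback URL with
// a hub.challenge value that the subscriber must echo back.
object PshbSubscribe {
  private def enc(s: String): String = URLEncoder.encode(s, "UTF-8")

  def subscriptionBody(callback: String, topic: String): String =
    Seq(
      "hub.mode"     -> "subscribe",
      "hub.callback" -> callback, // our endpoint the hub will verify and notify
      "hub.topic"    -> topic,    // the feed URL we want updates for
      "hub.verify"   -> "sync"    // verification mode (sync or async in v0.3)
    ).map { case (k, v) => enc(k) + "=" + enc(v) }.mkString("&")

  def main(args: Array[String]): Unit =
    println(subscriptionBody("http://localhost:8080/callback",
                             "http://example.org/feed.rss"))
}
```

This body is POSTed to the hub endpoint with content type `application/x-www-form-urlencoded`.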
Important discussions:
- Create an RDF dump out of 1-2K Wikidata entities
- Work on the language links from the API:
  - Process the Wikidata info and generate a master IL (interlanguage) links file
  - Produce language-specific same_as files from the master IL links file
- Create a few mappings in the mappings wiki (as owl:equivalentProperty), starting with the most common properties in the dumps

Important discussions:
- Step 1: creating the master LLinks file (replacing the old bash commands with Scala code)
- Step 2: creating the language-specific LLinks extraction in folders (after a number of code iterations we agreed that we can depend on the links arriving in blocks); implemented the algorithm
- Updating the code to use existing extraction framework utilities instead of rewriting them
- Code reviews 1, 2, 3
- More code reviews; resolved some code conflicts
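The "links arrive in blocks" assumption above means all interlanguage links of one item sit on adjacent lines of the dump, so one streaming pass that flushes a buffer whenever the subject changes is enough. A sketch of that idea with hypothetical line formats (not the actual extraction code):

```scala
// Group consecutive lines that share a subject. Because the dump lists all
// language links of an item in one contiguous block, a single streaming pass
// suffices: buffer lines while the subject stays the same, flush on change.
object BlockGrouper {
  // Assumed line shape: "<subject> <lang> <title>", subject first, space-delimited.
  def subjectOf(line: String): String = line.takeWhile(_ != ' ')

  def groupBlocks(lines: Iterator[String]): Iterator[List[String]] =
    new Iterator[List[String]] {
      private val it = lines.buffered
      def hasNext: Boolean = it.hasNext
      def next(): List[String] = {
        val subj = subjectOf(it.head)
        val buf = scala.collection.mutable.ListBuffer[String]()
        while (it.hasNext && subjectOf(it.head) == subj) buf += it.next()
        buf.toList
      }
    }
}
```

This keeps memory bounded by the largest single block rather than the whole dump, which is what makes the per-language split feasible on large files.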
Important links/discussions:
- Implicit conversions in Scala
- The master branch uses Scala 2.9; the dump branch uses Scala 2.10
- Updating RichReader.foreach to support end-of-line detection
- Recent commits: https://github.com/hadyelsahar/extraction-framework/commits/lang-link-extract
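The two items above fit together: a RichReader-style wrapper can add a line-wise `foreach` to a plain `BufferedReader` via an implicit conversion. A simplified sketch of the pattern (not the framework's actual RichReader, which additionally detects the end-of-line characters themselves):

```scala
import java.io.BufferedReader

// Pimp-my-library pattern: an implicit conversion gives BufferedReader a
// line-wise foreach. readLine() returns null at end of stream, ending the loop.
class RichReader(reader: BufferedReader) {
  def foreach(proc: String => Unit): Unit = {
    var line = reader.readLine()
    while (line != null) {
      proc(line)
      line = reader.readLine()
    }
  }
}

object RichReader {
  implicit def wrapReader(reader: BufferedReader): RichReader = new RichReader(reader)
}
```

With the implicit in scope, `reader.foreach(line => ...)` reads naturally while the conversion stays invisible at the call site.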
--- Off to Leipzig (2-8 until 6-8)
- Updating pom.xml (adding a Scala launcher for the LL Scala scripts)
- Setting up the lgd.aksw server (cloning repos, resolving conflicted files, running mvn install)
- Running the wda-export-data.py script on the lgd server

Important discussions/links:
Language links extraction process:

1. Run the wda script with the `turtle-links` option.
2. Unzip the extracts and convert them to N-Triples format using rapper:

        rapper -i turtle turtle-20130808-links.ttl

3. Generate the master LL file:

        sudo mvn scala:run -Dlauncher=GenerateLLMasterFile

4. Generate the language-specific links files:

        sudo mvn scala:run -Dlauncher=GenerateLLSpecificFiles

PS: in steps 3 and 4, update each script's arguments (the locations of the input/output dumps) in the pom.xml file inside the scripts folder.
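The PS above refers to launcher definitions in the scripts module's pom.xml. A sketch of what such a launcher looks like in the maven-scala-plugin's launcher syntax; the main class and argument paths below are placeholders, not the project's actual values:

```xml
<!-- Hypothetical launcher configuration; the <args> entries are the
     input/output dump locations that the PS says must be updated. -->
<plugin>
  <groupId>org.scala-tools</groupId>
  <artifactId>maven-scala-plugin</artifactId>
  <configuration>
    <launchers>
      <launcher>
        <id>GenerateLLMasterFile</id>
        <mainClass>...</mainClass>
        <args>
          <arg>/path/to/input/turtle-links.nt</arg>
          <arg>/path/to/output/MasterLLfile.nt</arg>
        </args>
      </launcher>
    </launchers>
  </configuration>
</plugin>
```

`mvn scala:run -Dlauncher=GenerateLLMasterFile` then runs the named launcher's main class with these arguments.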
What's done so far:
- 7M triples passed the rapper phase without encountering a bug
- Ran the master LL file extraction (output dump in /root/hady_wikidata_extraction/Datasets/languagelinks/MasterLLfile.nt)
- Ran the specific LL file extraction (output now in /root/hady_wikidata_extraction/Datasets/languagelinks/LLfiles/)

Benchmark (for the 7 million triples on the lgd server):
- Generating the master LL file: 28 seconds
- Generating the specific files: 3 minutes, 10 seconds
Updates #2:
- Ran the new version of the wda Python script
- Ran rapper on the resulting dump (/Datasets/turtle-20130811-links.ttl)
- [Bugs found] only 7.5M triples extracted (500K more) in /Datasets/turtle-20130811-links.nt
- WikiData-DBpedia-Dump-Release-v.0.1
- Re-implementing the Quad method to accept Wikidata string properties with an unknown language:

        quads += new Quad(null, DBpediaDatasets.WikidataFacts, subjectUri, property, fact, page.sourceUri, context.ontology.datatypes("xsd:string"))

- Updated the OntologyReader class to get property/class mappings between Wikidata and DBpedia
- Wikidata mapped dump produced, with mapped properties for URI triples only
- OntologyReader class update commits: 1, 2
- Discussion about using either Wikidata URIs or a new DBpedia namespace for them
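The Quad call above passes `null` as the language and `xsd:string` as the datatype, which is what the re-implementation had to permit. A simplified stand-in showing the shape of such a quad (not the framework's actual Quad class, whose constructor takes ontology objects rather than strings):

```scala
// Simplified stand-in for the framework's Quad: the language may be null for
// language-neutral Wikidata string facts, while the datatype records xsd:string.
case class Quad(
  language: String, // null for plain string values with unknown language
  dataset: String,
  subject: String,
  predicate: String,
  value: String,
  context: String,
  datatype: String
)

object WikidataFactExample {
  def stringFact(subjectUri: String, property: String,
                 fact: String, sourceUri: String): Quad =
    Quad(null, "wikidata_facts", subjectUri, property, fact, sourceUri,
         "http://www.w3.org/2001/XMLSchema#string")
}
```

The key point of the change is that a null language no longer causes the constructor to reject the statement; the datatype alone marks the value as a string literal.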