GSOC2013_Progress_Hady Elsahar

integrating WikiData in DBpedia

proposal

full project proposal

Students

Hady Elsahar

Mentors

Sebastian Hellmann
Dimitris Kontokostas

#Project Progress:

week 1 :

public clone of Extraction framework
preparing development environment
compiling the Extraction framework
Getting to know DBpedia main classes structures of the extraction framework

readings

papers No. #1 #2 #4 in DBpedia publications
http://wiki.dbpedia.org/Documentation

important discussions :

Warmup task - URI scheme to use for Wikidata - whether or not to use the Wikidata RDF format

week 2 [17-6-2013] :

exploring the PubSubHubbub Protocol
installing a local Hub and subscribing to some RSS Feed

Overview about the PubSubHubbub protocol

readings

PubSubHubbub home page : https://code.google.com/p/pubsubhubbub/

important discussions :

week 3 [24-6-2013] :

Create an RDF dump out of 1-2K Wikidata entities
work on the language links from the API:
1. process Wikidata info, generate the master IL links file
2. produce language-specific same_as files from the master IL links file
Create a few mappings in the mappings wiki (as owl:equivalentProperty), starting with the most common properties in the dumps

important discussions :

Different Language links Files

Weeks 4,5,6,7 Language Links Extraction [1-7-2013] -> [1-8-2013] :

step 1: Creating Master LLinks file (replacing the old bash commands with scala code)
Step 2: Creating language-specific LLinks extraction in folders (after several code iterations we agreed that we can rely on the links arriving in blocks); implemented the algorithm
updating code to utilize some Extraction framework utilities instead of rewriting them
Code Reviews 1 , 2 ,3
More code reviews , some code conflicts
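
The block-based step 2 can be sketched roughly as follows. This is a minimal Python illustration, not the Scala code from the framework. It assumes (hypothetically) that every master file line has the form `<wikidata entity> <predicate> <language DBpedia URI> .` and that all lines of one entity are contiguous; the function names are made up for this example.

```python
import re
from collections import defaultdict
from itertools import groupby

SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
# One N-Triples line with three URI terms (assumed master file layout).
TRIPLE = re.compile(r"<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.")
# Language code taken from the host name of a DBpedia resource URI.
LANG = re.compile(r"http://([a-z\-]+)\.dbpedia\.org/resource/")


def parse(line):
    """Return (subject, predicate, object) or None for non-matching lines."""
    match = TRIPLE.match(line.strip())
    return match.groups() if match else None


def language_specific_links(master_lines):
    """Group contiguous lines by subject and emit sameAs links per language."""
    triples = filter(None, map(parse, master_lines))
    output = defaultdict(list)
    # groupby only merges adjacent items, so it relies on links coming in blocks.
    for _, block in groupby(triples, key=lambda triple: triple[0]):
        uris = [obj for _, _, obj in block]
        for source in uris:
            lang = LANG.match(source)
            if lang is None:
                continue
            for target in uris:
                if target != source:
                    output[lang.group(1)].append(
                        f"<{source}> <{SAME_AS}> <{target}> ."
                    )
    return dict(output)
```

Because only one block is held in memory at a time, the master file never has to be loaded or sorted as a whole, which is the point of relying on the block ordering.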

important links/Discussions :

--- off to Leipzig 2-8 > 6-8

week 8 [5-8-2013] - [11-8-2013] :

updating Pom.xml (adding scala launcher for LL scala scripts)
setting lgd.aksw server (cloning repos , managing conflicted files , run maven install)
Running wda-export-data.py script on lgd server

important discussions/Links :

WikiData RDF export available

week 9 [12-8-2013] - [18-8-2013] :

Language links extraction process:

Running the wda script and using the option 'turtle-links'
unzipping the extract and converting it to N-Triples format using rapper: rapper -i turtle turtle-20130808-links.ttl
Generating Master LLfiles using command sudo mvn scala:run -Dlauncher=GenerateLLMasterFile
Generate specific Language links files : sudo mvn scala:run -Dlauncher=GenerateLLSpecificFiles

ps: in steps 3 and 4 update the arguments of each script (the location of input / output dumps ) in the pom.xml file inside the scripts folder
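
The steps above could be chained from one small driver script. The sketch below is only an illustration in Python: the helper names and the `scripts_dir` parameter are hypothetical, `sudo` is left out, and rapper's output is redirected to a `.nt` file next to the input (rapper writes to standard output). The launcher names are the ones listed above.

```python
import subprocess
from pathlib import Path


def rapper_command(ttl_path):
    """argv for converting a Turtle file to N-Triples with rapper."""
    return ["rapper", "-i", "turtle", "-o", "ntriples", str(ttl_path)]


def launcher_command(launcher):
    """argv for running one of the Scala launchers defined in pom.xml."""
    return ["mvn", "scala:run", f"-Dlauncher={launcher}"]


def run_pipeline(ttl_path, scripts_dir):
    """Run rapper, then both launchers; stops at the first failing step."""
    nt_path = Path(ttl_path).with_suffix(".nt")
    with open(nt_path, "wb") as out:
        subprocess.run(rapper_command(ttl_path), stdout=out, check=True)
    # Input and output locations are still read from pom.xml in scripts_dir.
    for launcher in ("GenerateLLMasterFile", "GenerateLLSpecificFiles"):
        subprocess.run(launcher_command(launcher), cwd=scripts_dir, check=True)
    return nt_path
```

`check=True` makes the driver fail early if rapper or Maven returns a non-zero exit code, so a broken dump does not silently reach the later steps.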

what's done so far :

7M triples passed the rapper phase without encountering a bug
Running Master LL files extraction ( the output dump in /root/hady_wikidata_extraction/Datasets/languagelinks/MasterLLfile.nt )
Running Specific LL files extraction ( the output now is in /root/hady_wikidata_extraction/Datasets/languagelinks/LLfiles/ )

Benchmark (for the 7 Million triples on the lgd server):

Generating Master LLfile : 28 secs
Generating Specific Files : 3 minutes, 10 seconds

Updates #2:

Running the new version of wda python script
Running rapper on the resulted dump (/Datasets/turtle-20130811-links.ttl)
[Bugs Found] only 7.5M triples extracted (500K more than the previous run) in (/Datasets/turtle-20130811-links.nt)

important links :

wda script Dump Bug report

week 10 [19-8-2013] - [25-8-2013] :

Setting Extraction framework environment and Running initial Code added for WikiJsonParer Language links Extraction (locally and on the lgd server)
updating Extraction Framework to Download Wikidata Dumps and Daily Wikidata Dumps
Writing a WikidataLLExtractor for extraction of DBpedia Language links in the format

<http://oc.dbpedia.org/resource/Betta_splendens> <http://www.w3.org/2002/07/owl#sameAs> <http://ceb.dbpedia.org/resource/Betta_splendens> .
<http://oc.dbpedia.org/resource/Betta_splendens> <http://www.w3.org/2002/07/owl#sameAs> <http://war.dbpedia.org/resource/Betta_splendens> .
<http://oc.dbpedia.org/resource/Betta_splendens> <http://www.w3.org/2002/07/owl#sameAs> <http://bn.dbpedia.org/resource/সিয়ামিজ_লড়াকু_মাছ> .

updating WikidataJsonParser for extraction of WikiData Labels
Writing an extractor for Wikidata labels in the format

<http://wikidata.dbpedia.org/resource/Q549> <http://www.w3.org/2000/01/rdf-schema#label> "Bojovnica pestrá"@sk .
<http://wikidata.dbpedia.org/resource/Q549> <http://www.w3.org/2000/01/rdf-schema#label> "Beta"@tr .
<http://wikidata.dbpedia.org/resource/Q549> <http://www.w3.org/2000/01/rdf-schema#label> "Жауынгер балық"@kk .
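
As an illustration of the label format, the following Python sketch serialises a language-to-label map into triples like the ones above. It is not the extractor itself: the function names and the dictionary input are assumptions made for this example, and only the basic N-Triples string escapes are handled.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RESOURCE_PREFIX = "http://wikidata.dbpedia.org/resource/"


def escape_literal(text):
    """Escape backslash, quote and control characters for an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
        .replace("\t", "\\t")
    )


def label_triples(entity_id, labels):
    """Build one rdfs:label triple per language, sorted by language code."""
    subject = RESOURCE_PREFIX + entity_id
    return [
        f'<{subject}> <{RDFS_LABEL}> "{escape_literal(text)}"@{lang} .'
        for lang, text in sorted(labels.items())
    ]
```

Non-ASCII characters are written as they are, matching the samples above; only characters that would break the literal syntax are escaped.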

Creating wikidataSameasExtractor to extract sameAs mapping links between Wikidata entities and DBpedia URIs

<http://wikidata.dbpedia.org/resource/Q1934> <http://www.w3.org/2002/07/owl#sameAs> <http://da.dbpedia.org/resource/Sidney_Govou> .
<http://wikidata.dbpedia.org/resource/Q1934> <http://www.w3.org/2002/07/owl#sameAs> <http://bg.dbpedia.org/resource/Сидни_Гову> .
<http://wikidata.dbpedia.org/resource/Q1934> <http://www.w3.org/2002/07/owl#sameAs> <http://ar.dbpedia.org/resource/سيدني_غوفو> .
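
A rough Python sketch of how such sameAs links can be derived from an entity's site links. It assumes (hypothetically) that the parser hands over a map from site keys such as `dawiki` to page titles; the function names are illustrative and keys that do not end in `wiki` are simply skipped.

```python
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RESOURCE_PREFIX = "http://wikidata.dbpedia.org/resource/"


def dbpedia_uri(site, title):
    """Map a site key and page title to a language-specific DBpedia URI."""
    if not site.endswith("wiki"):
        return None  # e.g. other sister projects, not handled in this sketch
    lang = site[: -len("wiki")].replace("_", "-")
    return f"http://{lang}.dbpedia.org/resource/{title.replace(' ', '_')}"


def same_as_triples(entity_id, sitelinks):
    """Emit one owl:sameAs triple per usable site link."""
    subject = RESOURCE_PREFIX + entity_id
    triples = []
    for site, title in sitelinks.items():
        target = dbpedia_uri(site, title)
        if target is not None:
            triples.append(f"<{subject}> <{SAME_AS}> <{target}> .")
    return triples
```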

updating WikidataJsonParser to allow wikidata Facts extraction
Creating wikidataFactsExtractor to extract Wikidata Facts Triples in the form

<http://wikidata.dbpedia.org/resource/Q1569> <http://www.wikidata.org/entity/P473> "040"@en .
<http://wikidata.dbpedia.org/resource/Q1569> <http://www.wikidata.org/entity/P625> "53 10"@en .
<http://wikidata.dbpedia.org/resource/Q1569> <http://www.wikidata.org/entity/P281> "20537"@en .
<http://wikidata.dbpedia.org/resource/Q1569> <http://www.wikidata.org/entity/P131> <http://wikidata.dbpedia.org/resource/Q1626> .
<http://wikidata.dbpedia.org/resource/Q1569> <http://www.wikidata.org/entity/P107> <http://wikidata.dbpedia.org/resource/Q618123> .
<http://wikidata.dbpedia.org/resource/Q549> <http://www.wikidata.org/entity/P574> "+00000001910-01-01T00:00:00Z"@en .
<http://wikidata.dbpedia.org/resource/Q549> <http://www.wikidata.org/entity/P18> "http://commons.wikimedia.org/wiki/File:HM_Orange_M_Sarawut.jpg"@en .
<http://wikidata.dbpedia.org/resource/Q549> <http://www.wikidata.org/entity/P373> "Betta splendens"@en .

WikiData-DBpedia-Dump-Release-v.0.1

important links :

week 11 [26-8-2013] - [1-9-2013] :

adding Wikidata namespace to the mapping wiki to allow using of wikidata:xx indicating wikidata entities
writing some command lines to get updated Mappings properties from live owl file
building documents for allowing community contribution of adding Mappings between Wikidata and DBpedia properties
adding mappings for 21 Wikidata properties
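
Reading the mapped properties back out of the ontology file can be sketched as below. This is a minimal Python example that assumes the OWL file is RDF/XML and that mappings appear as `owl:equivalentProperty` elements with an `rdf:resource` pointing at a Wikidata property; the function name and the substring check on `wikidata.org` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

OWL = "http://www.w3.org/2002/07/owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def equivalent_properties(owl_xml):
    """Return {wikidata property URI: DBpedia property URI} from RDF/XML text."""
    root = ET.fromstring(owl_xml)
    mapping = {}
    for element in root.iter():
        about = element.get(f"{{{RDF}}}about")
        if about is None:
            continue
        # Direct children only: the equivalences declared for this property.
        for equivalent in element.findall(f"{{{OWL}}}equivalentProperty"):
            target = equivalent.get(f"{{{RDF}}}resource")
            if target and "wikidata.org" in target:
                mapping[target] = about
    return mapping
```

Keying the result by the Wikidata property makes the lookup direct when a fact with that property is being mapped.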

important links :

week 12 [2-9-2013] - [8-9-2013] :

Re-implementing Quad Method to accept Wikidata (String) properties with unknown language

quads += new Quad(null , DBpediaDatasets.WikidataFacts, subjectUri, property ,fact , page.sourceUri, context.ontology.datatypes("xsd:string"))

updated Ontology Reader class to get properties/class mappings between WikiData and DBpedia
Wikidata Mapped Dump produced with Mapped properties for URI triples only
added NodeType to SimpleNode for each extractor to know type of data returned from parser (LL,Labels,MappedFacts,Facts)
updating JsonParser to return data for the Mapped extractor in nodes with its NodeType
updating WikidataMappedFactsExtractor to generate triples for Wikidata properties of Type globecoordinate in the form

<http://wikidata.dbpedia.org/resource/Q5689> <http://www.w3.org/2003/01/geo/wgs84_pos#lat> "60"^^<http://www.w3.org/2001/XMLSchema#float> .
<http://wikidata.dbpedia.org/resource/Q5689> <http://www.georss.org/georss/point> "POINT(60 20)" .
<http://wikidata.dbpedia.org/resource/Q5689> <http://www.w3.org/2003/01/geo/wgs84_pos#long> "20"^^<http://www.w3.org/2001/XMLSchema#float> .
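
The three triples per coordinate can be produced along these lines. The Python sketch below only mirrors the output format shown above; the function names are hypothetical and whole numbers are written without a trailing `.0` to match the samples.

```python
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"
GEORSS_POINT = "http://www.georss.org/georss/point"
XSD_FLOAT = "http://www.w3.org/2001/XMLSchema#float"
RESOURCE_PREFIX = "http://wikidata.dbpedia.org/resource/"


def number(value):
    """Format a number, dropping a trailing '.0' so that 60.0 becomes '60'."""
    text = repr(float(value))
    return text[:-2] if text.endswith(".0") else text


def coordinate_triples(entity_id, latitude, longitude):
    """Build the lat, point and long triples for one globecoordinate value."""
    subject = f"<{RESOURCE_PREFIX}{entity_id}>"
    lat, lon = number(latitude), number(longitude)
    return [
        f'{subject} <{GEO}lat> "{lat}"^^<{XSD_FLOAT}> .',
        f'{subject} <{GEORSS_POINT}> "POINT({lat} {lon})" .',
        f'{subject} <{GEO}long> "{lon}"^^<{XSD_FLOAT}> .',
    ]
```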

Adding Regex in DateTime Parser to parse Wikidata Time in the ISO8601 Format
updating WikidataMappedFactsExtractor to generate Triples with Wikidata Time facts , Mapped to DBpedia properties and DBpedia dataTypes

<http://wikidata.dbpedia.org/resource/Q41380> <http://dbpedia.org/ontology/deathDate> "1035-07-09"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://wikidata.dbpedia.org/resource/Q41380> <http://dbpedia.org/ontology/birthDate> "1000-06-28"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://wikidata.dbpedia.org/resource/Q40512> <http://dbpedia.org/ontology/deathDate> "1986-09-07"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://wikidata.dbpedia.org/resource/Q40512> <http://dbpedia.org/ontology/birthDate> "1914-09-23"^^<http://www.w3.org/2001/XMLSchema#date> .
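
The kind of pattern needed for Wikidata time values can be sketched as follows. This Python example is not the regex added to the DateTime parser; it only shows the idea for values such as `+00000001910-01-01T00:00:00Z`. The function name is hypothetical, values with month or day `00` (lower precision) are rejected, and differences in year numbering for dates before the common era are not handled.

```python
import re

# Sign, zero-padded year, month, day, then a time part that is ignored.
WIKIDATA_TIME = re.compile(r"^([+-])(\d+)-(\d{2})-(\d{2})T\d{2}:\d{2}:\d{2}Z$")


def to_xsd_date(value):
    """Convert a Wikidata time string to an xsd:date lexical form, or None."""
    match = WIKIDATA_TIME.match(value)
    if match is None:
        return None
    sign, year, month, day = match.groups()
    if month == "00" or day == "00":
        return None  # year- or month-precision values have no full date
    prefix = "-" if sign == "-" else ""
    # int() strips the long zero padding; the year is re-padded to four digits.
    return f"{prefix}{int(year):04d}-{month}-{day}"
```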

important links :

update ontologyReader class commits : 1, 2
Discussion about using either Wikidata URIs or DBpedia new namespace for them
DateTime format for Wikidata Properties
Update DateTime parser Commit

week 13 [9-9-2013] - [15-9-2013] :

updated WikidataMappedFactsExtractor to generate MappedFacts for Wikidata properties of datatype Commons media file and String in the form :

<http://wikidata.dbpedia.org/resource/Q7194> <http://dbpedia.org/ontology/imageFlag> <http://commons.wikimedia.org/wiki/File:Flag_of_Girona_province_(unofficial).svg> .
<http://wikidata.dbpedia.org/resource/Q5772> <http://dbpedia.org/ontology/imageFlag> <http://commons.wikimedia.org/wiki/File:Flag_of_the_Region_of_Murcia.svg> .
<http://wikidata.dbpedia.org/resource/Q9465> <http://dbpedia.org/ontology/individualisedGnd> "4015602-3" .
<http://wikidata.dbpedia.org/resource/Q9957> <http://dbpedia.org/ontology/individualisedGnd> "118998935" .

important links :

no Wikidata properties Yet for "numbers" and "numbers for units".

End of GSoC2013 period

Refactoring the core to accept new formats :

change Extractor Trait to accept [T] type argument [see commit]

change all existing Extractors to accept type PageNode
change functions in config.scala to load Extractors of type 'any'
check compositeExtractor.scala to check for Extractor Type
run and check that update works fine
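
The idea behind the type parameter can be illustrated with a loose Python analogue of the Scala trait. All class names except PageNode are invented for this sketch, and Python only checks these types with an external type checker, not at run time.

```python
from typing import Generic, List, TypeVar

T = TypeVar("T")


class Extractor(Generic[T]):
    """An extractor parameterised by the kind of input node it accepts."""

    def extract(self, page: T) -> List[str]:
        raise NotImplementedError


class PageNode:
    """Stand-in for a parsed wiki page (hypothetical, title only)."""

    def __init__(self, title: str) -> None:
        self.title = title


class TitleExtractor(Extractor[PageNode]):
    """Example extractor that only accepts PageNode input."""

    def extract(self, page: PageNode) -> List[str]:
        return [page.title]


class CompositeExtractor(Extractor[T]):
    """Runs several extractors of the same input type and joins the results."""

    def __init__(self, *extractors: Extractor[T]) -> None:
        self.extractors = list(extractors)

    def extract(self, page: T) -> List[str]:
        results: List[str] = []
        for extractor in self.extractors:
            results.extend(extractor.extract(page))
        return results
```

With the input type made explicit, existing extractors keep working on PageNode while extractors for other input formats can share the same composite and loading code.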
