Extraction Instructions

jimkont edited this page Feb 19, 2013 · 33 revisions

Download

$ git clone git://github.com/dbpedia/extraction-framework.git

or use this page for more detailed instructions

Dump-Based Extraction

In the root directory, run the following commands:

$ mvn clean install # Compiles the code
$ cd dump
$ ../run download config=download-config-file # Downloads the Wikipedia dumps
$ ../run extraction extraction-config-file # Extracts triples from the downloaded dumps

For download-config-file and extraction-config-file, you can either reuse existing files from the repository or adapt them to your needs.
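
The steps above can be chained so that a failed build or download stops the run before extraction starts. The following is a minimal sketch, not part of the framework: `run_step`, `dump_extraction` and the `DRY_RUN` variable are hypothetical names introduced here, and the two arguments stand for whichever config files you chose.

```shell
# Sketch only: run_step, dump_extraction and DRY_RUN are hypothetical helpers,
# not part of the extraction framework.

# Print the command when DRY_RUN=1, otherwise execute it.
run_step() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1: download config file, $2: extraction config file.
# The body runs in a subshell, so the 'cd dump' does not leak into the caller,
# and the && chain stops at the first failing step.
dump_extraction() (
    run_step mvn clean install &&
    run_step cd dump &&
    run_step ../run download "config=$1" &&
    run_step ../run extraction "$2"
)
```

Calling `DRY_RUN=1 dump_extraction download-config-file extraction-config-file` from the root directory prints the four commands without running them. Dropping `DRY_RUN=1` executes them in order.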

Abstract Extraction

Abstracts are not generated by the Simple Wiki Parser; they are produced by a local Wikipedia clone using a modified MediaWiki installation.

To generate clean abstracts from Wikipedia articles, wiki templates must be rendered as they would be in the original Wikipedia instance. The DBpedia Abstract Extractor therefore requires a running MediaWiki instance with Wikipedia data in a MySQL database.

To install and start the MySQL server, you can use dump/src/main/bash/mysql.sh.

To import the data, you need to run the Scala 'import' launcher.

First, adapt the settings for the 'import' launcher in dump/pom.xml:

<arg>/home/release/wikipedia</arg><!-- path to folder containing Wikipedia XML dumps -->
<arg>/home/release/data/projects/mediawiki/core/maintenance/tables.sql</arg><!-- file containing MediaWiki table definitions -->
<arg>localhost</arg><!-- MySQL host:port - localhost should work if you use mysql.sh -->
<arg>true</arg><!-- require-download-complete -->
<arg>10000-</arg><!-- languages and article count ranges, comma-separated, e.g. "en,de" -->
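
Before launching the import, it can help to confirm that the two file system paths configured above exist. The following is a minimal sketch: `check_import_paths` is a hypothetical helper introduced here, not something shipped with the framework.

```shell
# Sketch only: check_import_paths is a hypothetical helper,
# not part of the extraction framework.
# $1: folder containing the Wikipedia XML dumps
# $2: file containing the MediaWiki table definitions (tables.sql)
check_import_paths() {
    rc=0
    if [ ! -d "$1" ]; then
        echo "missing dump folder: $1" >&2
        rc=1
    fi
    if [ ! -f "$2" ] || [ ! -r "$2" ]; then
        echo "missing or unreadable tables file: $2" >&2
        rc=1
    fi
    return "$rc"
}
```

With the example values from the snippet above, the call would be `check_import_paths /home/release/wikipedia /home/release/data/projects/mediawiki/core/maintenance/tables.sql`. It returns a non-zero status and prints a message to stderr for each path that is missing.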

Then cd to dump/ and call:

$ ../run import

This should import all the templates into the MySQL tables.

To set up the local Wikipedia instance, you have to use the modified MediaWiki code from here: https://github.com/dbpedia/dbpedia/tree/master/abstractExtraction and configure it to listen to the URL from here

Core Module

Download Ontology

To download a fresh copy of the DBpedia ontology from the mappings wiki, use the following commands:

$ cd ../core
$ ../run download-ontology