
How to set up a local DBPedia Live mirror

This post describes how you can set up a local DBPedia Live mirror on your own machine, based on the official DBPedia downloads.

What is DBPedia?

DBPedia aims to extract structured information from Wikipedia, an unstructured data source that millions of people use every day to look up current information. (Research shows that Wikipedia is regularly one of the fastest-updated information sources when any kind of event happens.)

Wikipedia itself is a wiki-based system, so anybody can insert any kind of unstructured information. Sometimes, however, you need structured information, e.g. a list of locations or persons: this is where DBPedia comes into play.

DBPedia uses configurable information mappings to extract RDF-formatted information from the Wikipedia infoboxes. This extraction happens every 3-4 months; in the meantime the DBPedia content does not change.

(DBPedia is a project managed by the University of Leipzig, the Free University of Berlin and OpenLink Software. Find more information in the DBPedia Imprint.)

What is DBPedia Live?

DBPedia Live addresses the need for more current information based on the Wikipedia sources: it applies the information mapping framework to recently changed Wikipedia articles and updates the extracted information accordingly.

DBPedia Live itself provides a SPARQL endpoint, but for your own use cases and applications you might want a SPARQL endpoint of your own.

The DBPedia team provides a synchronization tool to keep local mirrors in sync with the “mothership”: the tool downloads additions, changes and deletions and applies them to your local RDF store.

Which infrastructure is needed?

I installed the software on a virtual Ubuntu Server 11.10 (64-bit version); the virtual machine has 100GB of storage. (Allocate a little more space if you can.)
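Before starting a multi-day import it can help to check how much disk space is actually free. The following is a minimal sketch using only the Python standard library; the function name and the 60 GiB threshold are assumptions for illustration, not a DBPedia or Virtuoso recommendation.

```python
import shutil

GIB = 1024 ** 3  # bytes per gibibyte


def has_enough_space(path: str, required_gib: float) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gib` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gib * GIB


if __name__ == "__main__":
    # 60 GiB is an assumed safety margin; adjust it to your own setup.
    print(has_enough_space(".", 60))
```

Pass the directory that will hold the dump and the database files, since those may live on a different partition than the current directory.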

Installation steps

  1. After installing Ubuntu Server, first install the OpenLink Virtuoso Server (the non-cluster open source version, which is free). Pay attention to the note at “Package Contents and Layout”: the script provided by DBPedia calls the “isql” command, which needs to be replaced with “/usr/bin/isql-vt”.
  2. Start downloading the initial data seed from DBPedia Live at http://live.dbpedia.org/dumps/: download the latest file that has a date in its name. (The download is ~2.7GB zipped, so a download manager such as KGet on Ubuntu is a good idea; unzipped, the file takes about 30GB on your disk.)
  3. Download the synchronization tool from http://sourceforge.net/projects/dbpintegrator/files/ . It includes the virtload.sh script, which is essential for loading the DBPedia content into the local RDF store.
  4. The virtload.sh script does the following:
    • it unzips the DBPedia file if needed,
    • if the file is large, it slices it into small pieces of 50k lines of text each (if you want to use the file for e.g. another RDF store, make sure to keep the lines intact: the N3 format of the triples is broken if text lines get split or mixed up),
    • then it uploads each sliced file.
    • Take care to adapt the script at least regarding the “isql-vt” issue mentioned before.
  5. Start the script with “virtload.sh <triple-file> <graph name, e.g. http://live.dbpedia.org> <port, usually 1111> <username, e.g. dba> <password>”.
    • the graph name allows you to structure the namespace in the RDF store,
    • the standard username is dba,
    • the password for the dba user is assigned during the Virtuoso install process.
  6. Wait…

  I am currently at step 6 ;-) The machine has been running for 1.5 days and has completed approx. 25% of the task. To avoid interruptions, make sure the process can run day and night: the script can only be executed completely in one go. Also make sure the hard disk has some free space at all times.
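For a rough idea of how long the wait will be, the progress figures above can be extrapolated. This is a back-of-the-envelope sketch that assumes the import proceeds at a constant rate, which may not hold in practice; the function name is made up for illustration.

```python
def estimate_remaining_days(elapsed_days: float, fraction_done: float) -> float:
    """Extrapolate the remaining run time, assuming a constant import rate."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in the range (0, 1]")
    total_days = elapsed_days / fraction_done
    return total_days - elapsed_days


if __name__ == "__main__":
    # 1.5 days elapsed, about 25% done
    print(estimate_remaining_days(1.5, 0.25))  # prints 4.5
```

Under that assumption the whole import takes about 6 days, of which 4.5 are still to go.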

  7. Adapt the synchronization tool's configuration files according to the readme.txt files.

  8. Start the synchronization tool and run it regularly, e.g. as a shell script triggered by a cron entry.
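If you want to slice the dump yourself, e.g. for another RDF store as mentioned in step 4, the idea of cutting only at line boundaries can be sketched as follows. This illustrates the principle and is not the code used by virtload.sh; the function name and the “.partNNNNN” naming scheme are assumptions.

```python
from pathlib import Path
from typing import List


def slice_file(source: Path, out_dir: Path, lines_per_chunk: int = 50_000) -> List[Path]:
    """Split `source` into chunks of at most `lines_per_chunk` whole lines each."""
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks: List[Path] = []
    out = None
    # Binary mode keeps every line byte-for-byte intact, including its line ending.
    with source.open("rb") as src:
        for index, line in enumerate(src):
            if index % lines_per_chunk == 0:
                # Start a new chunk file; a triple is never split across files.
                if out is not None:
                    out.close()
                chunk_path = out_dir / f"{source.name}.part{len(chunks):05d}"
                out = chunk_path.open("wb")
                chunks.append(chunk_path)
            out.write(line)
    if out is not None:
        out.close()
    return chunks
```

Because a chunk only ever ends after a complete line, concatenating the chunks in order reproduces the original file exactly.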

Even while the import is still running you can use the SPARQL endpoint, e.g. to look up information about Berlin (or http://dbpedia.org/resource/Berlin in DBPedia speak).
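Such a lookup can be sketched with the Python standard library. The endpoint URL http://localhost:8890/sparql is an assumption (Virtuoso's usual default; adjust it to your installation), the graph name is the example from step 5, and the function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://localhost:8890/sparql"  # assumed default, adjust to your setup
GRAPH = "http://live.dbpedia.org"          # graph name passed to virtload.sh


def build_query(resource: str, graph: str = GRAPH, limit: int = 10) -> str:
    """Build a SPARQL query listing properties and values of one resource."""
    return (
        f"SELECT ?p ?o FROM <{graph}> "
        f"WHERE {{ <{resource}> ?p ?o }} LIMIT {limit}"
    )


def build_request(query: str, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Wrap the query in an HTTP GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(query: str, endpoint: str = ENDPOINT) -> list:
    """Send the query and return the list of result bindings."""
    with urllib.request.urlopen(build_request(query, endpoint), timeout=30) as response:
        return json.load(response)["results"]["bindings"]
```

With the local server running, run_query(build_query("http://dbpedia.org/resource/Berlin")) should return up to ten property/value pairs for Berlin, provided those triples have already been imported.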

Special kudos for putting together the very helpful DBPIntegrator package go to Mohamed Morsey from the Leipzig University staff.
