-
@yakagami There is a QLever instance for DBpedia on https://qlever.cs.uni-freiburg.de/dbpedia . It takes only around one hour to build. I have three questions:
-
Ah. I didn't think to check whether DBpedia was already available, only Wikipedia. Regarding 1 and 2: yes, I find their site very confusing. As another example, they have DBpedia Live, which has an endpoint here that has seemingly been down for over a year.
I will try to take a look sometime to see what is necessary to extract just the infobox data from Wikipedia and convert it to RDF.
-
Also note that the English Wikipedia is already part of our Wikidata instance. See "Index Information" on https://qlever.cs.uni-freiburg.de/wikidata . To show what's possible, here is an example query that finds all mentions of an astronaut from Wikidata in a sentence from Wikipedia: https://qlever.cs.uni-freiburg.de/wikidata/UmCdKa . You can combine arbitrary SPARQL searches on Wikidata with arbitrary keyword queries on Wikipedia that way. If you have particular use cases in mind, let us know.
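For reference, here is roughly what such a combined SPARQL+Text query looks like when sent programmatically. This is only a sketch, not the exact linked query: the keyword "moon" and the LIMIT are placeholders, and the API path and `ql:` prefix reflect my understanding of QLever's conventions.

```python
import requests

# QLever's Wikidata endpoint (assumed API path for the instance above).
ENDPOINT = "https://qlever.cs.uni-freiburg.de/api/wikidata"

# SPARQL + text search: find text records that mention an astronaut
# (Wikidata occupation Q11631) together with the word "moon".
# ql:contains-entity / ql:contains-word are QLever's text-search
# predicates; the keyword is just a placeholder.
QUERY = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX ql: <http://qlever.cs.uni-freiburg.de/builtin-functions/>
SELECT ?astronaut ?text WHERE {
  ?astronaut wdt:P106 wd:Q11631 .
  ?text ql:contains-entity ?astronaut .
  ?text ql:contains-word "moon" .
}
LIMIT 10
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
for row in response.json()["results"]["bindings"]:
    print(row["astronaut"]["value"], "->", row["text"]["value"])
```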
-
This instance is updated once per week. But it indexes the text, not the infoboxes. Turning the infoboxes into triples looks rather straightforward to me. And in my understanding, it should be a (small) part of DBpedia. Maybe you can find out which DBpedia files cover this. If DBpedia is no longer updated frequently, it should also be relatively easy to extract the infoboxes from the Wikipedia articles oneself. Wikipedia is not that large (and much smaller than Wikidata, as far as sheer data size is concerned).
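To make "rather straightforward" concrete, here is a minimal sketch of that idea, assuming the third-party mwparserfromhell library and made-up example.org URIs (this is not DBpedia's actual template-to-ontology mapping):

```python
import mwparserfromhell

def infobox_to_triples(page_title, wikitext):
    """Yield crude (subject, predicate, object) triples from the first
    infobox template in an article's wikitext. Only a sketch: DBpedia's
    real extractor also normalizes units, dates, and links, and maps
    template fields onto a curated ontology."""
    code = mwparserfromhell.parse(wikitext)
    for template in code.filter_templates():
        if not str(template.name).strip().lower().startswith("infobox"):
            continue
        subject = f"<http://example.org/resource/{page_title.replace(' ', '_')}>"
        for param in template.params:
            key = str(param.name).strip().replace(" ", "_")
            value = param.value.strip_code().strip()
            if value:
                # Naive literal quoting; a real extractor would escape
                # quotes and type the literals.
                yield subject, f"<http://example.org/property/{key}>", f'"{value}"'
        break  # first infobox only

# Tiny usage example with an inline wikitext snippet:
sample = "{{Infobox person|name=Neil Armstrong|birth_date=August 5, 1930}}"
for s, p, o in infobox_to_triples("Neil Armstrong", sample):
    print(s, p, o, ".")
```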
-
Seems like https://github.com/dbpedia/extraction-framework/blob/587d999f1b92221605b3c27d9c930ef12ab4aed1/core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala is part of it. I will continue to look into this topic. Thanks for your input.
-
I think Wikipedia would be a great example for QLever. Besides unstructured text, Wikipedia has infoboxes that contain structured data according to categorical templates. One can extract this data into RDF records, similar to Wikidata. Currently DBpedia does this, and it takes "around 4-7 days" to convert the raw XML from the Wikipedia dumps to their RDF database. They have a SPARQL endpoint here. The database, as I understand it, is updated only every 3 months, whereas the Wikipedia dumps come out every two weeks. Text search is obviously slow/nonexistent. Adding this would certainly be more involved/difficult to set up, but it should pay off, both to demonstrate QLever's indexing and query speed and to provide a knowledge source for users.
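For a feel of what the infobox-derived RDF looks like on DBpedia's side today, here is a hedged sketch that asks their public endpoint for the raw infobox properties (the dbp: namespace, produced by the generic infobox extractor) of one arbitrary resource:

```python
import requests

# DBpedia's public Virtuoso endpoint. The http://dbpedia.org/property/
# namespace holds the "raw" key-value pairs lifted from infoboxes.
QUERY = """
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?property ?value WHERE {
  dbr:Neil_Armstrong ?property ?value .
  FILTER(STRSTARTS(STR(?property), "http://dbpedia.org/property/"))
}
LIMIT 20
"""

response = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": QUERY, "format": "application/sparql-results+json"},
)
for row in response.json()["results"]["bindings"]:
    print(row["property"]["value"], "=", row["value"]["value"])
```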