DBpedia Indexing Fails #103

Open
mkyl opened this issue Dec 10, 2024 · 4 comments

Comments

mkyl commented Dec 10, 2024

Hello again,

I am now trying to set up a QLever instance running DBpedia, after successfully doing this for Wikidata with your help.

moe@blackbird:~/dbpedia$ qlever get-data
Command: get-data

curl -X POST -H "Accept: text/csv" --data-urlencode "query=$(curl -s -H "Accept:text/sparql" https://databus.dbpedia.org/dbpedia/collections/latest-core)" https://databus.dbpedia.org/sparql | tail -n+2 | sed 's/\r$//' | sed 's/"//g' | while read -r file; do wget -P rdf-input $file; done

Download successful, total file size: 14,383,708,584 bytes

The download appears successful. However, when I try to build the index, it fails:

moe@blackbird:~/dbpedia$ qlever index

Command: index

echo '{ "ascii-prefixes-only": true, "num-triples-per-batch": 1000000, "prefixes-external": [""] }' > dbpedia.settings.json
docker run --rm -u $(id -u):$(id -g) -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -w /index --init --entrypoint bash --name qlever.index.dbpedia docker.io/adfreiburg/qlever:latest -c 'ulimit -Sn 1048576; (cat rdf-input/*.nt; lbzcat -n2 rdf-input/*.bzip2 rdf-input/*.bz2) | IndexBuilderMain -i dbpedia -s dbpedia.settings.json -F ttl -f - --stxxl-memory 5G | tee dbpedia.index-log.txt'

2024-12-09 23:36:57.921 - INFO: QLever IndexBuilder, compiled on Sat Nov 16 16:25:12 UTC 2024 using git hash 39ca68
2024-12-09 23:36:57.922 - INFO: Locale was not specified in settings file, default is en_US
2024-12-09 23:36:57.922 - INFO: You specified "locale = en_US" and "ignore-punctuation = 0"
2024-12-09 23:36:57.922 - INFO: You specified "ascii-prefixes-only = true", which enables faster parsing for well-behaved TTL files
2024-12-09 23:36:57.922 - INFO: You specified "num-triples-per-batch = 1,000,000", choose a lower value if the index builder runs out of memory
2024-12-09 23:36:57.922 - INFO: By default, integers that cannot be represented by QLever will throw an exception
2024-12-09 23:36:57.922 - WARN: Implicitly using the parallel parser for a single input file for reasons of backward compatibility; this is deprecated, please use the command-line option --parse-parallel or -p
2024-12-09 23:36:57.922 - INFO: Parsing triples from single input stream /dev/stdin (parallel = true) ...
2024-12-09 23:36:57.923 - INFO: Parsing input triples and creating partial vocabularies, one per batch ...
2024-12-09 23:36:58.727 - ERROR:  Parse error at byte position 45950791: Unterminated IRI reference (found '<' but no '>' before one of the following characters: <, ", newline)
The next 500 bytes are:
<http://dbpedia.org/class/yago/WikicatAlbumsProducedByNoah"40"Shebib> .
<http://dbpedia.org/resource/Tyga> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/class/yago/WikicatAlbumsProducedByT-Minus> .
<http://dbpedia.org/resource/Tyga> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/class/yago/WikicatEnglish-languageCompilationAlbums> .
<http://dbpedia.org/resource/Tyga> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/class/yago/Wi

Are some of the files perhaps truncated? I deleted the downloaded files and retried the download, but I still got the same error.

Thank you for offering support for this program,
Moe Kayali
PhD Student
Database Group, University of Washington, Seattle
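
One way to test the truncation hypothesis is to decompress every archive to its end, since a truncated bzip2 stream fails before the end-of-stream marker is reached. The following is a minimal Python sketch, assuming the archives sit in the `rdf-input` directory used by the `get-data` command above. The function names `is_complete_bz2` and `find_broken_archives` are hypothetical and not part of QLever.

```python
import bz2
from pathlib import Path


def is_complete_bz2(path, chunk_size=1 << 20):
    """Return True if the archive decompresses all the way to its end."""
    try:
        with bz2.open(path, "rb") as f:
            # Read and discard the data chunk by chunk to keep memory use low.
            while f.read(chunk_size):
                pass
        return True
    except (EOFError, OSError):
        # EOFError: stream ended early (truncated); OSError: invalid data.
        return False


def find_broken_archives(directory="rdf-input"):
    """List the names of all *.bz2 files in `directory` that fail the check."""
    return sorted(
        p.name for p in Path(directory).glob("*.bz2") if not is_complete_bz2(p)
    )
```

An empty list from `find_broken_archives()` would suggest that the downloads are intact and that the parse error has a different cause.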

@hannahbast
Member

@mkyl The Turtle standard very clearly says that IRI references (that is, <...>) must not contain quotes: https://www.w3.org/TR/turtle/#grammar-production-IRIREF . Yet the DBpedia data is full of IRI references with quotes.

Are you able to compile QLever yourself? If so, I can give you a patch that relaxes the parsing so that QLever accepts the invalid IRI references.
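
The IRIREF rule linked above can be written as a regular expression to check individual tokens. The following is a minimal Python sketch of that grammar production, not QLever's parser; the name `is_valid_iriref` is hypothetical.

```python
import re

# Turtle grammar production IRIREF:
#   '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
# where UCHAR is \uXXXX or \UXXXXXXXX (hexadecimal digits).
IRIREF = re.compile(
    r'<(?:[^\x00-\x20<>"{}|^`\\]|\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})*>'
)


def is_valid_iriref(token: str) -> bool:
    """Return True if `token` is a syntactically valid Turtle IRIREF."""
    return IRIREF.fullmatch(token) is not None


# A well-formed IRI reference from the log excerpt above.
ok = is_valid_iriref(
    "<http://dbpedia.org/class/yago/WikicatAlbumsProducedByT-Minus>"
)
# The IRI reference at which the parser stopped: it contains double quotes.
bad = is_valid_iriref(
    '<http://dbpedia.org/class/yago/WikicatAlbumsProducedByNoah"40"Shebib>'
)
print(ok, bad)
```

The second token is rejected because `"` is one of the characters the production excludes.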

Dakantz commented Dec 12, 2024

Hey,

I have encountered the same error - would it be possible to publish the patch somewhere as a branch?

Best,
Benedikt

EDIT: I have imported the DBpedia data into a TDB2 database, exported it (which got rid of the offending triples), and am importing the data into QLever at the moment. @mkyl I could provide the dump if you are interested; it might be missing a few links but seems to be OK otherwise.

@hannahbast
Member

@Dakantz I have already opened a PR that solves the problem: ad-freiburg/qlever#1672 . With this PR, DBpedia can be loaded without problems.

This still needs a little cleanup and a corresponding command-line option (in particular, to suppress the "non-compliant IRI" warnings, of which there would be far too many for a dataset like DBpedia). We will eventually merge this into master.

Author

mkyl commented Dec 13, 2024

Thanks @hannahbast and @Dakantz. I was trying to see if I could write a script to clean up the triples, but upon a first look I mainly found a large number of .bz2 archives in the rdf-input directory.
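
A cleanup script of the kind mentioned here could stream each `.bz2` archive and drop every line that is not a well-formed triple. The following is a minimal Python sketch, assuming one triple per line (as in the error excerpt above). The regular expressions only approximate the N-Triples grammar, and the names `is_clean_line` and `filter_archive` are hypothetical. This is not the approach of the PR mentioned above, which relaxes the parser instead of removing data.

```python
import bz2
import re

# IRIREF as defined in the Turtle grammar: no whitespace, <, >, ", {, }, |, ^, `, \
IRI = r'<(?:[^\x00-\x20<>"{}|^`\\]|\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})*>'
BNODE = r'_:\S+'  # simplified blank node label
# Quoted string with optional datatype or language tag (simplified).
LITERAL = (r'"(?:[^"\\\n\r]|\\.)*"(?:\^\^' + IRI
           + r'|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?')
TRIPLE = re.compile(
    rf'\s*(?:{IRI}|{BNODE})\s+{IRI}\s+(?:{IRI}|{BNODE}|{LITERAL})\s*\.\s*'
)


def is_clean_line(line: str) -> bool:
    """Keep blank lines, comments, and lines that match the triple pattern."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return True
    return TRIPLE.fullmatch(stripped) is not None


def filter_archive(src, dst):
    """Copy clean lines from archive `src` to archive `dst`.

    Returns the pair (kept, dropped) of line counts.
    """
    kept = dropped = 0
    with bz2.open(src, "rt", encoding="utf-8") as fin, \
            bz2.open(dst, "wt", encoding="utf-8") as fout:
        for line in fin:
            if is_clean_line(line):
                fout.write(line)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

A call such as `filter_archive("rdf-input/example.ttl.bz2", "clean/example.ttl.bz2")` (both paths are placeholders) would write a filtered copy and report how many lines were removed, which also gives a rough measure of how much data is lost.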
