Skip parse error #1583

Closed
debayan opened this issue Oct 24, 2024 · 6 comments
Comments

@debayan

debayan commented Oct 24, 2024

Is there an option in QLever to skip lines that produce parse errors during indexing?

@hannahbast
Member

@debayan Can you give an example of the kind of line you would like to skip?

Roughly speaking, there are two sorts of errors in RDF input:

  1. Errors that, strictly speaking, violate the standard, but are kind of OK to accept. For example, an IRI that contains a space. QLever currently outputs a WARNing for those, but accepts them.

  2. Errors that should really be fixed by the producers of the dataset because they point to a deeper problem. For example, an N-Triples file containing a line where the object is missing or where the closing " of a literal is missing.

@debayan
Author

debayan commented Oct 28, 2024

@hannahbast

My log shows something like:

INFO: By default, integers that cannot be represented by QLever will throw an exception
INFO: Parsing input triples and creating partial vocabularies, one per batch ...
ERROR: Parse error at byte position 6388106517: Parse error at byte position 6388106517: Value 400.000 could not be parsed as an integer value

I get this when parsing ttl files from https://downloads.dbpedia.org/repo/lts/wikidata/. This is not the first error I got. I fixed several such errors already (not just integer errors), and I do not know how many other errors exist in this data dump. Since this is a large amount of data, I would rather just skip all such errors and add the dump to my DB.
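
The integer error in the log above can be illustrated with a small pre-filter that runs outside QLever. This is only a minimal sketch, not a QLever option: it assumes line-oriented input (one triple per line, as in N-Triples; Turtle statements can span several lines, which this does not handle) and that the literals are typed explicitly as `xsd:integer`. The names `has_bad_integer` and `filter_lines` are hypothetical helpers introduced for illustration. The check relies on the lexical form of `xsd:integer`, which is an optional sign followed by digits, so a value such as `400.000` is rejected.

```python
import re

# Lexical form of xsd:integer: optional sign followed by one or more digits.
XSD_INTEGER = re.compile(r"[+-]?[0-9]+")

# Assumption: the literal is typed explicitly as xsd:integer, written either
# with the full IRI or with the conventional xsd: prefix.
TYPED_INTEGER = re.compile(
    r'"([^"\\]*)"\^\^(?:<http://www\.w3\.org/2001/XMLSchema#integer>|xsd:integer)'
)


def has_bad_integer(line: str) -> bool:
    """Return True if the line has an xsd:integer literal with an invalid lexical form."""
    return any(
        XSD_INTEGER.fullmatch(match.group(1)) is None
        for match in TYPED_INTEGER.finditer(line)
    )


def filter_lines(lines):
    """Yield only the lines that contain no malformed xsd:integer literal."""
    for line in lines:
        if not has_bad_integer(line):
            yield line
```

Applied to a file, `filter_lines(open("input.nt", encoding="utf-8"))` would yield the lines to keep; the file name is a placeholder. Dropping lines silently hides data-quality problems, so it is probably worth logging what gets skipped.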

@hannahbast
Member

@debayan Have you validated the files? Here is a command to do that. Does it only produce warnings or also errors?

docker run -i --rm -v $(pwd):/data stain/jena riot --validate /data/filename.ttl

And may I ask why you want to use DBpedia? It's old, not well maintained anymore, and of really doubtful quality. I don't think there is anything useful in DBpedia that is not contained in one of the more modern knowledge graphs, notably Wikidata.

@debayan
Author

debayan commented Oct 28, 2024

@hannahbast I have not validated the files, but I know they contain erroneous lines. I am using DBpedia because we are working on a task where queries for one KG need to be translated into queries that work on another KG. The only dataset of reasonable size that we could find is LC-QuAD 2.0, which has queries for both KGs for each question.

@hannahbast
Member

@debayan This has come up again in ad-freiburg/qlever-control#103, and I have looked at it again because I was working on the parser anyway. If it's still relevant for you (or will be again at some point), please join the conversation over there.

@debayan
Author

debayan commented Dec 13, 2024

Thanks @hannahbast

@debayan debayan closed this as completed Dec 13, 2024
2 participants