Generation w/ TDB for huge (~15GB) MySQL Database on small machine #24
Hi, Thanks for noting this. Indeed, rr:join increases RAM usage by a lot and is best avoided for large datasets, as you very correctly do: place the joins in the queries. The amount of RAM available to the tool can be set as a parameter to the Java environment. Did you try increasing the Xms and Xmx parameters, or fiddling with JVM GC tuning? Also, is your JVM 32- or 64-bit? In the stack trace, where does the exception occur? Could you please copy-paste it or the log file output? Best,
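For reference, those heap settings go on the command line that launches the tool; a sketch like the following (the jar name is only a placeholder for however you start the parser) gives the JVM a 6GB heap ceiling and disables the specific GC-overhead check that triggers this particular error:

```sh
java -Xms2g -Xmx6g -XX:-UseGCOverheadLimit -jar r2rml-parser.jar
```

Disabling the overhead limit only postpones a plain OutOfMemoryError if the model really does not fit, so raising -Xmx is the part that matters.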
Hi, Maybe that helps.
Hi both, I suppose I might also have to optimize the template: the first part of it is still a bit ugly, as I did not have much experience at the time. I get this exception.
@chrissofvista Thanks everyone and have a nice day
Another idea would be to run the tool remotely, from a more capable machine: the tool would establish a JDBC connection and dump the contents onto the other machine. This would be slower, though. Another idea would be, as @chrissofvista suggests, the use of limits. Then you could export a number of small RDF files, which could subsequently be loaded into a TDB store using the Jena command-line tools [1]. If you go down that road, I would suggest setting jena.destinationFileSyntax to N-TRIPLE. Unlike RDF/XML and TTL, N-TRIPLE does not try to pretty-print the result, and thus consumes far fewer resources.
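As a rough sketch of that second route (the TDB directory and file names are placeholders), the exported N-TRIPLE files can then be bulk-loaded into a TDB location with Jena's tdbloader:

```sh
tdbloader --loc=/data/tdb dump-part1.nt dump-part2.nt
```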
Remember, you are generating triples, hopefully within a named graph. You do not have to care about merging the result files, since each of them simply represents a certain amount of triples. They are all merged when they are imported into your triple store of choice, where they are identified by their URI and their graph. Cheers
I was wondering if there is something like an internal array used to collect the parser's results, and whether it would be possible to clear it after, let's say, 1,000 entries in order to keep the app's memory usage small?
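Something along those lines is possible with the Jena API itself; the snippet below is only an illustration of the idea (made-up data and class name, not the parser's actual internals): it appends the in-memory model to an N-TRIPLES file and clears it every 1,000 statements, so the buffer never grows beyond the batch size.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.ResourceFactory;

import java.io.FileOutputStream;
import java.io.OutputStream;

public class BatchedDump {
    static final int BATCH_SIZE = 1000; // flush after this many statements

    public static void main(String[] args) throws Exception {
        Model buffer = ModelFactory.createDefaultModel();
        try (OutputStream out = new FileOutputStream("dump.nt", true)) {   // append mode
            for (int i = 0; i < 1_000_000; i++) {                          // stand-in for rows fetched over JDBC
                buffer.add(ResourceFactory.createResource("http://example.org/row/" + i),
                           ResourceFactory.createProperty("http://example.org/p"),
                           ResourceFactory.createPlainLiteral("value " + i));
                if (buffer.size() >= BATCH_SIZE) {
                    buffer.write(out, "N-TRIPLE");   // flush this batch; N-Triples is line-based, so appending is safe
                    buffer.removeAll();              // drop the flushed statements from memory
                }
            }
            buffer.write(out, "N-TRIPLE");           // flush whatever is left
        }
    }
}
```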
With the current implementation, the whole model has to be loaded into memory before it is flushed out to disk. I am not sure how this could be sliced into, e.g., logical table mappings. Regarding garbage collection, the idea I had in mind was to see whether fiddling with any of the Java VM parameters (see for instance [1] or [2]) would allow larger models to be generated without an out-of-memory exception.
Would it be possible to add a limit in the mapping file, so that the R2RML tool converts in loops of X rows and creates separate RDF models from each loop or from several loops? Then the size of the database would not matter anymore.
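As far as this thread shows there is no built-in batching, but since a logical table can be an arbitrary SQL query, a manual workaround is to run the tool several times with a paged query in the mapping, raising the offset on each run. A hypothetical fragment (table, columns, and template are made up) could look like:

```turtle
<#PersonMap>
    rr:logicalTable [ rr:sqlQuery """
        SELECT id, name
        FROM person
        ORDER BY id
        LIMIT 100000 OFFSET 0
    """ ] ;
    rr:subjectMap [ rr:template "http://example.org/person/{id}" ] .
```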
We are trying to load a huge MySQL database of around 15GB into a TDB model using this tool.
However, after a few minutes of running we get an Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded. We are working on a machine with 8GB of RAM.
How could we possibly get all the triples stored without having to increase the RAM to stratospheric levels? Is it possible to store the data gradually in TDB's binary tree?
Are we doing something wrong?
At the moment we are barely using the rr:join functionality and are executing the joins directly in the SQL statements. Is this bad practice and the cause of the problem?
Thanks a lot