
Generation w/ TDB for huge (~15GB) MySQL Database on small machine #24

Open
tognimat opened this issue Mar 14, 2017 · 8 comments

@tognimat

We are trying to load a huge MySQL database of around 15 GB into a TDB model using this tool.

However, after a few minutes of running we get an Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded. We are working on a machine with 8 GB of RAM.

How could we get all of the triples stored without having to increase the RAM to stratospheric levels? Is it possible to store the data incrementally in the B+ tree store that TDB uses?
Are we doing something wrong?
At the moment we barely use the rr:join functionality and execute the joins directly in the SQL statements. Is this bad practice, and could it be the cause of the problem?

Thanks a lot

@nkons
Owner

nkons commented Mar 15, 2017

Hi,

Thanks for noting this.

Indeed, rr:join increases RAM usage considerably and is best avoided for large datasets; placing the joins in the queries, as you already do, is the right approach.

The memory available to the tool can be set through parameters to the Java environment. Did you try increasing the -Xms and -Xmx parameters, or fiddling with JVM GC tuning?
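For illustration, an invocation along these lines raises the initial and maximum heap sizes; the jar name and arguments here are placeholders for however you launch the parser:

    java -Xms2g -Xmx6g -jar r2rml-parser.jar    # placeholder jar name/arguments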

Also, is your JVM 32- or 64-bit? Where in the stack trace does the exception occur? Could you please copy-paste it, or the log file output?

Best,
Nikos

@chrissofvista

Hi,
I know that problem. I played around with a 4 GB SQL database over the last few weeks. What I did was to place an inline SQL query that imports only chunks of data based on their id, like:
[ rr:sqlQuery """ SELECT * FROM mydatabase.db where id BETWEEN 11800001 AND 12500000 """ ];

Maybe that helps.
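For context, a chunked logical table like that might sit in a complete triples map roughly as follows; the table, column, prefix, and class names below are invented for illustration:

    <#MovieChunk1>
        rr:logicalTable [ rr:sqlQuery """ SELECT id, title FROM mydatabase.movies WHERE id BETWEEN 11800001 AND 12500000 """ ];
        rr:subjectMap [ rr:template "http://example.org/movie/{id}"; rr:class ex:Movie ];
        rr:predicateObjectMap [ rr:predicate ex:title; rr:objectMap [ rr:column "title" ] ].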

@tognimat
Author

Hi both,
Thanks for the good insights.
@nkons
Yes, I've already tried playing with the Xms and Xmx values, but of course my RAM is limited as well (8 GB nominal).
My JVM is 64-bit.
I have not yet tried fiddling with the garbage collector, and to be honest I hadn't even thought about it.
Do you perhaps have some tips based on your past experience?
I've seen in this article that there are many types of garbage collectors and that different options affect them differently, but I don't really have a feel for it yet.

I suppose I might also have to optimize the mapping template: its first part, written when I did not have much experience, is still a bit ugly.
I have an SQL table with "types" defined that I would like to map to different variables, and I simply repeated the same map under a different name for each type, using different IDs. I think a solution using the SQL CASE statement would reduce memory usage, as the movie IDs would not have to be retrieved multiple times. There are 15 of those type_ids in my table, so I believe the CASE approach might be reasonable.
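A minimal sketch of that CASE approach (the table and column names are invented for illustration) might be:

    -- derive the label per type_id in one pass instead of 15 separate maps
    SELECT movie_id,
           CASE type_id
                WHEN 1 THEN 'Actor'
                WHEN 2 THEN 'Director'
                ELSE 'Other'
           END AS type_label
    FROM movie_types;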

I get this exception:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.HashMap.createEntry(HashMap.java:897)
	at java.util.HashMap.putForCreate(HashMap.java:550)
	at java.util.HashMap.putAllForCreate(HashMap.java:555)
	at java.util.HashMap.<init>(HashMap.java:298)
	at com.hp.hpl.jena.util.CollectionFactory.createHashedMap(CollectionFactory.java:47)
	at com.hp.hpl.jena.shared.impl.PrefixMappingImpl.getNsPrefixMap(PrefixMappingImpl.java:166)
	at com.hp.hpl.jena.rdf.model.impl.ModelCom.getNsPrefixMap(ModelCom.java:1038)
	at gr.seab.r2rml.entities.Template.isUri(Template.java:96)
	at gr.seab.r2rml.beans.UtilImpl.fillTemplate(UtilImpl.java:104)
	at gr.seab.r2rml.beans.Generator.createTriples(Generator.java:282)
	at gr.seab.r2rml.beans.Main.main(Main.java:104)
R2RML Parser 0.8-alpha. Done.

@chrissofvista
I have used a similar approach as well, but I was wondering whether I was doing something wrong myself.
However, I have a question: that approach can only be applied when using the dump-to-file output, am I correct? Otherwise what is generated is a set of binary TDB files that cannot be merged with the existing ones. Should I then generate chunks and import the resulting file dumps using the import function from the website?

Thanks everyone and have a nice day

@nkons
Owner

nkons commented Mar 16, 2017

Another idea would be to run the tool remotely, from a more capable machine: the tool would establish a JDBC connection to the database and dump the contents on that machine. This would be slower, though.

Another idea would be, as @chrissofvista suggests, to use limits. You could then export a number of smaller RDF files, which could subsequently be loaded into a TDB store using the Jena command-line tools [1]. If you go down that road, I would suggest setting jena.destinationFileSyntax to N-TRIPLE. Contrary to RDF/XML and TTL, N-TRIPLE does not try to pretty-print the result and thus consumes far fewer resources.

  1. https://jena.apache.org/documentation/tdb/commands.html
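As an illustration of that route (the property-file syntax, file names, and TDB directory below are assumptions), the setting and the subsequent bulk load might look like:

    # in the parser's properties file
    jena.destinationFileSyntax = N-TRIPLE

    # load the exported chunks into a TDB store with Jena's command-line tools
    tdbloader --loc=/path/to/tdb chunk-*.nt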

@chrissofvista

Remember, you are generating triples, hopefully within a named graph. You do not have to care about merging the result files, since each of them simply represents a certain amount of triples. They are all merged when they are imported into your triple store of choice, where they are identified by their URIs and their graph.
There is a problem when thinking about incremental updates from that source, but that is a different story. At the moment we delete and re-import the graph database for each update.
I hope I have understood your question correctly.
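For example, assuming the exports are N-Triples files, the chunks could all be loaded into the same named graph (the TDB location and graph IRI below are placeholders):

    tdbloader --loc=/path/to/tdb --graph=http://example.org/graph/movies chunk-*.nt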

Cheers
Christian

@chrissofvista

I was wondering whether there is something like an internal array used to collect the results of the parser, and whether it would be possible to clear it after, let's say, 1,000 entries in order to keep the app's memory usage small.
I think that is what was meant by the garbage collector mentioned previously.

@nkons
Owner

nkons commented Mar 27, 2017

With the current implementation, the whole model has to be loaded in memory before it is flushed out to disk. I am not sure how this could be sliced, e.g. per logical table mapping.
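For what it's worth, a minimal sketch of the alternative being discussed here, streaming triples to disk as they are produced instead of accumulating one in-memory Model, could look like the following. This is not the parser's current code; the package names assume a recent Apache Jena rather than the older com.hp.hpl.jena API, and the URIs are invented:

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import org.apache.jena.graph.NodeFactory;
    import org.apache.jena.graph.Triple;
    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.system.StreamRDF;
    import org.apache.jena.riot.system.StreamRDFWriter;

    public class StreamingDumpSketch {
        public static void main(String[] args) throws Exception {
            try (OutputStream out = new FileOutputStream("dump.nt")) {
                StreamRDF writer = StreamRDFWriter.getWriterStream(out, Lang.NTRIPLES);
                writer.start();
                // in the real tool this loop would iterate over the SQL result set,
                // emitting each generated triple instead of adding it to a Model
                for (int id = 1; id <= 1000; id++) {
                    writer.triple(Triple.create(
                            NodeFactory.createURI("http://example.org/movie/" + id),
                            NodeFactory.createURI("http://www.w3.org/1999/02/22-rdf-syntax-ns#type"),
                            NodeFactory.createURI("http://example.org/Movie")));
                }
                writer.finish();
            }
        }
    }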

Regarding garbage collection, the idea I had in mind was to see whether fiddling with any of the Java VM parameters (see for instance [1] or [2]) would allow larger models to be generated without an out-of-memory exception.

  1. http://www.oracle.com/technetwork/articles/java/vmoptions-jsp-140102.html
  2. http://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/index.html
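For instance (these particular flags are only suggestions and have not been verified against this tool), one could try switching the collector or lifting the overhead check that produces this specific error:

    java -Xmx6g -XX:+UseG1GC -jar r2rml-parser.jar             # placeholder jar name
    java -Xmx6g -XX:-UseGCOverheadLimit -jar r2rml-parser.jar  # disables the "GC overhead limit exceeded" check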

@nicky508

Would it be possible to add a limit in the mapping file, so that the R2RML tool converts the data in loops of X rows and creates a separate RDF model from each loop (or from several loops)? Then the size of the database would not matter anymore.
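Until something like that exists, a rough workaround along the lines already discussed in this thread could be scripted externally; in this sketch the step size, the MIN_ID/MAX_ID placeholders, the template file name, and the parser invocation are all assumptions:

    # run the parser once per ID range by filling placeholders in a mapping template
    step=500000
    for start in $(seq 1 $step 15000000); do
        end=$((start + step - 1))
        sed "s/MIN_ID/$start/; s/MAX_ID/$end/" mapping-template.ttl > mapping.ttl
        java -Xmx6g -jar r2rml-parser.jar    # placeholder invocation; adjust to your setup
    done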
