-
Hi, I have been stuck on this for the past week and eventually zeroed in on an issue with MicroStream (or maybe I am using it wrong). We noticed several of our microservice pods getting OOMKilled even though I thought we had set our Java startup parameters correctly. Heap dumps and JFRs didn't show anything out of the ordinary, but when running `top` I noticed the RES/RSS going way beyond what was even captured by Java NMT (i.e. `jcmd VM.native_memory`). Eventually the pod gets killed and the cycle starts again.

Our use case is simple: we want to cache data we pull from an Oracle DB into MicroStream (we schedule a bulk load every 10 minutes to refresh it). One of our largest Lists is around 250MB, which doesn't seem too large. At first I thought there was an issue with the NioFileSystem, but switching the storage target to SQLite produced the same behavior, and I am following the examples as closely as possible. I know from a Stack Overflow post that MicroStream is implemented using direct ByteBuffers, which explains why I'm having a hard time profiling this with Java-based tools. I even tried jemalloc, thinking it was a fragmentation problem; I believe it slowed the growth down but didn't solve the leak.

This is a simplified version of our boot flow. Assuming the MicroStream database was already persisted, our pod start sequence looked like this (we had more than one EmbeddedStorageManager, one per cache):
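(A sketch rather than the literal code: `CacheBootstrap`, `CacheEntry`, and the storage path are simplified placeholders, `DataRoot` is the root class shown a bit further down, and the package names assume MicroStream 5.x.)

```java
import java.nio.file.Paths;
import java.util.List;

import one.microstream.storage.embedded.types.EmbeddedStorage;
import one.microstream.storage.embedded.types.EmbeddedStorageManager;

public class CacheBootstrap
{
    private volatile List<CacheEntry> cache;   // the data the service actually uses

    public void reload()
    {
        // start the storage manager against the already-persisted storage directory
        final EmbeddedStorageManager storage = EmbeddedStorage.start(Paths.get("/data/cache-a"));

        // read the root graph and keep only the data we need in memory
        final DataRoot root = (DataRoot)storage.root();
        this.cache = root.getEntries();

        // shut the manager down until the next scheduled bulk reload (~every 10 minutes)
        storage.shutdown();
    }
}
```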
Basically, we just wanted to pull the data from the file system, load it into memory for use, and shut down the storage manager until we need to reload it again. I added a bunch of code to check how the memory looked after I closed the StorageManager, and even cleared the reference to the root and performed explicit GCs, which does clear it from the Java heap. However, the RSS/RES was not going down, and the off-heap memory kept going up every time I did this. I even tried the various explicit cleanup methods, roughly like this:
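(Continuing the sketch above; this is the shape of the teardown I tested after each load, not the literal code, and `unload` is a placeholder name.)

```java
// simplified teardown tested after each bulk load
private void unload(final EmbeddedStorageManager storage)
{
    storage.shutdown();   // stop the storage manager for this cache
    this.cache = null;    // drop our reference to the loaded data / root graph
    System.gc();          // explicit GC: the Java heap shrinks as expected,
                          // but RES/RSS (off-heap) keeps climbing
}
```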
All to no avail. This was not noticeable with our smallest caches, but for the 250MB one it was. By the way, our root is set to something like this:
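(Simplified; the field names are changed and `CacheEntry` stands in for our actual entity class.)

```java
import java.util.ArrayList;
import java.util.List;

// root object registered with the EmbeddedStorageManager
public class DataRoot
{
    // our largest instance of this list weighs in at roughly 250MB
    private final List<CacheEntry> entries = new ArrayList<>();

    public List<CacheEntry> getEntries()
    {
        return this.entries;
    }
}
```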
Our container is Oracle Linux 8.4 with the latest OpenJDK 17 and Helidon 2.4.1 (I used the MicroStream version integrated with it, 5.0 GA, but also confirmed the same behavior with the 7.0 GA release of MicroStream). I guess the first question is: are we using the StorageManager the right way? Thanks.
Replies: 5 comments 1 reply
-
FYI, after a lot of trial and error, I found a workaround to this issue. I ended up wrapping the root with a RootWrapper object containing a Lazy Reference, then just before I closed the StorageManager, I called the .clear() method. The off-heap RES then ended up getting freed as expected.
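Roughly, the wrapper looks like this (simplified; `CacheEntry` and the method names are placeholders, and the Lazy API shown assumes MicroStream 5.x):

```java
import java.util.List;

import one.microstream.reference.Lazy;

// the root is now a thin wrapper holding the data behind a Lazy reference
public class RootWrapper
{
    private Lazy<List<CacheEntry>> entries;

    public void setEntries(final List<CacheEntry> data)
    {
        this.entries = Lazy.Reference(data);
    }

    public List<CacheEntry> getEntries()
    {
        return Lazy.get(this.entries);
    }

    // called right before storageManager.shutdown()
    public void clearEntries()
    {
        if (this.entries != null)
        {
            this.entries.clear();   // releases the lazily referenced data
        }
    }
}
```

Calling clearEntries() right before shutdown() is what finally made the off-heap RES drop.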
This also seems related to an earlier discussion about not being able to call .shutdown() and then .start() right after, which is odd. Intuitively, any resources tied to the StorageManager should be cleanly released when it is closed; since it implements AutoCloseable, I would expect it to behave like a DataSource connection.close(). This was particularly nasty since the leak didn't show up in heap dumps, JFRs, or even NMT.
-
Hello, that solution with the Lazy reference is interesting. Do you keep a reference to the data returned from the storage manager's root? If so, this might explain why the memory is not freed after shutting down MicroStream. MicroStream will never clear any user data from memory (except if Lazy is used); that is the task of the Java GC. If there is any reference to the returned root data, the GC cannot free it. In your use case it may be an option to put all the code that loads the storage data into a single method, so that everything except the loaded data can be cleaned up easily:
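Something along these lines (a simplified sketch; `DataRoot`, `CacheEntry`, and the storage path are placeholders taken from the description above):

```java
// everything that touches the storage lives inside this one method,
// so after it returns only the loaded data itself is still referenced
public List<CacheEntry> loadData()
{
    final EmbeddedStorageManager storage = EmbeddedStorage.start(Paths.get("/data/cache-a"));
    try
    {
        final DataRoot root = (DataRoot)storage.root();
        return root.getEntries();
    }
    finally
    {
        storage.shutdown();
    }
}
```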
Regarding the shutdown() / start() topic, I have to apologize for that on behalf of the team.
-
FYI, we are at the postmortem phase of this issue as we are confirming the fix in a live environment. I will be closing this topic within our team, but figured this might be useful to your team. It was the initial stack trace that gave the strongest clue about where the leak was occurring (line numbers should match the 05.00.02-MS-GA version of MicroStream):
-
One last thing I wanted to note to clarify the issue: the GC is indeed clearing the heap objects as intended, as observed in the JFRs and heap dumps, and heap usage stays well below our Xmx setting. It is the off-heap memory that keeps growing. I suspect there is a code path where the PersistenceManager does not properly call deallocateDirectByteBuffer().
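For reference, a quick way to watch the JVM's direct-buffer pools from inside the process (only a sketch; it covers buffers allocated via ByteBuffer.allocateDirect, not memory allocated directly through Unsafe):

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

// prints the current size of the JVM's "direct" and "mapped" buffer pools
public class DirectMemoryProbe
{
    public static void main(final String[] args)
    {
        for (final BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class))
        {
            System.out.printf("%-8s count=%d used=%d bytes capacity=%d bytes%n",
                pool.getName(), pool.getCount(), pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}
```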
-
Many thanks for your efforts, and sorry that I was not able to provide a good solution. Unfortunately, so far I have not been able to create a scenario that causes a memory leak the way you described. Maybe I missed an important detail…
I’ll forward your bug description to our test guys; maybe they can reproduce the problem some day.