Memory leaks when creating Donut model #10

Open
Minghao23 opened this issue Mar 1, 2019 · 7 comments

@Minghao23

Hi,

I tried to use Donut for an anomaly detection project. For various reasons, I separate the process of restoring the model from prediction, and the problem happens while restoring the model. Every time I create a Donut model and a DonutTrainer to restore a model from a saved file, a 'Graph' instance is left in memory with 1,400+ unknown back references, even after I have cleared the Donut model and every other instance I can find, and forced garbage collection afterwards. When restore is called multiple times, this makes memory keep increasing until the process is shut down.

I used objgraph.show_growth() to monitor memory, and got the following after the restore call had completely finished.

[screenshot: objgraph.show_growth() output]
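For reference, the check itself was roughly this (a sketch; the restore happens inside our own wrapper class, shown in my next comment):

```python
import gc
import objgraph

# Take a baseline snapshot of per-type object counts.
objgraph.show_growth(limit=10)

restore_model()   # create the Donut model + trainer and restore from file
gc.collect()      # force a full garbage collection pass

# Any type that still shows growth here survived garbage collection.
objgraph.show_growth(limit=10)
```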

These instances stay in memory until the process terminates. objgraph cannot output a detailed graph of the references, since the number of references is too large. I checked the Donut code and didn't find any suspicious part. Are there any possible reasons for this problem? Thanks.

@haowen-xu
Collaborator

Would you please give more details about your scenario? For example, the core code of how you use Donut?

@Minghao23
Author

Thanks for your reply. Here is the relevant code showing how I create a Donut model and restore it from file.

I create a VAEModel class that wraps all the Donut methods I want to use; below is the __init__ method.

[screenshot: VAEModel.__init__]

The model() function builds the initial network structure.

[screenshot: the model() network definition]

Here is how I restore the model.

[screenshot: restore_model()]
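Putting the three screenshots together, the code is roughly this (a sketch from memory; the layer sizes, dimensions, and save path are placeholders, not our exact values):

```python
import tensorflow as tf
from tensorflow import keras as K
from donut import Donut, DonutTrainer, DonutPredictor
from tfsnippet.modules import Sequential
from tfsnippet.utils import get_variables_as_dict, VariableSaver


class VAEModel(object):
    def __init__(self, save_dir, x_dims=120, z_dims=5):
        self._save_dir = save_dir
        self._session = tf.Session()
        # Each VAEModel builds a fresh Donut model, and therefore fresh
        # graph nodes, under its own variable scope.
        with tf.variable_scope('model') as self._model_vs:
            self._model = Donut(
                h_for_p_x=Sequential([
                    K.layers.Dense(100, activation=tf.nn.relu),
                    K.layers.Dense(100, activation=tf.nn.relu),
                ]),
                h_for_q_z=Sequential([
                    K.layers.Dense(100, activation=tf.nn.relu),
                    K.layers.Dense(100, activation=tf.nn.relu),
                ]),
                x_dims=x_dims,
                z_dims=z_dims,
            )
            # The trainer and predictor add their own ops to the graph.
            self._trainer = DonutTrainer(model=self._model,
                                         model_vs=self._model_vs)
            self._predictor = DonutPredictor(self._model)

    def restore_model(self):
        # A new VariableSaver is constructed on every call.
        var_dict = get_variables_as_dict(self._model_vs)
        saver = VariableSaver(var_dict, self._save_dir)
        with self._session.as_default():
            saver.restore()
```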

The memory leak happens every time I restore a model. I wrapped the restore operation in an API built with Django: when the API is requested, it creates a VAEModel, calls restore_model(), does some other work, then returns a response and finishes. It is not reasonable that a Graph instance is still left in memory after the request handler has finished. I'm sure the VAEModel has been deleted and there is no visible reference to the Graph. So what are the other unknown references? They seem to come from the tfsnippet and zhusuan modules, which act as static scopes in memory.
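The request handler is roughly this (simplified; the view name, save path, and response shape are placeholders):

```python
import gc
from django.http import JsonResponse

def restore_view(request):
    # A fresh model per request -- this is where the Graph instances pile up.
    model = VAEModel(save_dir='./model_save_dir')
    model.restore_model()
    # ... run detection and build the response ...
    del model
    gc.collect()   # the tf.Graph still survives this
    return JsonResponse({'status': 'ok'})
```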

Thanks for your help!

@haowen-xu
Collaborator

The VariableSaver (and also tf.train.Saver) may actually create new graph nodes. So you'd better use a shared VariableSaver instead of creating a new one each time you restore the model.
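That is, build the saver once next to the model and reuse it on every restore, roughly (assuming the model_vs and save_dir from your code above):

```python
from tfsnippet.utils import get_variables_as_dict, VariableSaver

# Build once, right after constructing the model: the saver's ops are
# added to the graph a single time.
var_dict = get_variables_as_dict(model_vs)
saver = VariableSaver(var_dict, save_dir)

def restore():
    saver.restore()   # reusing the saver adds no new graph nodes
```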

As for the graph references, tfsnippet should be using weakref in instance_reuse and global_reuse to track any graph instances it has seen, so it should not be the obstacle -- at least for the recent versions (0.2.0+); I'm not sure whether the 0.1 version works as expected.

TensorFlow is a very sophisticated library, so it's very hard to ensure that your code works without memory leaks. In my experience, totally disposing of a session seems to be impossible -- did you know that in some earlier versions, the GPU memory would be totally locked up by a single call to tensorflow.python.client.device_lib.list_local_devices()? (See https://github.com/haowen-xu/tfsnippet/blob/develop/tfsnippet/examples/utils/multi_gpu.py#L30, where I managed to deal with this issue by calling the method in a sub-process.)
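The trick there is roughly this pattern (a sketch, not the exact multi_gpu.py code):

```python
import multiprocessing

def _query_devices(queue):
    # Import TensorFlow inside the child process, so whatever memory
    # list_local_devices() locks up dies together with the sub-process.
    from tensorflow.python.client import device_lib
    queue.put([d.name for d in device_lib.list_local_devices()])

def list_devices_safely():
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_query_devices, args=(queue,))
    proc.start()
    result = queue.get()
    proc.join()
    return result

print(list_devices_safely())   # e.g. ['/device:CPU:0', '/device:GPU:0']
```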

I've heard about a library named TensorFlow Serving, which is designed specifically to expose a trained model as a web service. I have no experience with this library, but you may try it.

@ZicongZhang

Hi, I am Minghao23's colleague; we are working on this project together. After several days of debugging with objgraph, I have located the leak point. I hope this will be helpful to you.
I do not think it is a problem with the Saver, because even if I call restore_model only once, a small amount of memory is still retained after the function finishes.
[screenshot: objgraph reference chain (obj_chain)]

You can see from this figure that the zhusuan module holds a static variable that is never released. I checked the code in the zhusuan library and found two @classmethods that create static variables in zhusuan/variational/base.py.
[screenshot: the two @classmethods in zhusuan/variational/base.py]

I solved the leak we were facing by releasing the static variables explicitly after the function finishes; the pattern is sketched below. I think you can figure out how to fix this issue in zhusuan.
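To illustrate the pattern (hypothetical names; the real attributes live in zhusuan/variational/base.py):

```python
class LeakyObjective(object):
    # Hypothetical stand-in for what we found: a class-level container
    # filled by a @classmethod outlives every instance, and anything it
    # stores keeps the tf.Graph reachable.
    _registry = {}

    @classmethod
    def register(cls, key, tensor):
        cls._registry[key] = tensor

# Our workaround, run after each restore/predict cycle: drop the
# class-level references explicitly so the graph can be collected.
LeakyObjective._registry.clear()
```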

@haowen-xu
Collaborator

Unfortunately I'm not the author of ZhuSuan. You may consult @thjashin, or open a pull request to fix this memory leak.

Another option is to switch to a newer version (>= v0.1.2) of TFSnippet, where I have rewritten sgvb myself, so it no longer calls ZhuSuan's VariationalObjective. Note that I've introduced many breaking changes since v0.2.0, including removing the VAE class. As a result, although the new version is more recent and should have fewer issues (and, more importantly, it is the version I keep developing), you may have to migrate some old classes from v0.1 to v0.2 if you want to use it.

@ZicongZhang

Thank you for your reply. I don't think it is possible for us to update TFSnippet right now, since it would create a lot of work and unpredictable problems in our project, which is based on Donut.
We may leave fixing this as future work. If possible, it would be really helpful if you could update the Donut library to the newest TFSnippet version.
In the meantime, I will consult @thjashin about whether he can fix this bug.

@haowen-xu
Collaborator

I have upgraded the dependency to v0.1.2, and it passes the unit tests. You may give it a try. But I will not port Donut to tfsnippet >= v0.2, because the design philosophy has changed a lot.
