Memory leaks when creating Donut model #10

Open
Minghao23 opened this issue Mar 1, 2019 · 7 comments

@Minghao23

Hi,

I tried to use Donut for an anomaly detection project. For various reasons, I separate the process of restoring the model from prediction, and the problem happens while restoring the model. Every time I create a Donut model and a DonutTrainer to restore a model from a saved file, a 'Graph' instance is left in memory with 1,400+ unknown back references, even after I have cleared the Donut model and every other instance I can find, and forced garbage collection afterwards. When restore is called multiple times, this makes memory keep increasing until the process is shut down.

I used objgraph.show_growth() to monitor memory, and got the following after the restore call had completely finished.

[screenshot: objgraph.show_growth() output]
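For reference, the check itself was roughly this (a sketch; the restore happens inside our own wrapper class, shown in my next comment):

```python
import gc
import objgraph

# Take a baseline snapshot of per-type object counts.
objgraph.show_growth(limit=10)

restore_model()   # create the Donut model + trainer and restore from file
gc.collect()      # force a full garbage collection pass

# Any type that still shows growth here survived garbage collection.
objgraph.show_growth(limit=10)
```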

These instances stay in memory until the process terminates. objgraph cannot output a detailed graph of the references, since the number of references is too large. I checked the Donut code and didn't find any suspicious part. Are there any possible reasons for this problem? Thanks.

@haowen-xu
Collaborator

Would you please give more details about your scenario? For example, the core code of how you use Donut?

@Minghao23
Author

Thanks for your reply. Here is the relevant code showing how I create a Donut model and restore it from file.

I create a VAEModel class that wraps all the Donut methods I want to use; below is the __init__ method.

[screenshot: VAEModel.__init__]

The model() function builds the initial network structure.

[screenshot: the model() network definition]

Here is how I restore the model.

[screenshot: restore_model()]
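Putting the three screenshots together, the code is roughly this (a sketch from memory; the layer sizes, dimensions, and save path are placeholders, not our exact values):

```python
import tensorflow as tf
from tensorflow import keras as K
from donut import Donut, DonutTrainer, DonutPredictor
from tfsnippet.modules import Sequential
from tfsnippet.utils import get_variables_as_dict, VariableSaver


class VAEModel(object):
    def __init__(self, save_dir, x_dims=120, z_dims=5):
        self._save_dir = save_dir
        self._session = tf.Session()
        # Each VAEModel builds a fresh Donut model, and therefore fresh
        # graph nodes, under its own variable scope.
        with tf.variable_scope('model') as self._model_vs:
            self._model = Donut(
                h_for_p_x=Sequential([
                    K.layers.Dense(100, activation=tf.nn.relu),
                    K.layers.Dense(100, activation=tf.nn.relu),
                ]),
                h_for_q_z=Sequential([
                    K.layers.Dense(100, activation=tf.nn.relu),
                    K.layers.Dense(100, activation=tf.nn.relu),
                ]),
                x_dims=x_dims,
                z_dims=z_dims,
            )
            # The trainer and predictor add their own ops to the graph.
            self._trainer = DonutTrainer(model=self._model,
                                         model_vs=self._model_vs)
            self._predictor = DonutPredictor(self._model)

    def restore_model(self):
        # A new VariableSaver is constructed on every call.
        var_dict = get_variables_as_dict(self._model_vs)
        saver = VariableSaver(var_dict, self._save_dir)
        with self._session.as_default():
            saver.restore()
```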

The memory leak happens every time I restore a model. I wrapped the restore operation in an API built with Django: when the API is requested, it creates a VAEModel, calls restore_model(), does some other work, then returns a response and finishes. It is not reasonable that a Graph instance is still left in memory after the request handler has finished. I'm sure the VAEModel has been deleted and there is no visible reference to the Graph. So what are the other unknown references? They seem to come from the tfsnippet and zhusuan modules, which act as static scopes in memory.
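The request handler is roughly this (simplified; the view name, save path, and response shape are placeholders):

```python
import gc
from django.http import JsonResponse

def restore_view(request):
    # A fresh model per request -- this is where the Graph instances pile up.
    model = VAEModel(save_dir='./model_save_dir')
    model.restore_model()
    # ... run detection and build the response ...
    del model
    gc.collect()   # the tf.Graph still survives this
    return JsonResponse({'status': 'ok'})
```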

Thanks for your help!

@haowen-xu
Collaborator

The VariableSaver (and also tf.train.Saver) may actually create new graph nodes. So you'd better use a shared VariableSaver instead of creating a new one each time you restore the model.
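That is, build the saver once next to the model and reuse it on every restore, roughly (assuming the model_vs and save_dir from your code above):

```python
from tfsnippet.utils import get_variables_as_dict, VariableSaver

# Build once, right after constructing the model: the saver's ops are
# added to the graph a single time.
var_dict = get_variables_as_dict(model_vs)
saver = VariableSaver(var_dict, save_dir)

def restore():
    saver.restore()   # reusing the saver adds no new graph nodes
```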

As for the graph references, tfsnippet should be using weakref in instance_reuse and global_reuse to track any graph instances it has seen, so it should not be the obstacle -- at least for the recent versions (0.2.0+); I'm not sure whether the 0.1 version works as expected.

TensorFlow is a very sophisticated library, so it's very hard to ensure that your code works without memory leaks. In my experience, totally disposing of a session seems to be impossible -- did you know that in some earlier versions, the GPU memory would be totally locked up by a single call to tensorflow.python.client.device_lib.list_local_devices()? (See https://github.com/haowen-xu/tfsnippet/blob/develop/tfsnippet/examples/utils/multi_gpu.py#L30, where I managed to deal with this issue by calling the method in a sub-process.)
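The trick there is roughly this pattern (a sketch, not the exact multi_gpu.py code):

```python
import multiprocessing

def _query_devices(queue):
    # Import TensorFlow inside the child process, so whatever memory
    # list_local_devices() locks up dies together with the sub-process.
    from tensorflow.python.client import device_lib
    queue.put([d.name for d in device_lib.list_local_devices()])

def list_devices_safely():
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_query_devices, args=(queue,))
    proc.start()
    result = queue.get()
    proc.join()
    return result

print(list_devices_safely())   # e.g. ['/device:CPU:0', '/device:GPU:0']
```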

I've heard about a library named TensorFlow Serving, which is designed specifically to expose a trained model as a web service. I have no experience with this library, but you may try it.

@ZicongZhang

Hi, I am Minghao23's colleague; we are working on this project together. After several days of debugging with objgraph, I have located the leak point. I hope this will be helpful to you.
I do not think it is a problem with the Saver, because even if I call restore_model only once, a small amount of memory is still retained after the function finishes.
[screenshot: objgraph reference chain (obj_chain)]

You can see from this figure that the zhusuan module holds a static variable that is never released. I checked the code in the zhusuan library and found two @classmethods that create static variables in zhusuan/variational/base.py.
[screenshot: the two @classmethods in zhusuan/variational/base.py]

I solved the leak we were facing by releasing the static variables explicitly after the function finishes; the pattern is sketched below. I think you can figure out how to fix this issue in zhusuan.
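To illustrate the pattern (hypothetical names; the real attributes live in zhusuan/variational/base.py):

```python
class LeakyObjective(object):
    # Hypothetical stand-in for what we found: a class-level container
    # filled by a @classmethod outlives every instance, and anything it
    # stores keeps the tf.Graph reachable.
    _registry = {}

    @classmethod
    def register(cls, key, tensor):
        cls._registry[key] = tensor

# Our workaround, run after each restore/predict cycle: drop the
# class-level references explicitly so the graph can be collected.
LeakyObjective._registry.clear()
```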

@haowen-xu
Collaborator

Unfortunately I'm not the author of ZhuSuan. You may consult @thjashin, or open a pull request to fix this memory leak.

Another option is to switch to a newer version (>= v0.1.2) of TFSnippet, where I have rewritten sgvb myself, so it no longer calls ZhuSuan's VariationalObjective. Note that I've introduced many breaking changes since v0.2.0, including removing the VAE class. As a result, although the new version is more recent and should have fewer issues (and, more importantly, it is the version I keep developing), you may have to migrate some old classes from v0.1 to v0.2 if you want to use it.

@ZicongZhang

Thank you for your reply. I don't think it is possible for us to update TFSnippet right now, since it would create a lot of work and unpredictable problems in our project, which is based on Donut.
We may leave fixing this as future work. If possible, it would be really helpful if you could update the Donut library to the newest TFSnippet version.
In the meantime, I will consult @thjashin about whether he can fix this bug.

@haowen-xu
Collaborator

I have upgraded the dependency to v0.1.2, and it passes the unit tests. You may give it a try. But I will not port Donut to tfsnippet >= v0.2, because the design philosophy has changed a lot.
