Memory leaks when creating Donut model #10
Hi,
I tried to use Donut for an anomaly detection project. For some reasons I separate the process of restoring the model from the prediction step, and the problem happens while restoring the model. Every time I create a Donut model and a DonutTrainer to restore a model from a saved file, a Graph instance is left in memory with 1,400+ unknown back references, even though I have already deleted the Donut model and all other instances I could find, and run garbage collection afterwards. When restore is called multiple times, this makes the memory keep increasing until the process is shut down.
I used objgraph.show_growth() to monitor memory, and it kept reporting growth even after the restore call had completely finished.
These instances stay in memory until the process terminates. objgraph cannot output a detailed graph of the references, since the number of references is too large. I checked the Donut code and didn't find any suspicious part. Are there any possible reasons for this problem? Thanks.
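For concreteness, the kind of check being described can be reproduced with a loop like the sketch below; restore_model is a hypothetical stand-in for the restore logic, not code from this thread.

```python
import gc

import objgraph

def restore_model():
    ...  # hypothetical stand-in: create a Donut model and restore it from file

for attempt in range(5):
    restore_model()
    gc.collect()            # collect everything that is really unreachable
    objgraph.show_growth()  # a 'Graph' row that grows on every pass marks the leak
```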
Would you please give more details about your scenario? For example, the core code of how you use Donut?
Thanks for your reply. Here is how I create a Donut model and restore it from file. I created a VAEModel class that wraps all the methods I want to use with Donut: its init method builds the model, a model() function contains the initial network structure, and a restore_model() method restores the model from file. The memory leak happens every time I try to restore a model. I wrapped the restore operation in an API built with Django: when the API is requested, it creates a VAEModel, calls restore_model(), does something else, then sends the response and finishes. It is not reasonable that a Graph instance is still left in memory after the request handler has finished. I'm sure the VAEModel has been deleted and there is no visible reference connected to the Graph. Where do the other unknown references come from? They seem to be from the tfsnippet and zhusuan modules, which act as static scopes in memory. Thanks for your help!
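As an illustration only, a wrapper of the kind described might look like the sketch below. It follows the model structure from the Donut README; the class name VAEModel comes from the comment above, while the layer sizes, scope name, and save_dir are assumptions.

```python
import tensorflow as tf
from donut import Donut
from tfsnippet.modules import Sequential
from tfsnippet.utils import VariableSaver, get_variables_as_dict

class VAEModel:
    """Hypothetical wrapper around Donut, reconstructed from the description."""

    def __init__(self, save_dir):
        self.save_dir = save_dir
        with tf.variable_scope('model') as self.model_vs:
            self.model = self.build_model()

    def build_model(self):
        # initial network structure, as in the Donut README
        return Donut(
            h_for_p_x=Sequential([
                tf.layers.Dense(100, activation=tf.nn.relu),
                tf.layers.Dense(100, activation=tf.nn.relu),
            ]),
            h_for_q_z=Sequential([
                tf.layers.Dense(100, activation=tf.nn.relu),
                tf.layers.Dense(100, activation=tf.nn.relu),
            ]),
            x_dims=120,
            z_dims=5,
        )

    def restore_model(self):
        # note: a fresh VariableSaver is built on every call; as pointed
        # out in the next reply, this adds new nodes to the graph each time
        var_dict = get_variables_as_dict(self.model_vs)
        saver = VariableSaver(var_dict, self.save_dir)
        saver.restore()
```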
The VariableSaver (and also tf.train.Saver) may actually create new graph nodes, so you'd better use a shared VariableSaver instead of creating a new one each time you restore the model. As for the graph reference, tfsnippet should have used weakref in the places that hold it. TensorFlow is a very sophisticated library, thus it's very hard to ensure whether or not your code can work without memory leaks. In my experience, disposing of a session completely seems to be impossible -- did you know that in some earlier versions, the GPU memory would be totally locked up by a single call to …? I've heard about a library named …
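A minimal sketch of that suggestion, assuming the get_variables_as_dict and VariableSaver helpers from tfsnippet.utils; the module-level cache below is illustrative, not part of Donut or TFSnippet.

```python
from tfsnippet.utils import VariableSaver, get_variables_as_dict

_SAVER = None  # shared across restores; illustrative module-level cache

def restore_model(model_vs, save_dir):
    global _SAVER
    if _SAVER is None:
        # build the saver (and its save/restore graph nodes) exactly once
        _SAVER = VariableSaver(get_variables_as_dict(model_vs), save_dir)
    # later calls reuse the cached saver instead of adding new graph nodes
    _SAVER.restore()
```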
Hi, I am Minghao23's colleague; we are working on this project together. After several days of debugging with objgraph, I have located the leak point, and I hope this will be helpful to you. The objgraph reference chain showed that the zhusuan module holds a static variable without releasing it. I checked the code in the zhusuan library and found two @classmethods that create static variables in zhusuan/variational/base.py. I solved the leak problem we were facing by releasing the static variables explicitly after the function finishes. I think you could figure out how to fix this issue in zhusuan.
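The workaround being described is to drop the class-level reference once the call finishes. A self-contained sketch of the pattern is shown below; LeakyClass and _cached are placeholders standing in for the actual zhusuan class and attribute, which are not named in the thread.

```python
import gc

class LeakyClass:
    """Placeholder for the zhusuan class that keeps a static reference."""
    _cached = None  # class-level attribute that outlives each call

    @classmethod
    def compute(cls, graph):
        cls._cached = graph  # the kind of reference that keeps a Graph alive
        return cls._cached

def run_and_release(graph):
    result = LeakyClass.compute(graph)
    # the explicit release applied after the function finishes:
    LeakyClass._cached = None
    gc.collect()  # the graph is now collectable once nothing else holds it
    return result
```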
Unfortunately I'm not the author of ZhuSuan. You may consult @thjashin, or open a pull request to fix this memory leak. Another choice is to switch to newer versions (>= v0.1.2) of TFSnippet, where I have rewritten sgvb by myself, so it no longer calls into ZhuSuan.
Thank you for your reply. I think it is not possible for us to update TFSnippet currently, because it would create a lot of extra work and unpredictable problems in our project, which is based on Donut.
I have upgraded the dependency to the newer TFSnippet (>= v0.1.2).