
Multithread Variable Input Size Fix #160

Open · wants to merge 109 commits into master from SpatialConvolution-multithread

Conversation

jonathanasdf

Use the same instance of iSize to fix #155

After fixing this for SpatialConvolution, I went ahead and made the same changes to the other modules. Are there any tests I can run to make sure this did not break anything?

@fmassa
Collaborator

fmassa commented Apr 6, 2016

Nice!
I think this could break loading saved models that don't have the iSize field (i.e., models saved before ever being forwarded).
Wouldn't it be better to simply lazily initialize self.iSize = self.iSize or torch.LongTensor(4):zero() in createIODescriptors?
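For reference, a minimal sketch of what that lazy initialization could look like; the surrounding function body is illustrative scaffolding, not the actual cudnn.torch source:

```lua
-- Hedged sketch of the lazy-initialization suggestion above.
function SpatialConvolution:createIODescriptors(input)
   -- create iSize on first use, so models saved without the field still load
   self.iSize = self.iSize or torch.LongTensor(4):zero()
   local sizeChanged = false
   for i = 1, 4 do
      if input:size(i) ~= self.iSize[i] then sizeChanged = true end
   end
   if sizeChanged then
      for i = 1, 4 do self.iSize[i] = input:size(i) end
      -- ... (re)create the cuDNN input/output descriptors for the new shape ...
   end
end
```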

@soumith
Owner

soumith commented Apr 6, 2016

I think we never save a model before forwarding it, so this seems reasonable as a change. The lazy initialization won't work with the threading issue here, though.

@fmassa
Collaborator

fmassa commented Apr 6, 2016

@soumith that happened to me at least once in the past (torch/nn#712 (comment)), but I'd say it's a fair enough change :)

@jonathanasdf
Author

jonathanasdf commented Apr 6, 2016

I kept both the explicit initialization in the constructor and the lazy initialization in createIODescriptors.

The explicit initialization is needed so that the threads share the same tensor. If there were only the lazy initialization, each thread would create its own copy of iSize, but the threads would still end up using the same descriptors, which leads to descriptors sometimes not having the right size.

But when a model is created with nn modules and converted using cudnn.convert, __init is not called, so iSize is not initialized. I left the lazy initialization in so this does not crash, but in that case the original bug would still exist.
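To illustrate the sharing point, here is a minimal sketch of the constructor-side initialization being described (the argument list and names are the shape of the idea, not the exact patch):

```lua
-- Because the tensor is created once in __init, everything that shares this
-- module instance (e.g. worker threads) also shares the one iSize tensor,
-- and createIODescriptors can update it in place rather than rebinding it.
function SpatialConvolution:__init(nInputPlane, nOutputPlane, kW, kH, dW, dH, padW, padH)
   parent.__init(self, nInputPlane, nOutputPlane, kW, kH, dW, dH, padW, padH)
   self.iSize = torch.LongTensor(4):zero()
end
```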

@jonathanasdf
Author

Removed the lazy initialization of iSize based on the discussion above and added some hacky code to convert. It seems to still work fine based on my tests on AlexNet. Please let me know if there is a better way, or if you would prefer not to merge this change.
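For readers following along, a hedged sketch of what the convert-side fix could look like; the helper name and traversal here are illustrative assumptions, not the actual patch:

```lua
-- cudnn.convert swaps nn modules for cudnn ones without calling __init,
-- so a converted module may lack iSize; walk the net and patch it in.
local function ensureISize(net)
   for _, m in ipairs(net:listModules()) do
      if torch.typename(m):match('^cudnn%.') and m.iSize == nil then
         m.iSize = torch.LongTensor(4):zero()
      end
   end
end
```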

@jonathanasdf force-pushed the SpatialConvolution-multithread branch from f58cf88 to 60d84f9 on April 6, 2016 at 18:10
@borisfom
Contributor

borisfom commented Apr 6, 2016

A month ago I took a stab at refactoring the Convolution classes; a very outdated version is here: https://github.com/borisfom/cudnn.torch/tree/R5_exp (FullConvolution has changed fundamentally since then). However, it may give you some ideas. I believe the biggest ROI on reuse can be achieved by extracting the part that uses the cudnnFind functions: this can be generalized across most classes, not just convolutions. Also, the selection of algo/workspace can be further improved by using the new cuDNN FindEx functions.

@borisfom
Contributor

borisfom commented Apr 6, 2016

I found some neat examples of taking advantage of Lua's ability to bind C functions by name in the new RNN class contributed by the NVIDIA folks here: https://github.com/borisfom/cudnn.torch/blob/R5/RNN.lua (createDescriptors, etc.).
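The pattern being pointed at looks roughly like the following: a simplified sketch of calling cuDNN C functions by string name through LuaJIT's FFI, with an illustrative errcheck helper rather than the actual RNN.lua code:

```lua
local ffi = require 'ffi'

-- Minimal cdefs for the sketch; the real bindings declare far more.
ffi.cdef[[
typedef int cudnnStatus_t;
typedef struct cudnnTensorStruct *cudnnTensorDescriptor_t;
cudnnStatus_t cudnnCreateTensorDescriptor(cudnnTensorDescriptor_t *desc);
cudnnStatus_t cudnnDestroyTensorDescriptor(cudnnTensorDescriptor_t desc);
]]
local C = ffi.load('cudnn')

-- Indexing the FFI library table with a string selects the C function,
-- which lets create/destroy pairs be generated generically by name.
local function errcheck(fname, ...)
   local status = C[fname](...)
   assert(status == 0, fname .. ' failed with status ' .. tostring(status))
end

local desc = ffi.new('cudnnTensorDescriptor_t[1]')
errcheck('cudnnCreateTensorDescriptor', desc)
errcheck('cudnnDestroyTensorDescriptor', desc[0])
```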

@jonathanasdf changed the title from "Spatial convolution multithread" to "Multithread Variable Input Size Fix" on April 8, 2016
@jonathanasdf force-pushed the SpatialConvolution-multithread branch from 60d84f9 to 6c51cae on April 8, 2016 at 16:19
soumith and others added 30 commits (August 6, 2016 13:32):
It seems that Lua 5.3 doesn't like it when you put floats into long tensors. Simply taking the floor explicitly (which is what Lua 5.2 does implicitly) seems to work; see the sketch after this list.
Prevent BatchNorm from backward in evaluate mode
fix output params for cudnnGetFilterNdDescriptor
resetStates() also reset grad{Hidden,Cell}Output
Volumetric softmax and cross entropy criterion
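A tiny sketch of the Lua 5.3 behavior described in the first commit message above (the values are illustrative):

```lua
-- Under Lua 5.3, '/' always yields a float, and torch's bindings reject
-- storing a float with a fractional part into a LongTensor; under Lua 5.2
-- (a single number type) the C-side conversion truncated silently.
local iSize = torch.LongTensor(4):zero()
local h = 13 / 2            -- 6.5 on both 5.2 and 5.3
-- iSize[1] = h             -- errors on Lua 5.3
iSize[1] = math.floor(h)    -- explicit floor works on both versions
```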
Successfully merging this pull request may close the following issue: cuDNN.torch not thread safe when input size changes (#155)