Potential Bug with find.lua - Multiple GPUs #319
Comments
The preferred method to use a subset of GPUs is setting CUDA_VISIBLE_DEVICES; otherwise torch will try to create a context on all the GPUs, and with memory on your "busy" GPUs already allocated, that could fail. Setting CUDA_VISIBLE_DEVICES to a single GPU should work. Do you have a repro where it fails? The errors that you have (cublas not initialized) are totally unrelated to cudnn.torch; it looks like something is wrong with the setup. I also suspect that `require 'cutorch'` would result in the same error.
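For reference, a minimal sketch of that workflow, assuming the process is launched with a single device exposed (the device index `2` and the script name are hypothetical):

```lua
-- Launched as, e.g.:  CUDA_VISIBLE_DEVICES=2 th check_gpu.lua   (index/script name are hypothetical)
-- With one physical GPU exposed, cutorch should see exactly one device
-- and never touch the busy GPUs.
require 'cutorch'

print('CUDA_VISIBLE_DEVICES = ' .. tostring(os.getenv('CUDA_VISIBLE_DEVICES')))
print('visible devices      = ' .. cutorch.getDeviceCount())   -- expect 1
print('current device       = ' .. cutorch.getDevice())        -- expect 1

local free, total = cutorch.getMemoryUsage(1)
print(string.format('device 1 memory: %d / %d bytes free', free, total))
```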
Thanks for the quick response.
I did just that. Tried it with each idle GPU (one at a time), which leads to the cublas error.

The reason I posted it on this repo is because I have all the paths (cuda/cudnn) set correctly. If they were incorrect, then cutorch or cunn shouldn't load, right?

Not sure what this means. You mean like an example code snippet/scenario? Just running `require 'cudnn'` reproduces it.

Also, I checked again just now when all GPUs are idle and

EDIT - This recent cutorch issue seems very relevant to mine.
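As a hedged sketch, a minimal repro along the lines described in this thread might look like the following (the idle-device index `1` is an assumption; on the machine in question it would be whichever of the four GPUs is idle):

```lua
-- Minimal repro sketch of the reported behaviour (device index is hypothetical).
require 'cutorch'

cutorch.setDevice(1)                               -- explicitly pick an idle GPU
print('before: device    ' .. cutorch.getDevice())
print('before: free bytes ' .. (cutorch.getMemoryUsage()))

-- According to the report, this is where the failure happens: loading cudnn
-- ends up touching every GPU and can error with "out of memory" /
-- "cublas not initialized" when the other GPUs are already full.
require 'cudnn'

print('after:  device    ' .. cutorch.getDevice())  -- reportedly no longer 1
```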
Hi,

I have no idea how this cropped up, but `require 'cudnn'` threw an out of memory error. This is strange because I have 4 GPUs (all TitanX; 2 idle and 2 busy) which can be detected by `cutorch.getDeviceCount()`, after explicitly setting `cutorch.setDevice()` to an idle device and verifying that the current GPU is indeed idle using `cutorch.getDevice()` and `cutorch.getMemoryUsage()`.

For some weird reason, calling `require 'cudnn'` sets the current device to a busy one with all the memory occupied. After digging a little into the traceback, I found that in init.lua `find.reset()` is called with `cutorch.synchronizeAll()` here. In cutorch's init.c, this call cycles through all available GPUs and performs a `synchronize()` on each.
Changing this to `cutorch.synchronize()` seems to solve this error, although I don't know if I've broken anything else. I've tried updating all the cudnn, cunn and cutorch modules to the latest. Finally, I also tried a fresh install of torch, to no effect.

Please let me know if I'm missing something obvious here.
OS - Ubuntu 14.04
CUDA - 7.5
cuDNN - 5103
GPUs - 4 Nvidia TitanX
The 2 busy GPUs are running tensorflow, which I think allocates all the memory by default.
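A small hedged helper, in case it is useful, for confirming which devices the other processes have filled up before loading cudnn (written against the standard cutorch API):

```lua
-- Print free/total memory for every visible GPU to see which ones
-- other processes (e.g. tensorflow) have filled up.
require 'cutorch'
for dev = 1, cutorch.getDeviceCount() do
  local free, total = cutorch.getMemoryUsage(dev)
  print(string.format('GPU %d: %.2f GiB free of %.2f GiB', dev, free / 2^30, total / 2^30))
end
```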
EDIT - Making that change to `find.lua` breaks the code.

Also tried setting CUDA_VISIBLE_DEVICES to a single GPU. This causes a long traceback to be printed.