Error when trying to use CudaHalfTensor for training #332
Comments
This is presumably a bug in cuDNN 5.1 itself.
@Marcel1991 : Can you post a snippet of the code where this is happening? Also, try using cuDNN 6; a number of FP16 issues were fixed there. Check out the R6 branch of this repo and run 'luarocks make cudnn-scm-1.rockspec'.
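In other words (a sketch of those steps; the clone path and an existing CUDA 8 / cuDNN 6 install are assumed, not stated in the comment):

```sh
# assumes the cudnn.torch repo is already cloned and cuDNN 6 is installed
cd cudnn.torch            # path to your local clone (adjust as needed)
git checkout R6
luarocks make cudnn-scm-1.rockspec
```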
@borisfom : So cutorch.hasFastHalfInstructions() returns false. My GPU is a Titan X Pascal. I have now tried cuDNN 6 with the R6 branch. It's still not working, but now I get a new error that points more clearly to where something is going wrong:
The relevant code is:

```lua
batchInputs = torch.CudaHalfTensor()
batchLabels = torch.CudaHalfTensor()

-- trains one minibatch on the module
function TrainManager.trainBatch(self, batchInputsCpu, batchLabelsCpu)
   local waitTime = waitTimer:time().real
   cutorch.synchronize()
   local batchTimer = torch.Timer()
   collectgarbage() -- free unused memory
   cutorch.synchronize()
   local options = self.options

   -- copy data into GPU tensors
   batchInputs:resize(batchInputsCpu:size()):copy(batchInputsCpu)
   batchLabels:resize(batchLabelsCpu:size()):copy(batchLabelsCpu)

   local batchLoss
   -- sgd expects a function with input: modelParameters; output: loss, gradParams
   local opFunction = function(modelParameters)
      model:zeroGradParameters()
      local outputs = model:forward(batchInputs)
      batchLoss = criterion:forward(outputs, batchLabels)
      local gradientOutputs = criterion:backward(outputs, batchLabels)
      model:backward(batchInputs, gradientOutputs)
      -- L2 regularization (the L2 loss is not added to the reported error,
      -- for a fair comparison of different L2 settings)
      -- batchLoss = batchLoss + optimisationState.regL2 * torch.norm(modelParameters, 2)^2/2
      -- gradientParameters:add( modelParameters:clone():mul(optimisationState.regL2) )
      return batchLoss, gradientParameters
   end
   optim.adam(opFunction, modelParameters, optimisationState)
   ...
```

The error occurs on the last line, when adam() is called. The same happens with the sgd() function.
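For context, `modelParameters` and `gradientParameters` are not defined in the snippet above; a minimal sketch of the surrounding optim setup they would typically come from, assuming the standard `model:getParameters()` pattern (the learning rate and names are illustrative, not from the issue):

```lua
require 'optim'

-- assumed surrounding setup: optim works on the flattened parameter and
-- gradient tensors returned by getParameters() (model is assumed to exist)
local modelParameters, gradientParameters = model:getParameters()

-- illustrative optimiser state; the actual values are not shown in the issue
local optimisationState = { learningRate = 1e-3 }

-- opFunction must return (loss, gradParams) evaluated at modelParameters
optim.adam(opFunction, modelParameters, optimisationState)
```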
Does anyone use CudaHalfTensor successfully with a Titan X Pascal? If so, which NVIDIA driver and which Ubuntu version do you use?
I want to train my model using fp16 precision on my GPU. My GPU has the Pascal architecture, and the cutorch.hasHalf flag is true. I am using cuDNN 5.1 and CUDA Toolkit 8.0.
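A quick check of both flags mentioned in this thread can confirm what the build reports (a minimal sketch; cutorch.hasHalf is a boolean field and cutorch.hasFastHalfInstructions() is a function, as used in the comments above):

```lua
require 'cutorch'

-- hasHalf: the CudaHalfTensor type is available in this cutorch build
-- hasFastHalfInstructions(): the current GPU has fast native fp16 arithmetic
print('cutorch.hasHalf:                   ', cutorch.hasHalf)
print('cutorch.hasFastHalfInstructions(): ', cutorch.hasFastHalfInstructions())
```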
As far as I understand it, I only have to change the tensors allocated on my GPU from CudaTensor to CudaHalfTensor, and the calculations should then run in fp16 precision. However, when I do that, the optim.sgd() call fails with: "No algorithms found that would fit in free GPU memory".
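A minimal sketch of that change, using a toy network in place of VGG16 (all names and sizes here are illustrative, not from the original post):

```lua
require 'cutorch'
require 'nn'
require 'cunn'

-- toy stand-in for the real VGG16; only the type conversion is the point
local model = nn.Sequential()
   :add(nn.Linear(512, 512))
   :add(nn.ReLU(true))
   :add(nn.Linear(512, 10))
local criterion = nn.CrossEntropyCriterion()

-- the intended change: convert modules, criterion and data to CudaHalfTensor
model = model:type('torch.CudaHalfTensor')
criterion = criterion:type('torch.CudaHalfTensor')

local batchInputs = torch.randn(8, 512):type('torch.CudaHalfTensor')

print(torch.type(model:get(1).weight))  -- torch.CudaHalfTensor
```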
Am I doing something wrong? Or is fp16 actually supported for a VGG16 model using sgd?
The detailed error message is: