Describe the bug
I hope I'm wrong, but it seems that the current training mechanism uses the validation set to update the weights during training, which is unwanted. The problem comes from the fact that we call loss.backward() in both training and validation.
The current training mechanism can be reduced to the following pseudo code (in srl-zoo/models/learner.py):
    for sample in dataloader:
        if sample in train_set:  # training mode
            model.train()
        else:  # validation mode
            model.eval()
        optimizer.zero_grad()
        Y_pred = model(X)
        loss = compute_loss(Y_pred, Y)
        loss.backward()  # <-- [Wrong] We backpropagate the gradient in both train/valid mode
        if sample in train_set:
            optimizer.step()
            loss = loss.item()
        else:
            # We don't update the weights at this iteration, but the gradients of the loss on the
            # validation samples are calculated, stored, and will be used the next time we call optimizer.step()
            loss = loss.item()
The common way to validate a model in PyTorch looks like the following:
    for epoch in range(epochs):
        model.train()
        for sample in dataloader_train:
            optimizer.zero_grad()
            loss = compute_loss(...)
            loss.backward()
            optimizer.step()
            loss = loss.item()  # release tensor

        model.eval()
        with torch.no_grad():  # it is mandatory to call both model.eval() and torch.no_grad()
            for sample in dataloader_valid:
                loss = compute_loss(...)
                loss = loss.item()  # release tensor
But in the toolbox, with torch.no_grad() is not called during validation! Besides, calling loss.backward() in validation mode has other downsides: not only is it wrong, it also wastes time doing backpropagation when we don't need the gradients.
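To illustrate why both calls matter, here is a minimal sketch (not taken from srl-zoo) showing that model.eval() alone does not stop autograd from recording the forward pass, while torch.no_grad() does:

    import torch
    import torch.nn as nn

    model = nn.Linear(4, 1)
    x = torch.randn(8, 4)

    model.eval()              # changes layer behaviour (dropout, batch norm), nothing else
    out = model(x)
    print(out.requires_grad)  # True: autograd still builds a graph for this forward pass

    with torch.no_grad():     # only this actually disables gradient tracking
        out = model(x)
    print(out.requires_grad)  # False: no graph is built, which saves time and memory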
Code example
I wrote a piece of code that mimics the training of srl-zoo and demonstrates that the current training mechanism uses the gradient of the loss on the validation data to update the model weights.
By switching between CASE = 0 (no validation) and CASE = 1 (with the wrong validation mechanism), you will see that one linear layer is sufficient to learn the task (CASE = 0, e.g. Train Loss: 0.0000001010). However, in CASE = 1, the gradients from the validation set affect the training and the model does not converge (e.g. Train Loss: 1.0033100843 | Val Loss: 1.0033100843).
SOLUTION
It's easy: just remove loss.backward() in validation mode and add torch.set_grad_enabled(False)/torch.set_grad_enabled(True) at the beginning/end of the validation phase. This also provides around a 5-10% speed-up depending on the model.
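As a rough sketch, that fix applied to the single-loop structure above could look like this (same placeholder names as in the pseudo code, not the actual srl-zoo code):

    for sample in dataloader:
        is_train = sample in train_set
        if is_train:
            model.train()
        else:
            model.eval()
        torch.set_grad_enabled(is_train)  # no autograd graph is built for validation samples

        optimizer.zero_grad()
        Y_pred = model(X)
        loss = compute_loss(Y_pred, Y)

        if is_train:
            loss.backward()   # backpropagate only in training mode
            optimizer.step()
        loss = loss.item()

    torch.set_grad_enabled(True)  # restore the default once the loop is done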
Hello,
The gradients are computed, but no gradient step is taken with them: optimizer.zero_grad() is called before the next step happens.
However, I agree that having two dataloaders would be cleaner.
I'm working on a pull request for all the issues I posted. This problem (two dataloaders) has been addressed, although I need more time to make sure everything is fine before I open the pull request. For the impatient, you can take a look at my fork: https://github.com/ncble/srl-zoo/tree/adv_srl
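For reference, a two-dataloader setup could look roughly like the following minimal sketch (illustrative names and a dummy dataset, not the actual code from the fork):

    import torch
    from torch.utils.data import DataLoader, TensorDataset, random_split

    # dummy data standing in for the SRL observations
    dataset = TensorDataset(torch.randn(1000, 4), torch.randn(1000, 1))

    # split once, then build one loader per split so training and validation never share a loop
    n_val = int(0.2 * len(dataset))
    train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])

    dataloader_train = DataLoader(train_set, batch_size=32, shuffle=True)
    dataloader_valid = DataLoader(val_set, batch_size=32, shuffle=False)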