Training: CUDA: Out of Memory Optimizations #4
Thank you for your interest in our paper! Can you make sure you installed the packages following the instructions in the README, and that the code is unchanged? Training the privileged agent with the default batch size will not take up 10 GB of memory, so also double check that you don't have other programs running on that GPU.
Hi @dianchen96, thank you for the quick response. The only difference is that I'm using the PyTorch 1.0.0 build py3.5_cuda10.0.130_cudnn7.4.1_1 instead of the suggested py3.5_cuda8.0.61_cudnn7.1.2_1, since my GPU was raising warnings to use a newer version of CUDA. The benchmark_agent.py script worked without any issues.
Can you test with a smaller batch size, like 32 or 16, and see if those OOM?
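For reference, a minimal sketch of why this helps (the dataset shape and loader below are placeholders, not the repo's actual training code): peak activation memory scales roughly linearly with the batch size, since only one batch at a time has to fit on the GPU.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data standing in for the birdview frames; shapes are illustrative only.
dummy_images = torch.zeros(256, 7, 192, 192)
dataset = TensorDataset(dummy_images)

# Halving batch_size roughly halves the activations the forward pass must hold on the GPU.
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=2)

for (batch,) in loader:
    batch = batch.cuda()  # only one batch lives on the GPU at a time
    # ... forward/backward pass would go here ...
    break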
UPDATE: Thanks
In the train loop, try changing
to
Hi, the line location_preds = [location_pred(h) for location_pred in self.location_pred] and the subsequent stack of those location predictions seem to make my GPU run out of memory (an RTX 2080 with 11 GB of available memory). Any suggestions regarding this?
Hmm, interesting. We have not experienced this; I am not sure how much of it is due to the hardware or the CUDA/cuDNN mismatch. Does the OOM happen right after this operation, or does it happen during the backward pass?
It happens right after the operation. I am currently using CUDA 10.0.130 with cuDNN 7.6.0.
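In case it helps to localize this, here is a minimal, self-contained sketch of checking where the allocation spikes with torch.cuda counters (the heads, feature sizes, and batch size below are made up, not the repo's actual values):

import torch
import torch.nn as nn

def log_mem(tag):
    # memory_allocated: tensors currently live; max_memory_allocated: peak since start.
    print("{}: allocated={:.0f} MiB, peak={:.0f} MiB".format(
        tag,
        torch.cuda.memory_allocated() / 2**20,
        torch.cuda.max_memory_allocated() / 2**20))

# Toy stand-ins for the prediction heads and the shared feature h.
heads = nn.ModuleList([nn.Linear(4096, 4096) for _ in range(5)]).cuda()
h = torch.randn(128, 4096, device="cuda", requires_grad=True)

log_mem("before heads")
preds = [head(h) for head in heads]  # mirrors the list comprehension quoted above
log_mem("after heads")
stacked = torch.stack(preds, dim=1)
log_mem("after stack")
stacked.sum().backward()
log_mem("after backward")

If the peak only jumps at the backward call, it is the graph held for all of the heads that runs out of memory, not the stack itself.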
I had the same issue. I was able to run with a minibatch size of 128. Delving deeper into the problem, I noticed the reserved memory doubling from the end of the first iteration to the end of the second.
I opened #19, which explicitly deletes device-converted tensors at the end of each iteration, to let Python/PyTorch know the memory is free again. I'm not sure whether this has always been necessary across iterations, but in general the following pattern is quite memory inefficient:

import torch

foo = torch.zeros(512, 512, 512).to(0)
# At this point 512 MB is used.

# The next allocation reserves a further 512 MB, so 1024 MB is needed in total:
foo = torch.zeros(512, 512, 512).to(0)
# Only now is the previous object freed, so usage drops back to 512 MB
# (PyTorch still keeps the "freed" 512 MB as cache unless it is explicitly released).
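A minimal, self-contained sketch of the pattern the PR applies (the model, loader, and variable names here are toy stand-ins, not the actual train_birdview.py loop):

import torch
import torch.nn as nn

device = torch.device("cuda")

# Toy stand-ins; the real script builds these from its own dataset and network.
model = nn.Linear(64, 2).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loader = [(torch.randn(16, 64), torch.randn(16, 2)) for _ in range(10)]

for birdview, target in loader:
    birdview = birdview.to(device)
    target = target.to(device)

    pred = model(birdview)
    loss = criterion(pred, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Explicitly drop the device-converted tensors (and the loss, which keeps the
    # graph output alive) so the allocator can reuse those blocks in the next
    # iteration instead of reserving a second copy on top of what is already cached.
    del birdview, target, pred, loss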
@Kin-Zhang If you apply this PR you can run at a 2x larger batch size: #19.
Hi,
A wonderful paper, and thanks for providing the implementation so that the results can be reproduced.
I have tried training the privileged agent using the script mentioned in the README:
python train_birdview.py --dataset_dir=../data/sample --log_dir=../logs/sample
I get a RuntimeError: Tried to allocate 144.00 MiB (GPU 0; 10.73 GiB total capacity; 9.77 GiB already allocated; 74.62 MiB free; 69.10 MiB cached), followed by a ConnectionResetError.
I tried tracing the error with nvidia-smi and found that the memory usage quickly builds up (reaching the maximum) before the training begins.
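For reference, the same build-up can be checked from inside the script with standard torch.cuda counters (a minimal sketch, nothing specific to this repo) placed right before the training loop:

import torch

# How much of the GPU is already claimed before the first batch is processed.
# memory_allocated counts live tensors; memory_cached counts what the allocator has
# reserved from the driver (named memory_reserved in newer PyTorch releases).
print("allocated: {:.0f} MiB".format(torch.cuda.memory_allocated() / 2**20))
print("cached:    {:.0f} MiB".format(torch.cuda.memory_cached() / 2**20))
print("total:     {:.0f} MiB".format(
    torch.cuda.get_device_properties(0).total_memory / 2**20))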
Any leads and suggestions are much appreciated.
Thanks
Attaching the full stack trace for further reference