I get an OOM error during inference but not during training.
This happens even with a batch size of 1, and even after increasing the GPU memory.
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 11.25 GiB (GPU 0; 44.42 GiB total capacity; 36.96 GiB already allocated; 3.95 GiB free; 38.83 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I think this is not a genuine OOM issue, but rather the Trainer reserving more and more memory.
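To verify whether reserved memory really keeps growing, you can compare PyTorch's allocated and reserved counters around the inference call; a minimal sketch using the standard torch.cuda statistics:

import torch

def log_cuda_memory(tag: str) -> None:
    # "allocated" is memory handed out to live tensors; "reserved" is what
    # PyTorch's caching allocator holds from the driver. A large gap between
    # the two points at fragmentation rather than real exhaustion.
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"[{tag}] allocated: {allocated:.2f} GiB, reserved: {reserved:.2f} GiB")

log_cuda_memory("before inference")
# ... run the inference step here ...
log_cuda_memory("after inference")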
There could be several factors behind the OOM error:
First, techniques such as beam search in NLP models, or large batch sizes during evaluation, can inflate memory usage.
Secondly, memory fragmentation can lead to inefficient use of GPU memory.
Try using this:
import torch

# Cap this process at 90% of the GPU's memory; adjust the fraction as needed.
torch.cuda.set_per_process_memory_fraction(0.9)
# Allow TF32 matmuls: a speed optimization, not a memory saver.
torch.backends.cuda.matmul.allow_tf32 = True
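Since the error message itself suggests max_split_size_mb, you could also configure the caching allocator via PYTORCH_CUDA_ALLOC_CONF before CUDA is initialized (128 below is just an illustrative starting value to tune):

import os

# Must be set before the first CUDA allocation (ideally before importing torch).
# Limiting the split size can reduce fragmentation of cached blocks.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch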
Or you may try clearing the cache before starting inference:
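import gc
import torch

# Drop unreachable Python references first so their tensors become collectable,
# then return cached, unused blocks to the driver. Note this does not free
# memory still held by live tensors.
gc.collect()
torch.cuda.empty_cache()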