Background
I am currently attempting to train an EfficientVIT-M2 model from scratch, utilizing the unaltered training code.

Problem Description
Throughout the training process, I've observed a consistent increase in VRAM usage, eventually leading to overflow into shared GPU memory. This overflow significantly slows down training. Initially, I achieve a training speed of approximately 5000-6000 samples per second for the first two epochs; by the third epoch, the speed drops dramatically to about 1600 samples per second. For the lite0 model, training starts at about 2000 samples per second and decreases to 1000-1500 samples per second. Despite numerous attempts to mitigate this issue, I have not found a solution.

Steps Taken
Screenshots
Has anyone encountered similar issues, or does anyone have suggestions on potential fixes? Any insights or advice would be greatly appreciated.

Command:
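Independently of the exact command used, one way to confirm that the slowdown tracks VRAM growth rather than, say, the data pipeline is to log CUDA memory statistics once per epoch. The snippet below is an illustrative sketch, not the original command or training script; the log_cuda_memory helper and the commented loop are assumptions.

```python
import torch

def log_cuda_memory(epoch: int, device: int = 0) -> None:
    """Print how much VRAM PyTorch is using versus the card's total capacity."""
    gib = 1024 ** 3
    allocated = torch.cuda.memory_allocated(device) / gib  # memory occupied by live tensors
    reserved = torch.cuda.memory_reserved(device) / gib    # memory held by the caching allocator
    total = torch.cuda.get_device_properties(device).total_memory / gib
    print(f"epoch {epoch}: allocated {allocated:.2f} GiB, "
          f"reserved {reserved:.2f} GiB, total {total:.2f} GiB")

# Hypothetical usage inside a training loop:
# for epoch in range(num_epochs):
#     train_one_epoch(model, loader, optimizer)
#     log_cuda_memory(epoch)
```

Note that these counters only reflect PyTorch's own allocator; any spill into shared GPU memory happens at the driver level and shows up in Task Manager's "Shared GPU memory" graph on Windows rather than in these numbers.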
After encountering significant VRAM overflow issues during the training of an EfficientVIT-M2 model, I developed a workaround. It's important to note that my explanation for why this solution works is based on a theory regarding the NVIDIA driver's memory management behavior.

I theorize that the underlying issue arises from the NVIDIA driver's memory manager (on Windows), which appears to optimize VRAM usage by preemptively transferring data to shared GPU memory. This seems to occur to prevent complete VRAM saturation, with the process starting when VRAM usage is just shy of its maximum capacity (around 9.8 GB in my scenario), leaving about 200 MB of VRAM "free." PyTorch, recognizing this free space, continues to allocate memory, prompting the NVIDIA driver to offload data once more to shared memory.

Proposed Workaround
To address this issue, I introduced a modification in the training script:

torch.cuda.set_per_process_memory_fraction(0.95)

This line instructs PyTorch to utilize only up to 95% of the available VRAM, theoretically preventing the NVIDIA driver from starting its preemptive eviction process. By implementing this limit, the goal is to maintain stable VRAM usage without incurring the performance penalties associated with overflow into shared GPU memory.

Note: My hypothesis about the NVIDIA driver's behavior and the effectiveness of this workaround are based on observations and testing in my specific environment.
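As a rough sanity check of the numbers above: on a card with roughly 10 GiB of VRAM, a 0.95 fraction caps PyTorch at about 9.5 GiB, which stays below the ~9.8 GiB point where eviction reportedly begins. A minimal sketch of applying the cap, assuming a single-GPU setup; the apply_vram_cap helper is illustrative, not the exact change described above.

```python
import torch

def apply_vram_cap(fraction: float = 0.95, device: int = 0) -> None:
    """Limit this process's CUDA caching allocator to a fraction of total VRAM."""
    torch.cuda.set_per_process_memory_fraction(fraction, device=device)
    total_gib = torch.cuda.get_device_properties(device).total_memory / 1024 ** 3
    print(f"cuda:{device} capped at ~{fraction * total_gib:.1f} GiB of {total_gib:.1f} GiB")

# Call once, before the model, optimizer, and data loaders allocate GPU memory:
apply_vram_cap(0.95)
# Allocations beyond the cap raise an out-of-memory error inside PyTorch;
# per the theory above, staying under the cap should keep the Windows driver
# from preemptively evicting data to shared GPU memory.
```

The fraction is measured against the device's total memory and applies per process, so VRAM used by other applications still counts against the physical limit.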
Update your NVIDIA driver if you have not already, and set the CUDA "Sysmem Fallback Policy" option to "Prefer No Sysmem Fallback" in the NVIDIA Control Panel; this disables the shared GPU memory fallback, so you will get an OOM error again :-)
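With the fallback disabled, exhausting VRAM surfaces as a hard error instead of a silent slowdown. One common way to cope, sketched below rather than taken from this thread, is to catch torch.cuda.OutOfMemoryError and retry with a smaller batch; the find_max_batch_size helper and the train_step callable are hypothetical.

```python
import torch

def find_max_batch_size(train_step, start: int = 256, floor: int = 8) -> int:
    """Halve the batch size until a single training step fits in VRAM.

    `train_step` is a hypothetical callable that runs one forward/backward
    pass at the given batch size.
    """
    batch_size = start
    while batch_size >= floor:
        try:
            train_step(batch_size)
            return batch_size
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # drop cached blocks before retrying smaller
            batch_size //= 2
    raise RuntimeError("even the smallest batch size does not fit in VRAM")
```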