Training hangs after processing 200,000 samples on FreeSound and FMA datasets with DIT model #147
Comments
What kind of dataset are you using: local samples, local WebDataset, or S3 WebDataset? Can you send along the GPU memory utilization charts as well? Are you using the default multi-GPU strategy, or DeepSpeed? Anything different about your conditioning signals? I've noticed memory leaks in the loader quite a bit, especially when using custom metadata modules. It's usually solved by reducing num_workers, but it looks like your num_workers is already set reasonably low, I think.
Additionally, I set num_workers to 4 (previously it was 8). I found that even after lowering the number of workers, the error still occurs.
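For context, the knob being discussed is the PyTorch DataLoader's num_workers. A minimal sketch of how it is typically set; the dataset object and batch size here are placeholders, not this repo's actual configuration:

```python
# Minimal sketch of the loader setting discussed above; `train_dataset` and the
# batch size are placeholders, not the project's actual configuration.
from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,           # placeholder dataset object
    batch_size=8,
    num_workers=4,           # lowered from 8; each worker is a separate CPU process
    pin_memory=True,
    persistent_workers=True, # keeps workers alive between epochs; worth disabling
                             # if worker memory keeps growing across epochs
)
```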
Seems like you're running out of CPU memory due to the memory leak I mentioned. This problem was worse for me when I was using local files + a custom metadata module that loaded extra text/JSON data for text conditioning. It became slightly less bad with WebDataset (where the JSON data is baked into the tar shards).
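To illustrate the "JSON baked into tar shards" layout mentioned above, here is a sketch using the webdataset library; the file paths, keys, shard naming, and the `samples` iterable are illustrative assumptions, not the project's actual packing script:

```python
# Sketch: pack audio + per-sample JSON metadata into WebDataset tar shards,
# so the loader never has to open separate metadata files at training time.
import json
import webdataset as wds

with wds.ShardWriter("freesound-%06d.tar", maxcount=1000) as sink:
    for key, (audio_path, metadata) in enumerate(samples):  # `samples` is a placeholder iterable
        with open(audio_path, "rb") as f:
            audio_bytes = f.read()
        sink.write({
            "__key__": f"{key:08d}",
            "flac": audio_bytes,                           # audio stored under its file extension
            "json": json.dumps(metadata).encode("utf-8"),  # conditioning metadata in the same record
        })
```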
I think I've resolved the problem: the Freesound dataset contains some extremely long audio files (e.g., "200 HOURS of Nothing.wav" is 12 GB). Although torchaudio and the dataloader can load these files individually, during multi-GPU training with DDP the dataloader crashes without throwing an error; I suspect the workers exhaust host memory (especially when num_workers > 0). Once I filtered the oversized audio files out of Freesound, training ran smoothly. Thank you very much for your help!
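A minimal sketch of that filtering step, assuming a plain list of file paths and arbitrary duration/size cutoffs (the limits and `all_audio_paths` are placeholders, not values from this issue):

```python
# Drop oversized audio files before building the training file list.
import os
import torchaudio

MAX_DURATION_S = 600           # assumed cutoff: drop anything longer than 10 minutes
MAX_FILE_SIZE_BYTES = 1 << 30  # assumed cutoff: drop files larger than ~1 GiB

def keep_file(path: str) -> bool:
    if os.path.getsize(path) > MAX_FILE_SIZE_BYTES:
        return False
    try:
        info = torchaudio.info(path)   # reads the header only, no full decode
    except Exception:
        return False                   # drop unreadable/corrupt files as well
    duration_s = info.num_frames / info.sample_rate
    return duration_s <= MAX_DURATION_S

filtered = [p for p in all_audio_paths if keep_file(p)]  # `all_audio_paths`: your file list
```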
While training the DIT model on the FreeSound (I selected 250k audio clips) and FMA datasets, training hangs after processing approximately 198,400 samples (8 GPUs, batch size of 8 per GPU, reaching 3,100 steps). After some time, an NCCL communication timeout occurs. I tried lowering the batch size to 6, but the same issue appeared after processing 198,600 samples (~4,100 steps). Interestingly, when I reduce the total number of samples in the FreeSound dataset to 150k, training proceeds without issues. Could this be related to the dataset size or NCCL synchronization across GPUs?
During the NCCL communication wait, half of the GPUs show 0% utilization while the other half show maximum utilization, but in reality none of the GPUs are doing any work (power consumption is the same as in the idle state).
Here are my training logs from WandB. As shown, the training loss stopped updating at 3000 steps, but memory usage continued to be logged. I’ve already ruled out dataset issues and CUDA out of memory errors.
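If you want the hang to surface as an error instead of a silent stall, one option is to turn on NCCL debug output and fail fast. This is a sketch assuming you control process-group initialization yourself (under a Lightning-style trainer this is usually configured via the strategy instead), and the environment variable names vary slightly across PyTorch versions:

```python
# Diagnostic sketch, not part of this repo: make a stuck rank error out quickly
# with verbose NCCL logs rather than hanging until the default timeout.
import datetime
import os
import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "INFO")                    # verbose NCCL logging
os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")  # fail fast on collectives;
                                                               # older PyTorch uses NCCL_ASYNC_ERROR_HANDLING

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=30),  # shorter timeout so a stuck rank aborts sooner
)
```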