
torch.compile issue when computing features on multiple GPUs (nn.DataParallel) #889

Open
GeorgeBatch opened this issue Nov 29, 2024 · 1 comment


GeorgeBatch commented Nov 29, 2024

  • TIA Toolbox version: develop branch
  • Python version: 3.11.8
  • Operating System: linux

Description

I am computing features on multiple GPUs on the same node using DeepFeatureExtractor.
My code for extracting features is essentially the same as in the new notebook demonstrating the feature-extraction process: #887

What I Did

The nn.DataParallel wrapper built into tiatoolbox handles the multi-GPU computation. I pulled the changes that introduced torch.compile and switched from ON_GPU to using device.

I updated the argument in the DeepFeatureExtractor's predict method to use device instead of on_gpu; a sketch of the call is below.
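
For context, here is roughly what my extraction code looks like (a minimal sketch following the pattern from the #887 notebook; the model choice, paths, and exact keyword arguments are illustrative, not the definitive API):

import torch
from tiatoolbox.models.architecture.vanilla import CNNBackbone
from tiatoolbox.models.engine.semantic_segmentor import DeepFeatureExtractor

# Backbone CNN whose deep features we want to extract.
model = CNNBackbone("resnet50")
extractor = DeepFeatureExtractor(model=model, batch_size=32, num_loader_workers=4)

# Previously: extractor.predict(..., on_gpu=True)
output = extractor.predict(
    ["sample_wsi.svs"],  # illustrative input path
    mode="wsi",
    device="cuda",       # replaces the old on_gpu flag
    save_dir="features_out",
    crash_on_exception=True,
)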

The full error traceback is too long to paste, but here are some of the errors (all from a single run).

  File "/tmp/torchinductor_qun786/vv/cvvkeueuq2m4jcjzub4hcfpkhpogtc5b2xddykdgxvsxcvnpfa2w.py", line 173, in call                                               
    buf2 = extern_kernels.convolution(buf0, buf1, stride=(14, 14), padding=(0, 0), dilation=(1, 1), transposed=False, output_padding=(0, 0), groups=1, bias=Non
e)                                                                                                                                                                                                                                                                                                                
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in 
method wrapper_CUDA__cudnn_convolution)  

...

    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.

...

RuntimeError: Triton Error [CUDA]: invalid device context

From what I can gather, torch.compile does not work well with nn.DataParallel.
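
If it helps, here is a minimal standalone sketch of what I believe triggers the same failure (untested outside tiatoolbox; it assumes a node with at least two CUDA devices, and compiles first and wraps in DataParallel afterwards):

import torch
import torchvision

model = torchvision.models.resnet18()
model = torch.compile(model)            # inductor compiles kernels per device
model = torch.nn.DataParallel(model)    # replicas go to cuda:0, cuda:1, ...
model = model.to("cuda")

x = torch.randn(8, 3, 224, 224, device="cuda")
y = model(x)  # on a 2-GPU node I would expect a cross-device RuntimeError here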


GeorgeBatch commented Nov 29, 2024

Please let me know if you can reproduce the error by simply running the DeepFeatureExtractor feature-extraction code with rcParam["torch_compile_mode"] = "default" on a node with at least two devices.

Maybe nn.DistributedDataParallel would be a better option: https://pytorch.org/docs/stable/notes/cuda.html#cuda-nn-ddp-instead
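
A rough sketch of how that could look (one process per GPU, launched with torchrun; the file name and launch command are my assumptions, not tiatoolbox API):

# launch with: torchrun --nproc_per_node=2 extract_features.py
import os

import torch
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torchvision.models.resnet18().to(local_rank)
model = torch.compile(model)  # each process compiles for exactly one device
model = DDP(model, device_ids=[local_rank])

Because every process owns a single device, the compiled kernels never see tensors from another GPU.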

For reference, here is the current model_to implementation in tiatoolbox that does the wrapping:

def model_to(model: torch.nn.Module, device: str = "cpu") -> torch.nn.Module:
    """Transfers model to specified device e.g., "cpu" or "cuda".

    Args:
        model (torch.nn.Module):
            PyTorch defined model.
        device (str):
            Transfers model to the specified device. Default is "cpu".

    Returns:
        torch.nn.Module:
            The model after being moved to specified device.

    """
    if device != "cpu":
        # DataParallel works only for CUDA
        model = torch.nn.DataParallel(model)

    device = torch.device(device)
    return model.to(device)
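
One possible workaround (just an idea, not tested): skip the DataParallel wrap when the model has already been passed through torch.compile, since compiled modules are instances of torch._dynamo.eval_frame.OptimizedModule:

import torch

def model_to(model: torch.nn.Module, device: str = "cpu") -> torch.nn.Module:
    """Transfer model to a device, skipping DataParallel for compiled models."""
    is_compiled = isinstance(model, torch._dynamo.eval_frame.OptimizedModule)
    if device != "cpu" and not is_compiled:
        # DataParallel works only for CUDA and appears to conflict with torch.compile
        model = torch.nn.DataParallel(model)

    return model.to(torch.device(device))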
