SSLOnlineEvaluator does not work with DDP #953

Open
shubhamkulkarni01 opened this issue Dec 12, 2022 · 3 comments
Labels: bug, help wanted, waiting on author, won't fix


shubhamkulkarni01 commented Dec 12, 2022

🐛 Bug

In commit 6e14209185c2b2100f3e515ee6782597673bb921 on pytorch_lightning from Feb 17, the use_ddp property was removed from AcceleratorConnector.

In commit b29b07e9788311326bca4779d70e89eb36bfc57f on pytorch_lightning from Feb 27, the use_dp property was removed from AcceleratorConnector.

SSLOnlineEvaluator still checks these properties when setting up distributed training, so it now raises an AttributeError when training on multiple GPUs.
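
For illustration only, here is a minimal sketch of how a callback could guard against the missing attribute. This is not the pl_bolts implementation: `uses_ddp`, the `fallback` parameter, and both stand-in connector classes are hypothetical names made up for the example. The idea is that `getattr` with a sentinel default avoids the AttributeError and defers to some other check when the property is gone.

```python
class LegacyConnector:
    """Hypothetical stand-in for a connector that still exposes use_ddp."""

    use_ddp = True


class CurrentConnector:
    """Hypothetical stand-in for a connector after use_ddp was removed."""


_MISSING = object()  # sentinel so that a stored False is not mistaken for "absent"


def uses_ddp(connector, fallback):
    """Return connector.use_ddp if it exists, else the result of fallback()."""
    value = getattr(connector, "use_ddp", _MISSING)
    if value is _MISSING:
        # The property no longer exists: ask the caller-supplied check instead.
        return bool(fallback())
    return bool(value)
```

What the fallback should actually test (strategy class, world size, or something else) depends on the installed Lightning version and is left open here.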

To Reproduce

Steps to reproduce the behavior:

Must run on a system with 2+ GPUs attached and accessible to PyTorch.

  1. Create a pl.Trainer
  2. Attach an SSLOnlineEvaluator Callback
  3. Call trainer.fit

Code sample:

import torch
import pytorch_lightning as pl
import pl_bolts


def main():
    zdim = 2048
    bs = 8

    ds = pl_bolts.datasets.DummyDataset(
            (3, 224, 224),
            (1, ),
            num_samples = 100
    )
    dl = torch.utils.data.DataLoader(ds, batch_size=bs)

    model = pl_bolts.models.self_supervised.SimCLR(
            gpus = torch.cuda.device_count(),
            num_samples = len(ds),
            batch_size = bs,
            dataset = 'custom',
            hidden_mlp = zdim,
    )

    # fit
    trainer = pl.Trainer(
        accelerator = 'gpu',
        devices = -1,
        callbacks = [
            pl_bolts.callbacks.SSLOnlineEvaluator(
                z_dim = zdim,
                num_classes = 4, # or any other number
                hidden_dim = None,
                dataset = 'custom'
            ),
        ],
    )

    trainer.fit(model, train_dataloaders = dl)
if __name__ == '__main__':
    main()

This leads to the following error:

Traceback (most recent call last):
  File "example.py", line 41, in <module>
    main()
  File "example.py", line 39, in main
    trainer.fit(model, train_dataloaders = dl)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 604, in fit
    self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 117, in launch
    start_method=self._start_method,
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 139, in _wrapping_function
    results = function(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1083, in _run
    self._call_callback_hooks("on_fit_start")
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1380, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pl_bolts/callbacks/ssl_online.py", line 87, in on_fit_start
    if accel.use_ddp:
AttributeError: 'AcceleratorConnector' object has no attribute 'use_ddp'

Expected behavior

Environment

  • PyTorch Version (e.g., 1.0): '1.13.0+cu117'
  • Lightning version: '1.8.4.post0'
  • pl_bolts version: '0.6.0.post1'
  • OS (e.g., Linux): Docker (Ubuntu base)
  • How you installed PyTorch (conda, pip, source): PyTorch Docker image
  • Python version: 3.7.11
  • CUDA/cuDNN version: 11.7
  • GPU models and configuration: 2 A10s, 24GB VRAM each
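
The version fields above can also be collected programmatically. The following is a rough sketch, assuming Python 3.8 or newer (the environment in this report is 3.7, where `importlib.metadata` is not in the standard library) and assuming the pip distribution names `torch`, `pytorch-lightning`, and `lightning-bolts`. `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata  # standard library from Python 3.8 onwards


def installed_versions(distributions):
    """Map each distribution name to its installed version string."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of raising.
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to your setup.
    names = ["torch", "pytorch-lightning", "lightning-bolts"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version}")
```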

Additional context

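I currently have it patched on my own system as follows, using the definition of the use_ddp property from before its removal:

    from pytorch_lightning.trainer.connectors.accelerator_connector import AcceleratorConnector
    # NOTE: _StrategyType also needs to be imported; which module provides it
    # depends on the installed Lightning version.

    # use_ddp used to be a property, and ssl_online.py reads it as an attribute
    # (`if accel.use_ddp:`), so the replacement is wrapped in property().
    AcceleratorConnector.use_ddp = property(lambda self: self._strategy_type in (
            _StrategyType.BAGUA,
            _StrategyType.DDP,
            _StrategyType.DDP_SPAWN,
            _StrategyType.DDP_SHARDED,
            _StrategyType.DDP_SHARDED_SPAWN,
            _StrategyType.DDP_FULLY_SHARDED,
            _StrategyType.DEEPSPEED,
            _StrategyType.TPU_SPAWN,
        ))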
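
One caveat when patching a former property back onto a class: a plain function or lambda assigned to a class attribute becomes a method, and a bound method is always truthy, so a check like `if accel.use_ddp:` would pass regardless of the strategy. The sketch below shows the difference using a hypothetical `Connector` stand-in and an illustrative `DDP_LIKE` tuple. Neither is part of Lightning.

```python
class Connector:
    """Hypothetical stand-in for AcceleratorConnector."""

    def __init__(self, strategy_type):
        self._strategy_type = strategy_type


DDP_LIKE = ("ddp", "ddp_spawn")  # illustrative subset of strategy names

# Patched in as a plain function: attribute access yields a bound method.
Connector.use_ddp_method = lambda self: self._strategy_type in DDP_LIKE

# Patched in as a property: attribute access yields the boolean itself.
Connector.use_ddp_property = property(
    lambda self: self._strategy_type in DDP_LIKE
)

single = Connector("single_device")
ddp = Connector("ddp")

# A bound method object is truthy whatever the strategy is.
method_truthy = bool(single.use_ddp_method)
# Calling the method, or reading the property, gives the real answer.
called = single.use_ddp_method()
prop_single = single.use_ddp_property
prop_ddp = ddp.use_ddp_property
```

On a multi-GPU DDP run both variants send the callback down the DDP branch, which is probably why a method-style patch can appear to work. Only the property variant also behaves correctly on a single device.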

shubhamkulkarni01 added the "help wanted" label on Dec 12, 2022
stale bot added the "won't fix" label on Mar 18, 2023
Borda added the "fix" and "bug" labels and removed the "fix" label on Jun 20, 2023
Borda changed the title from "SSLOnlineEvaluator does not work with DDP" to "SSLOnlineEvaluator does not work with DDP" on Jun 20, 2023
Lightning-Universe deleted a comment from stale bot on Jun 20, 2023

Borda commented Jun 20, 2023

@shubhamkulkarni01, what versions of PL and Bolts are you using?

shubhamkulkarni01 commented

Lightning version: '1.8.4.post0'
pl_bolts version: '0.6.0.post1'


Borda commented Jun 30, 2023

Could you please try the latest release, 0.7.0 (published today), where we made some compatibility fixes?
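
To check whether an installed version string is at least 0.7.0, a small comparison helper is enough. This is a simplified sketch that compares only the leading numeric release segment and ignores suffixes such as `.post1`. `release_tuple` and `is_at_least` are hypothetical helper names, not part of pl_bolts.

```python
import re


def release_tuple(version):
    """Extract the leading numeric release, e.g. '0.6.0.post1' -> (0, 6, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric release in {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_at_least(installed, minimum):
    """Compare release tuples, padding the shorter one with zeros."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

With the versions from this thread, `is_at_least("0.6.0.post1", "0.7.0")` evaluates to `False`, so the reported setup predates the suggested release.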
