Issues with Training using SLURM/Distributed Learning #2075
Unanswered · dslisleedh asked this question in General · 1 comment · 2 replies
-
Hello, I would like to express my gratitude for your exceptional work. However, I encountered some difficulties when training with SLURM and distributed learning.

I opted not to use the training code from timm. My primary training code was based on BasicSR, and I only imported a few layers from timm (`LayerNorm2d` and `DropPath`). During training, the following warning appeared:

`UserWarning: Grad strides do not match bucket view strides.`

Consequently, the model's performance was poor, as one would expect. Given that my model is fully convolutional and doesn't involve memory rearrangement operations, this initially puzzled me. After some investigation, I discovered that `LayerNorm2d` in timm uses `permute` without `contiguous`. To address this, I modified my code to use a custom `LayerNorm2d`, as shown below; this change resolved the problem. Although I am not certain how timm addresses this issue while supporting SLURM training, I wanted to share my solution for those who, like me, only use layers from timm.
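(The code block from the original post was not preserved in this transcript. A minimal sketch of the kind of layer described, assuming the fix is simply to call `.contiguous()` after each `permute`; the class name mirrors timm's, but the body and defaults here are illustrative:)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerNorm2d(nn.Module):
    """LayerNorm over the channel dim of an NCHW tensor.

    Sketch of the workaround described above: permute to NHWC,
    normalize, permute back, calling .contiguous() after each
    permute so gradient strides match DDP's bucket view strides.
    """

    def __init__(self, num_channels: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.permute(0, 2, 3, 1).contiguous()  # NCHW -> NHWC, force contiguous layout
        x = F.layer_norm(x, (x.shape[-1],), self.weight, self.bias, self.eps)
        return x.permute(0, 3, 1, 2).contiguous()  # NHWC -> NCHW, force contiguous layout
```

As the reply below notes, the extra copies from `.contiguous()` can cost significant throughput.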
-
@dslisleedh I'm not so sure your solution will end up faster, based on tests I performed in the past. Warning aside, I'd be curious to see the actual throughput numbers for the two options.

EDIT: A quick check on a convnext model running in convolutional mode: the norm impl you have above (which I think was close to the original impl for convnext) yields a throughput of just over 800 im/sec on a local distributed train test and does not produce the stride bucket warning. My current implementation (with the warning) reaches 1500 im/sec.
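(For readers who want to sanity-check this kind of comparison on their own hardware, here is a rough single-process sketch — not the distributed train test described above, whose numbers also reflect the full model and DDP overhead. The helper names, shapes, batch size, and iteration counts are all illustrative:)

```python
import time
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"


def norm_contiguous(x, w, b, eps=1e-6):
    # permute + contiguous on both sides (the workaround from the post above)
    y = x.permute(0, 2, 3, 1).contiguous()
    y = F.layer_norm(y, (y.shape[-1],), w, b, eps)
    return y.permute(0, 3, 1, 2).contiguous()


def norm_permute_only(x, w, b, eps=1e-6):
    # permute without contiguous (the faster variant that triggers the warning)
    y = x.permute(0, 2, 3, 1)
    y = F.layer_norm(y, (y.shape[-1],), w, b, eps)
    return y.permute(0, 3, 1, 2)


def imgs_per_sec(fn, batch=64, channels=96, hw=56, iters=50):
    x = torch.randn(batch, channels, hw, hw, device=device, requires_grad=True)
    w = torch.ones(channels, device=device, requires_grad=True)
    b = torch.zeros(channels, device=device, requires_grad=True)
    for _ in range(5):  # warmup
        fn(x, w, b).sum().backward()
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(x, w, b).sum().backward()
    if device == "cuda":
        torch.cuda.synchronize()
    return batch * iters / (time.perf_counter() - start)


print("contiguous  :", imgs_per_sec(norm_contiguous))
print("permute only:", imgs_per_sec(norm_permute_only))
```

Note that a single-process test like this captures only the cost of the extra copies; it cannot reproduce the DDP gradient-bucketing behavior that the stride warning is actually about.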