Issues with Training using SLURM/Distributed Learning #2075
Unanswered · dslisleedh asked this question in General · 1 comment · 2 replies
-
Hello, I would like to express my gratitude for your exceptional work. However, I encountered some difficulties when training with SLURM and distributed learning.

I opted not to use the training code from timm. My primary training code was based on BasicSR, and I only imported a few layers from timm (`LayerNorm2d` and `DropPath`). During training, the following warning appeared:

`UserWarning: Grad strides do not match bucket view strides.`

Consequently, the model's performance was poor, as one would expect. Given that my model is fully convolutional and doesn't involve memory rearrangement operations, this initially puzzled me. After some investigation, I discovered that `LayerNorm2d` in timm uses `permute` without `contiguous`. To address this, I modified my code to use a custom `LayerNorm2d`, as shown below; this change resolved the problem. Although I am not certain how timm addresses this issue while supporting SLURM training, I wanted to share my solution for those who, like me, only use layers from timm.
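(The code block from the original post was not preserved in this transcript. A minimal sketch of the kind of layer described, assuming the fix is simply to call `.contiguous()` after each `permute`; the class name mirrors timm's, but the body and defaults here are illustrative:)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerNorm2d(nn.Module):
    """LayerNorm over the channel dim of an NCHW tensor.

    Sketch of the workaround described above: permute to NHWC,
    normalize, permute back, calling .contiguous() after each
    permute so gradient strides match DDP's bucket view strides.
    """

    def __init__(self, num_channels: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.permute(0, 2, 3, 1).contiguous()  # NCHW -> NHWC, force contiguous layout
        x = F.layer_norm(x, (x.shape[-1],), self.weight, self.bias, self.eps)
        return x.permute(0, 3, 1, 2).contiguous()  # NHWC -> NCHW, force contiguous layout
```

As the reply below notes, the extra copies from `.contiguous()` can cost significant throughput.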
-
@dslisleedh I'm not so sure your solution will end up faster, based on tests I performed in the past. Warning aside, I'd be curious to see the actual throughput numbers for the two options.

EDIT: A quick check on a convnext model running in convolutional mode: the norm impl you have above (which I think was close to the original impl for convnext) yields a throughput of just over 800 im/sec on a local distributed train test and does not produce the stride bucket warning. My current implementation (with the warning) reaches 1500 im/sec.
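(For readers who want to sanity-check this kind of comparison on their own hardware, here is a rough single-process sketch — not the distributed train test described above, whose numbers also reflect the full model and DDP overhead. The helper names, shapes, batch size, and iteration counts are all illustrative:)

```python
import time
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"


def norm_contiguous(x, w, b, eps=1e-6):
    # permute + contiguous on both sides (the workaround from the post above)
    y = x.permute(0, 2, 3, 1).contiguous()
    y = F.layer_norm(y, (y.shape[-1],), w, b, eps)
    return y.permute(0, 3, 1, 2).contiguous()


def norm_permute_only(x, w, b, eps=1e-6):
    # permute without contiguous (the faster variant that triggers the warning)
    y = x.permute(0, 2, 3, 1)
    y = F.layer_norm(y, (y.shape[-1],), w, b, eps)
    return y.permute(0, 3, 1, 2)


def imgs_per_sec(fn, batch=64, channels=96, hw=56, iters=50):
    x = torch.randn(batch, channels, hw, hw, device=device, requires_grad=True)
    w = torch.ones(channels, device=device, requires_grad=True)
    b = torch.zeros(channels, device=device, requires_grad=True)
    for _ in range(5):  # warmup
        fn(x, w, b).sum().backward()
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(x, w, b).sum().backward()
    if device == "cuda":
        torch.cuda.synchronize()
    return batch * iters / (time.perf_counter() - start)


print("contiguous  :", imgs_per_sec(norm_contiguous))
print("permute only:", imgs_per_sec(norm_permute_only))
```

Note that a single-process test like this captures only the cost of the extra copies; it cannot reproduce the DDP gradient-bucketing behavior that the stride warning is actually about.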