-
I am testing GPT2 model training using TransformerLayer. Training slows down significantly when sequence parallelism is enabled. Do you have any recommendation to achieve better throughput with sequence parallelism? The model is ~4.3B with 12 layers.
Replies: 8 comments
-
@ksivaman Could you take a look?
-
Which toolkit (e.g. NeMo) are you using to enable sequence parallelism? Does that toolkit support sequence parallelism? Simply passing `sequence_parallel=True` to TE's `TransformerLayer` does not split the input; the layer expects each rank to receive only its local slice of the sequence, and the surrounding toolkit has to perform that split. If this is the case, an unsplit input would be gathered inside the layer into a sequence several times larger than intended, which could explain the slowdown.
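For reference, a minimal sketch of the kind of input split the toolkit is expected to do, assuming a `[seq, batch, hidden]` input layout and an already-initialized tensor-parallel process group (`tp_group`, `full_input`, and `transformer_layer` below are placeholder names, not TE APIs):

```python
import torch
import torch.distributed as dist

def shard_sequence(x: torch.Tensor, tp_group) -> torch.Tensor:
    """Keep only this rank's slice of the sequence dimension (dim 0)."""
    tp_size = dist.get_world_size(group=tp_group)
    tp_rank = dist.get_rank(group=tp_group)
    assert x.size(0) % tp_size == 0, "sequence length must be divisible by the TP size"
    return x.chunk(tp_size, dim=0)[tp_rank].contiguous()

# local_input = shard_sequence(full_input, tp_group)
# out = transformer_layer(local_input)  # layer constructed with sequence_parallel=True
```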
-
Could you confirm if this is the issue? Either way, I will make a change soon that catches this behavior and reports an error.
-
Yes, thanks for the help!
-
Facing the same error. How can I set the `sequence-parallel` arg correctly?
-
It's not that the `sequence-parallel` arg does not have to be passed into TE (it must be), but the underlying toolkit, i.e. NeMo, must also be aware of SP being used so that it can split the input. This can be done by setting the corresponding `sequence-parallel` arg in NeMo.
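As a rough illustration of how the two sides line up (argument and option names below are taken from TE's Python API and NeMo's Megatron-GPT config as I understand them; the layer sizes are illustrative and `tp_group` is assumed to be an existing tensor-parallel process group):

```python
import transformer_engine.pytorch as te

# TE side: the layer itself must be told that sequence parallelism is in use.
layer = te.TransformerLayer(
    hidden_size=4096,
    ffn_hidden_size=16384,
    num_attention_heads=32,
    set_parallel_mode=True,   # enable tensor parallelism inside TE
    sequence_parallel=True,   # inputs/outputs are sharded along the sequence dim
    tp_group=tp_group,        # assumed: already-created tensor-parallel group
)

# NeMo side: the framework must also split the input before it reaches TE,
# e.g. via the corresponding override on the Megatron-GPT config:
#   model.sequence_parallel=True
```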
-
I am still seeing a performance drop when sequence-parallel is enabled, even after sharding the input during the forward pass before passing it to TE's TransformerLayer. Do you have any recommendation? What should be the optimal configuration of the relevant environment variables?
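In case it helps narrow things down, here is a small sanity check one could run inside the training step (a sketch only; `local_input`, `seq_len`, and `tp_group` are placeholders for whatever your loop uses):

```python
import os
import torch.distributed as dist

def check_sp_setup(local_input, seq_len, tp_group):
    tp_size = dist.get_world_size(group=tp_group)
    # With sequence parallelism, each TP rank should only see seq_len / tp_size tokens.
    expected = seq_len // tp_size
    assert local_input.size(0) == expected, (
        f"expected a sequence shard of {expected}, got {local_input.size(0)}"
    )
    # Megatron-style sequence parallelism is typically launched with this set to 1
    # so communication can overlap with compute; verify it matches your launch script.
    print("CUDA_DEVICE_MAX_CONNECTIONS =", os.environ.get("CUDA_DEVICE_MAX_CONNECTIONS"))
```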