-
I am testing GPT2 model training using TransformerLayer. Training slows down significantly when sequence parallelism is enabled. Do you have any recommendation to achieve better throughput with sequence parallelism? The model is ~4.3B with 12 layers.
Replies: 8 comments
-
@ksivaman Could you take a look?
-
Which toolkit (e.g. NeMo) are you using to enable sequence parallelism? Does that toolkit support sequence parallelism? Simply passing `sequence_parallel=True` to TE's `TransformerLayer` does not split the input; the layer expects each rank to receive only its local slice of the sequence, and the surrounding toolkit has to perform that split. If this is the case, an unsplit input would be gathered inside the layer into a sequence several times larger than intended, which could explain the slowdown.
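For reference, a minimal sketch of the kind of input split the toolkit is expected to do, assuming a `[seq, batch, hidden]` input layout and an already-initialized tensor-parallel process group (`tp_group`, `full_input`, and `transformer_layer` below are placeholder names, not TE APIs):

```python
import torch
import torch.distributed as dist

def shard_sequence(x: torch.Tensor, tp_group) -> torch.Tensor:
    """Keep only this rank's slice of the sequence dimension (dim 0)."""
    tp_size = dist.get_world_size(group=tp_group)
    tp_rank = dist.get_rank(group=tp_group)
    assert x.size(0) % tp_size == 0, "sequence length must be divisible by the TP size"
    return x.chunk(tp_size, dim=0)[tp_rank].contiguous()

# local_input = shard_sequence(full_input, tp_group)
# out = transformer_layer(local_input)  # layer constructed with sequence_parallel=True
```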
-
Could you confirm if this is the issue? Either way, I will make a change soon that catches this behavior and reports an error.
-
Yes, thanks for the help!
-
Facing the same error. How can I set the `sequence-parallel` arg correctly?
-
It's not that the `sequence-parallel` arg does not have to be passed into TE (it must be), but the underlying toolkit, i.e. NeMo, must also be aware of SP being used so that it can split the input. This can be done by setting the corresponding `sequence-parallel` arg in NeMo.
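As a rough illustration of how the two sides line up (argument and option names below are taken from TE's Python API and NeMo's Megatron-GPT config as I understand them; the layer sizes are illustrative and `tp_group` is assumed to be an existing tensor-parallel process group):

```python
import transformer_engine.pytorch as te

# TE side: the layer itself must be told that sequence parallelism is in use.
layer = te.TransformerLayer(
    hidden_size=4096,
    ffn_hidden_size=16384,
    num_attention_heads=32,
    set_parallel_mode=True,   # enable tensor parallelism inside TE
    sequence_parallel=True,   # inputs/outputs are sharded along the sequence dim
    tp_group=tp_group,        # assumed: already-created tensor-parallel group
)

# NeMo side: the framework must also split the input before it reaches TE,
# e.g. via the corresponding override on the Megatron-GPT config:
#   model.sequence_parallel=True
```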
-
I am still seeing a performance drop when sequence-parallel is enabled, even after sharding the input during the forward pass before passing it to TE's TransformerLayer. Do you have any recommendation? What should be the optimal configuration of the relevant environment variables?
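In case it helps narrow things down, here is a small sanity check one could run inside the training step (a sketch only; `local_input`, `seq_len`, and `tp_group` are placeholders for whatever your loop uses):

```python
import os
import torch.distributed as dist

def check_sp_setup(local_input, seq_len, tp_group):
    tp_size = dist.get_world_size(group=tp_group)
    # With sequence parallelism, each TP rank should only see seq_len / tp_size tokens.
    expected = seq_len // tp_size
    assert local_input.size(0) == expected, (
        f"expected a sequence shard of {expected}, got {local_input.size(0)}"
    )
    # Megatron-style sequence parallelism is typically launched with this set to 1
    # so communication can overlap with compute; verify it matches your launch script.
    print("CUDA_DEVICE_MAX_CONNECTIONS =", os.environ.get("CUDA_DEVICE_MAX_CONNECTIONS"))
```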