[BUG] 0.9.0 release version got param_gather_handle error with 3d parallel #1292
I found it's because the chain bucketing order is not matched with the forward path.
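A minimal, self-contained sketch of what that ordering mismatch can do. The attribute and method names mirror the snippet quoted later in the thread, but this is hypothetical illustration code, not Megatron's real bucket-group implementation: each bucket group waits on its own param all-gather and then prefetches the next group in the chain, so if the forward path visits the groups out of chain order, some bucket's all-gather gets dispatched twice.

```python
# Hypothetical illustration of chained param all-gather prefetching; not Megatron code.
class FakeBucketGroup:
    def __init__(self, name):
        self.name = name
        self.param_gather_dispatched = False
        self.param_gather_handle = None
        self.next_param_gather_bucket_group = None

    def start_param_sync(self):
        # In this sketch, dispatching the same bucket twice is the failure mode.
        assert not self.param_gather_dispatched, f"{self.name}: param AG dispatched twice"
        self.param_gather_dispatched = True
        self.param_gather_handle = "pending"  # stand-in for an async all-gather handle

    def finish_param_sync(self):
        if not self.param_gather_dispatched:
            self.start_param_sync()
        if self.param_gather_handle is not None:
            self.param_gather_handle = None  # .wait() on the real handle
            # Prefetch the next bucket group in the chain.
            if self.next_param_gather_bucket_group is not None:
                self.next_param_gather_bucket_group.start_param_sync()


b0, b1, b2 = FakeBucketGroup("b0"), FakeBucketGroup("b1"), FakeBucketGroup("b2")
b0.next_param_gather_bucket_group = b1
b1.next_param_gather_bucket_group = b2

# Forward path touches b1 before b0, i.e. out of chain order:
b1.finish_param_sync()  # dispatches b1, prefetches b2
b0.finish_param_sync()  # dispatches b0, then prefetches b1 again -> AssertionError
```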
Can you share a small reproduction script?
@deepakn94 hi Deepak, good to see you! But in this scenario, I added custom layers at the end of the transformer layer block (after this line):

```python
class TransformerLayer(MegatronModule, BaseTransformerLayer):
    def __init__(
        self,
        config: TransformerConfig,
        submodules: TransformerLayerSubmodules,
        layer_number: int = 1,
        hidden_dropout: float = None,
    ):
        ...
        # [Module 8: MLP block]
        # TODO how to set the gpt_layer_spec.py when we have moe_frequency > 1,
        # where MLP and MoE layer both appear alternately?
        self.mlp = build_module(submodules.mlp, config=self.config)
        if hasattr(self.mlp, 'set_layer_number'):
            self.mlp.set_layer_number(self.layer_number)

        # [Module 9: BiasDropoutFusion]
        self.mlp_bda = build_module(submodules.mlp_bda)

        # @jcasper how should we handle nvfuser?
        # Set bias+dropout+add fusion grad_enable execution handler.
        # TORCH_MAJOR = int(torch.__version__.split('.')[0])
        # TORCH_MINOR = int(torch.__version__.split('.')[1])
        # use_nvfuser = TORCH_MAJOR > 1 or (TORCH_MAJOR == 1 and TORCH_MINOR >= 10)
        # self.bias_dropout_add_exec_handler = nullcontext if use_nvfuser else torch.enable_grad
        self.bias_dropout_add_exec_handler = torch.enable_grad

        ## here, custom layers are added at the end of __init__
        self.attn_out_rmsnorm = ...
        self.fc2_rmsnorm = ...
```

But these layers are not forwarded in this order (see the sketch after this comment), so this code runs into the error:

```python
# If current bucket's param AG has not been dispatched, dispatch it now (e.g., first
# AG bucket in first model chunk if ddp_config.align_param_gather is False).
if not self.param_gather_dispatched:
    self.start_param_sync()

if self.param_gather_handle is not None:
    self.param_gather_handle.wait()
    self.param_gather_handle = None
    # Dispatch next bucket's asynchronous param AG.
    if self.next_param_gather_bucket_group is not None and not skip_next_bucket_dispatch:
        self.next_param_gather_bucket_group.start_param_sync()
```

So I fixed the above code snippet like this:

```python
if self.param_gather_handle is not None:
    self.param_gather_handle.wait()
    self.param_gather_handle = None
    # Dispatch next bucket's asynchronous param AG.
    if (
        self.next_param_gather_bucket_group is not None and not skip_next_bucket_dispatch
    ) and (
        not self.next_param_gather_bucket_group.param_gather_dispatched
    ):
        self.next_param_gather_bucket_group.start_param_sync()
```

If my explanation lacks information, please reply again or email me, ty!
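For reference, a hypothetical, runnable illustration (plain PyTorch, not the reporter's actual layer) of the mismatch described above: `attn_out_rmsnorm` and `fc2_rmsnorm` are registered after `mlp` in `__init__`, so bucketing that follows parameter registration order puts them in a late bucket, while `forward()` already needs `attn_out_rmsnorm` right after attention.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the customized transformer layer described above.
class CustomLayer(nn.Module):
    def __init__(self, hidden: int = 16):
        super().__init__()
        self.attn_proj = nn.Linear(hidden, hidden)  # stand-in for the attention block
        self.mlp = nn.Linear(hidden, hidden)        # stand-in for the MLP block
        # Extra modules appended after the stock modules, as in the report.
        self.attn_out_rmsnorm = nn.LayerNorm(hidden)
        self.fc2_rmsnorm = nn.LayerNorm(hidden)

    def forward(self, x):
        x = self.attn_out_rmsnorm(self.attn_proj(x))  # registered last, used early
        return self.fc2_rmsnorm(self.mlp(x))

layer = CustomLayer()
print([name for name, _ in layer.named_parameters()])
# Registration order: attn_proj, mlp, attn_out_rmsnorm, fc2_rmsnorm
# Forward order:      attn_proj, attn_out_rmsnorm, mlp, fc2_rmsnorm
out = layer(torch.randn(2, 16))
```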
It's a 4-node experiment where I used the distributed optimizer with overlap param gather and overlap grad reduce set to True, and tp=2, pp=4.
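If it helps to pin down the setup, here is a rough sketch of how those flags map onto the megatron-core DDP config. This assumes the `DistributedDataParallelConfig` dataclass from `megatron.core.distributed` exposes these fields (field names may vary between releases); tp and pp are configured separately when initializing model parallelism.

```python
from megatron.core.distributed import DistributedDataParallelConfig

# Assumed mapping of the described flags; tp=2 / pp=4 are passed to model-parallel
# initialization (tensor_model_parallel_size=2, pipeline_model_parallel_size=4), not here.
ddp_config = DistributedDataParallelConfig(
    use_distributed_optimizer=True,
    overlap_grad_reduce=True,
    overlap_param_gather=True,
)
```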
I don't know why the next linear fc2's next_param_gather_bucket_group has an async param_gather context manager ...?