I have been trying to run some of the exp training code on nvcr.io/nvidia/pytorch:23.09-py3. However, I keep getting errors regardless of the script. After some testing, it seems that even running GPT training on a single GPU causes an error.
What might be the cause of this INTERNAL ASSERT FAILED (shown below)? My bash script and logs are shown below.
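(For isolation: the failure below happens inside F.dropout on the attention probabilities, so a bare dropout call on the GPU should exercise the same code path. A minimal probe along these lines, not part of my actual script, should show whether the assert is specific to Megatron-LM or to the container/driver setup:)

# Minimal probe (sketch): run F.dropout on a CUDA tensor, the same call that
# fails in megatron/model/transformer.py's core attention in the log below.
# If this also raises the driver_api.cpp INTERNAL ASSERT, the problem is
# environmental rather than in the training scripts.
import torch
import torch.nn.functional as F

x = torch.randn(8, 1024, 1024, device="cuda")
y = F.dropout(x, p=0.1, training=True)
torch.cuda.synchronize()
print("dropout ok:", tuple(y.shape))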
> elasped time to build and save sample-idx mapping (seconds): 0.000692
> building shuffle index with split [0, 51039) and [51039, 52173) ...
> elasped time to build and save shuffle-idx mapping (seconds): 0.001060
> loading doc-idx mapping from my-gpt2_text_document_test_indexmap_51200ns_1024sl_1234s_doc_idx.npy
> loading sample-idx mapping from my-gpt2_text_document_test_indexmap_51200ns_1024sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from my-gpt2_text_document_test_indexmap_51200ns_1024sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 52174
total number of epochs: 46
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2024-11-14 00:12:12
done with setup ...
(min, max) time across ranks (ms):
model-and-optimizer-setup ......................: (65.30, 65.30)
train/valid/test-data-iterators-setup ..........: (5738.83, 5738.83)
training ...
[before the start of training step] datetime: 2024-11-14 00:12:12
Traceback (most recent call last):
File "/mount/Megatron-LM-stanford/pretrain_gpt.py", line 154, in <module>
pretrain(train_valid_test_datasets_provider, model_provider,
File "/mount/Megatron-LM-stanford/megatron/training.py", line 147, in pretrain
iteration = train(forward_step_func,
File "/mount/Megatron-LM-stanford/megatron/training.py", line 712, in train
train_step(forward_step_func,
File "/mount/Megatron-LM-stanford/megatron/training.py", line 421, in train_step
losses_reduced = forward_backward_func(
File "/mount/Megatron-LM-stanford/megatron/schedules.py", line 263, in forward_backward_no_pipelining
output_tensor = forward_step(forward_step_func, data_iterator,
File "/mount/Megatron-LM-stanford/megatron/schedules.py", line 133, in forward_step
output_tensor, loss_func = forward_step_func(data_iterator, model)
File "/mount/Megatron-LM-stanford/pretrain_gpt.py", line 124, in forward_step
output_tensor = model(tokens, position_ids, attention_mask,
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mount/Megatron-LM-stanford/megatron/model/distributed.py", line 59, in forward
return self.module(*inputs, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mount/Megatron-LM-stanford/megatron/model/module.py", line 184, in forward
outputs = self.module(*inputs, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mount/Megatron-LM-stanford/megatron/model/gpt_model.py", line 80, in forward
lm_output = self.language_model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mount/Megatron-LM-stanford/megatron/model/language_model.py", line 432, in forward
encoder_output = self.encoder(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mount/Megatron-LM-stanford/megatron/model/transformer.py", line 1227, in forward
hidden_states = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mount/Megatron-LM-stanford/megatron/model/transformer.py", line 739, in forward
self.self_attention(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mount/Megatron-LM-stanford/megatron/model/transformer.py", line 601, in forward
context_layer = self.core_attention(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mount/Megatron-LM-stanford/megatron/model/transformer.py", line 313, in forward
attention_probs = self.attention_dropout(attention_probs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/dropout.py", line 58, in forward
return F.dropout(input, self.p, self.training, self.inplace)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 1268, in dropout
return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
RuntimeError: handle_0 INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/driver_api.cpp":15, please report a bug to PyTorch.
[2024-11-14 00:12:17,743] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2892) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-11-14_00:12:17
host : 0c9f17b7c8c7
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2892)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
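Side note: the summary above reports error_file: <N/A>. Following the URL in the log, wrapping the launched entrypoint with the record decorator should make torchrun capture the child's full traceback in an error file; a minimal sketch (the real entrypoint would be pretrain_gpt.py's call into pretrain):

# Sketch: with @record, any exception raised in main() is captured by the
# elastic agent and reported via an error file instead of "error_file: <N/A>".
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    # Replace with the real training entrypoint, e.g. Megatron-LM's pretrain(...).
    ...

if __name__ == "__main__":
    main()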