CUDA Error When Running Single GPU Experiment #161 (Open)

kevin3567 opened this issue Nov 14, 2024 · 0 comments

I have been trying to run some of the experiment training code on nvcr.io/nvidia/pytorch:23.09-py3. However, I keep getting errors regardless of which script I run. After some testing, it seems that even GPT training on a single GPU fails.

What might be the cause of this INTERNAL ASSERT FAILED error? My bash script and the log are shown below.
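
For reference, this is the kind of minimal CUDA sanity check I would run inside the same container, independent of Megatron-LM, to confirm that the driver and an fp16 kernel work at all (just a diagnostic sketch):

# Minimal CUDA sanity check (diagnostic sketch, not part of the Megatron-LM run).
import torch

print(torch.__version__)             # PyTorch build shipped in the container
print(torch.version.cuda)            # CUDA version PyTorch was built against
print(torch.cuda.is_available())     # should be True if the driver is visible
print(torch.cuda.get_device_name(0))

# Exercise a simple fp16 kernel on GPU 0, similar in spirit to the training step.
x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
y = x @ x
torch.cuda.synchronize()
print("fp16 matmul OK:", y.shape)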

Bash script executed:

### Pre-training for GPT2 125M parameter.
##

# Distributed hyperparameters.
DISTRIBUTED_ARGUMENTS="\
--nproc_per_node 1 \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6000"

# Model hyperparameters.
MODEL_ARGUMENTS="\
--num-layers 12 \
--hidden-size 768 \
--num-attention-heads 12 \
--seq-length 1024 \
--max-position-embeddings 1024"

# Training hyperparameters.
TRAINING_ARGUMENTS="\
--micro-batch-size 32 \
--global-batch-size 512 \
--train-iters ${TRAINING_STEPS} \
--lr-decay-iters ${TRAINING_STEPS} \
--lr 0.00015 \
--min-lr 0.00001 \
--lr-decay-style cosine \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--init-method-std 0.01"

DATA_PATH=my-gpt2_text_document

# NOTE: We don't train for enough tokens for the
# split to matter.
DATA_ARGUMENTS="\
--data-path ${DATA_PATH} \
--vocab-file ./ckpt_gpt/gpt2-vocab.json \
--merge-file ./ckpt_gpt/gpt2-merges.txt \
--make-vocab-size-divisible-by 1024 \
--split 969,30,1"

COMPUTE_ARGUMENTS="\
--fp16 \
--DDP-impl local"

CHECKPOINT_ARGUMENTS="\
--save-interval 2000 \
--save ./${EXP_DIR}"

EVALUATION_ARGUMENTS="\
--eval-iters 100 \
--log-interval 100 \
--eval-interval 1000"

torchrun ${DISTRIBUTED_ARGUMENTS} \
       pretrain_gpt.py \
       ${MODEL_ARGUMENTS} \
       ${TRAINING_ARGUMENTS} \
       ${DATA_ARGUMENTS} \
       ${COMPUTE_ARGUMENTS} \
       ${CHECKPOINT_ARGUMENTS} \
       ${EVALUATION_ARGUMENTS} |& tee ./${EXP_DIR}/train.log

Log file (error section):

> elasped time to build and save sample-idx mapping (seconds): 0.000692
 > building shuffle index with split [0, 51039) and [51039, 52173) ...
 > elasped time to build and save shuffle-idx mapping (seconds): 0.001060
 > loading doc-idx mapping from my-gpt2_text_document_test_indexmap_51200ns_1024sl_1234s_doc_idx.npy
 > loading sample-idx mapping from my-gpt2_text_document_test_indexmap_51200ns_1024sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from my-gpt2_text_document_test_indexmap_51200ns_1024sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.001 seconds
    total number of samples: 52174
    total number of epochs: 46
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2024-11-14 00:12:12 
done with setup ...
(min, max) time across ranks (ms):
    model-and-optimizer-setup ......................: (65.30, 65.30)
    train/valid/test-data-iterators-setup ..........: (5738.83, 5738.83)
training ...
[before the start of training step] datetime: 2024-11-14 00:12:12 
Traceback (most recent call last):
  File "/mount/Megatron-LM-stanford/pretrain_gpt.py", line 154, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider,
  File "/mount/Megatron-LM-stanford/megatron/training.py", line 147, in pretrain
    iteration = train(forward_step_func,
  File "/mount/Megatron-LM-stanford/megatron/training.py", line 712, in train
    train_step(forward_step_func,
  File "/mount/Megatron-LM-stanford/megatron/training.py", line 421, in train_step
    losses_reduced = forward_backward_func(
  File "/mount/Megatron-LM-stanford/megatron/schedules.py", line 263, in forward_backward_no_pipelining
    output_tensor = forward_step(forward_step_func, data_iterator,
  File "/mount/Megatron-LM-stanford/megatron/schedules.py", line 133, in forward_step
    output_tensor, loss_func = forward_step_func(data_iterator, model)
  File "/mount/Megatron-LM-stanford/pretrain_gpt.py", line 124, in forward_step
    output_tensor = model(tokens, position_ids, attention_mask,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mount/Megatron-LM-stanford/megatron/model/distributed.py", line 59, in forward
    return self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mount/Megatron-LM-stanford/megatron/model/module.py", line 184, in forward
    outputs = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mount/Megatron-LM-stanford/megatron/model/gpt_model.py", line 80, in forward
    lm_output = self.language_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mount/Megatron-LM-stanford/megatron/model/language_model.py", line 432, in forward
    encoder_output = self.encoder(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mount/Megatron-LM-stanford/megatron/model/transformer.py", line 1227, in forward
    hidden_states = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mount/Megatron-LM-stanford/megatron/model/transformer.py", line 739, in forward
    self.self_attention(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mount/Megatron-LM-stanford/megatron/model/transformer.py", line 601, in forward
    context_layer = self.core_attention(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mount/Megatron-LM-stanford/megatron/model/transformer.py", line 313, in forward
    attention_probs = self.attention_dropout(attention_probs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/dropout.py", line 58, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 1268, in dropout
    return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
RuntimeError: handle_0 INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/driver_api.cpp":15, please report a bug to PyTorch. 
[2024-11-14 00:12:17,743] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2892) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-14_00:12:17
  host      : 0c9f17b7c8c7
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2892)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
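
Since the traceback ends inside F.dropout on the attention probabilities, one way to narrow this down might be to run just that operation in isolation. A hypothetical minimal repro along these lines (shapes loosely based on the 125M config, with a small batch to keep the allocation modest):

# Hypothetical minimal repro: dropout on an fp16 CUDA tensor, as in core attention.
import torch
import torch.nn.functional as F

# Roughly (batch * heads, seq, seq); smaller batch than in the real run.
attention_probs = torch.rand(2 * 12, 1024, 1024, device="cuda", dtype=torch.float16)
out = F.dropout(attention_probs, p=0.1, training=True)
torch.cuda.synchronize()
print("dropout OK:", out.shape)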