How to run in multi-card #8

Open
WEIYanbin1999 opened this issue Dec 28, 2024 · 2 comments
WEIYanbin1999 commented Dec 28, 2024

Dear Authors,
I noticed that the multi-card launch is commented out in your training command. Does this code repository support multi-card training? How can I enable it?


WEIYanbin1999 commented Dec 28, 2024

I tried to use the command:

MODEL_PATH=gpt2-large

# Positional arguments: $1 = dataset name, $2 = weight decay, $3 = number of layers, $4 = GPU id(s)
DATASET=data/$1/
WEIGHT_DECAY=$2
N_LAYERS=$3
GPU=$4

OUTPUT_DIR=checkpoint/$1_$2_$3

# Original single-GPU launch:
# CUDA_VISIBLE_DEVICES=$GPU python main.py \
# Replaced with a 4-process DDP launch, one process per GPU:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 --master_port 12345 main.py \
--data_dir $DATASET \
--model_name_or_path ${MODEL_PATH} \
--weight_decay $WEIGHT_DECAY \
--output_dir $OUTPUT_DIR \
--max_seq_length 10 \
--max_length 10 \
--block_size 10 \
--train_batch_size 512 \
--eval_batch_size 512 \
--learning_rate 1e-4 \
--gradient_accumulation_steps 1 \
--save_step 50000 \
--save_step_dense 40000 \
--max_steps 1500000 \
--do_train \
--scheduler constant_schedule_with_warmup \
--fp16 \
--evaluate_during_training \
--predict_during_training \
--add_tokens \
--n_layer $N_LAYERS

But it got stuck at:
[screenshot of where it hangs]
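In case it is useful context (a hedged sketch, not code from this repository): torch.distributed.launch starts nproc_per_node copies of main.py and passes each one a --local_rank argument, and every copy has to parse that argument, call init_process_group, and build the identical model before wrapping it in DistributedDataParallel. If main.py was written for single-GPU use only, any of these steps may be missing, which can produce exactly this kind of hang. The names below (the toy model, the flag handling) are illustrative only:

# Hedged sketch of the boilerplate a main.py launched via
# torch.distributed.launch typically needs; not the repo's actual code.
import argparse

import torch
import torch.distributed as dist
import torch.nn as nn

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to each worker process
parser.add_argument("--local_rank", type=int, default=-1)
args, _ = parser.parse_known_args()

distributed = args.local_rank != -1
if distributed:
    torch.cuda.set_device(args.local_rank)
    # Reads MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE set by the launcher
    dist.init_process_group(backend="nccl", init_method="env://")

device = torch.device("cuda", args.local_rank if distributed else 0)

# Stand-in for the repo's model construction; every rank must build the
# exact same architecture before the DDP wrap, or parameter counts diverge.
model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2)).to(device)

if distributed:
    model = nn.parallel.DistributedDataParallel(
        model, device_ids=[args.local_rank], output_device=args.local_rank
    )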

@WEIYanbin1999
Copy link
Author

WEIYanbin1999 commented Dec 28, 2024

It finally exited with a ChildFailedError:

RuntimeError: DDP expects same model across all ranks, but Rank 1 has 436 params, while rank 0 has inconsistent 0 params.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 2043775) of binary: /data/home/weiyb/miniconda3/envs/grokking/bin/python
Traceback (most recent call last):
  File "/data/home/weiyb/miniconda3/envs/grokking/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/home/weiyb/miniconda3/envs/grokking/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/home/weiyb/miniconda3/envs/grokking/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/data/home/weiyb/miniconda3/envs/grokking/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/data/home/weiyb/miniconda3/envs/grokking/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/data/home/weiyb/miniconda3/envs/grokking/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/data/home/weiyb/miniconda3/envs/grokking/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/home/weiyb/miniconda3/envs/grokking/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
main.py FAILED
--------------------------------------------------------
Failures:
[1]:
  time      : 2024-12-28_18:45:58
  host      : ps
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 2043776)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2043776
[2]:
  time      : 2024-12-28_18:45:58
  host      : ps
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 2043777)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2043777
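
For anyone hitting the same error: the RuntimeError above means rank 0 reached the DDP constructor with a model that had 0 parameter tensors while rank 1 had 436, i.e. the ranks did not build the same model (for example, if only one rank runs the --add_tokens / embedding-resizing path, or the model is constructed after the DDP wrap on some ranks). A hedged diagnostic, not specific to this repo, is to print the parameter count on every rank right before wrapping:

# Hedged diagnostic sketch: call this on every rank immediately before the
# model is wrapped in DistributedDataParallel.
import torch
import torch.distributed as dist


def report_param_count(model: torch.nn.Module) -> None:
    """Print how many parameter tensors this rank's model has, so a
    rank-to-rank mismatch (like 436 vs. 0) is visible in the logs."""
    n_tensors = sum(1 for _ in model.parameters())
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"[rank {rank}] model has {n_tensors} parameter tensors", flush=True)

If the counts differ, the fix is to make every rank run the same model-construction path (including token addition and embedding resizing) before the DDP wrap.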
