How to run in multi-card #8

Open
WEIYanbin1999 opened this issue Dec 28, 2024 · 2 comments
WEIYanbin1999 commented Dec 28, 2024

Dear Authors,
I noticed that the multi-card launch is commented out in your training command. Does this code repository support multi-card training? How can I enable it?


WEIYanbin1999 commented Dec 28, 2024

I tried to use the command:

MODEL_PATH=gpt2-large

# Positional arguments: $1 = dataset name, $2 = weight decay, $3 = number of layers, $4 = GPU id(s)
DATASET=data/$1/
WEIGHT_DECAY=$2
N_LAYERS=$3
GPU=$4

OUTPUT_DIR=checkpoint/$1_$2_$3

# Original single-GPU launch:
# CUDA_VISIBLE_DEVICES=$GPU python main.py \
# Replaced with a 4-process DDP launch, one process per GPU:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 --master_port 12345 main.py \
--data_dir $DATASET \
--model_name_or_path ${MODEL_PATH} \
--weight_decay $WEIGHT_DECAY \
--output_dir $OUTPUT_DIR \
--max_seq_length 10 \
--max_length 10 \
--block_size 10 \
--train_batch_size 512 \
--eval_batch_size 512 \
--learning_rate 1e-4 \
--gradient_accumulation_steps 1 \
--save_step 50000 \
--save_step_dense 40000 \
--max_steps 1500000 \
--do_train \
--scheduler constant_schedule_with_warmup \
--fp16 \
--evaluate_during_training \
--predict_during_training \
--add_tokens \
--n_layer $N_LAYERS

But it got stuck at:
[screenshot of where it hangs]
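In case it is useful context (a hedged sketch, not code from this repository): torch.distributed.launch starts nproc_per_node copies of main.py and passes each one a --local_rank argument, and every copy has to parse that argument, call init_process_group, and build the identical model before wrapping it in DistributedDataParallel. If main.py was written for single-GPU use only, any of these steps may be missing, which can produce exactly this kind of hang. The names below (the toy model, the flag handling) are illustrative only:

# Hedged sketch of the boilerplate a main.py launched via
# torch.distributed.launch typically needs; not the repo's actual code.
import argparse

import torch
import torch.distributed as dist
import torch.nn as nn

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to each worker process
parser.add_argument("--local_rank", type=int, default=-1)
args, _ = parser.parse_known_args()

distributed = args.local_rank != -1
if distributed:
    torch.cuda.set_device(args.local_rank)
    # Reads MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE set by the launcher
    dist.init_process_group(backend="nccl", init_method="env://")

device = torch.device("cuda", args.local_rank if distributed else 0)

# Stand-in for the repo's model construction; every rank must build the
# exact same architecture before the DDP wrap, or parameter counts diverge.
model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2)).to(device)

if distributed:
    model = nn.parallel.DistributedDataParallel(
        model, device_ids=[args.local_rank], output_device=args.local_rank
    )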

@WEIYanbin1999
Copy link
Author

WEIYanbin1999 commented Dec 28, 2024

It finally exited with a ChildFailedError:

RuntimeError: DDP expects same model across all ranks, but Rank 1 has 436 params, while rank 0 has inconsistent 0 params.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 2043775) of binary: /data/home/weiyb/miniconda3/envs/grokking/bin/python
Traceback (most recent call last):
  File "/data/home/weiyb/miniconda3/envs/grokking/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/home/weiyb/miniconda3/envs/grokking/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/home/weiyb/miniconda3/envs/grokking/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/data/home/weiyb/miniconda3/envs/grokking/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/data/home/weiyb/miniconda3/envs/grokking/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/data/home/weiyb/miniconda3/envs/grokking/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/data/home/weiyb/miniconda3/envs/grokking/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/home/weiyb/miniconda3/envs/grokking/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
main.py FAILED
--------------------------------------------------------
Failures:
[1]:
  time      : 2024-12-28_18:45:58
  host      : ps
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 2043776)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2043776
[2]:
  time      : 2024-12-28_18:45:58
  host      : ps
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 2043777)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 2043777
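
For anyone hitting the same error: the RuntimeError above means rank 0 reached the DDP constructor with a model that had 0 parameter tensors while rank 1 had 436, i.e. the ranks did not build the same model (for example, if only one rank runs the --add_tokens / embedding-resizing path, or the model is constructed after the DDP wrap on some ranks). A hedged diagnostic, not specific to this repo, is to print the parameter count on every rank right before wrapping:

# Hedged diagnostic sketch: call this on every rank immediately before the
# model is wrapped in DistributedDataParallel.
import torch
import torch.distributed as dist


def report_param_count(model: torch.nn.Module) -> None:
    """Print how many parameter tensors this rank's model has, so a
    rank-to-rank mismatch (like 436 vs. 0) is visible in the logs."""
    n_tensors = sum(1 for _ in model.parameters())
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"[rank {rank}] model has {n_tensors} parameter tensors", flush=True)

If the counts differ, the fix is to make every rank run the same model-construction path (including token addition and embedding resizing) before the DDP wrap.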
