Bug in tools/train.py #59

leonmakise · 2020-06-05T18:30:59Z

Single-GPU training works, but distributed training fails no matter how many GPUs I specify.

Line 109 in 4bc605e

    
           self.model = nn.parallel.DistributedDataParallel(self.model, device_ids=[args.local_rank],

It should be self.model.cuda() rather than self.model.

It works when I change this line.
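One reading of the change above, as a minimal sketch rather than the repository's actual code: move the whole model onto this process's GPU first, then pass the moved model to DistributedDataParallel. The helper names `move_to_single_device` and `wrap_for_ddp` are hypothetical, and `wrap_for_ddp` assumes the process group has already been initialized by the launcher.

```python
import torch
import torch.nn as nn


def move_to_single_device(model: nn.Module, device: torch.device) -> nn.Module:
    """Move the module to `device` and fail early if any parameter stayed elsewhere."""
    # For a CUDA device this has the same effect as model.cuda(device.index).
    model = model.to(device)
    stray = [name for name, p in model.named_parameters() if p.device != device]
    if stray:
        raise RuntimeError(f"parameters not on {device}: {stray}")
    return model


def wrap_for_ddp(model: nn.Module, local_rank: int) -> nn.Module:
    """Hypothetical helper: wrap a model for one-GPU-per-process training.

    Assumes torch.distributed.init_process_group() has already been called
    and that a CUDA device with index `local_rank` is visible.
    """
    device = torch.device("cuda", local_rank)  # explicit index, so device comparison is exact
    model = move_to_single_device(model, device)
    return nn.parallel.DistributedDataParallel(
        model,
        device_ids=[local_rank],
        output_device=local_rank,
        find_unused_parameters=True,
    )
```

The device check mirrors the condition that DistributedDataParallel asserts when `device_ids` is given: every parameter of the module must sit on one CUDA device.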

This is the error output I got with the original code:
(faceparsing) mjq@amax:~/SegmenTron$ CUDA_VISIBLE_DEVICES=0,7 ./tools/dist_train.sh ${CONFIG_FILE} configs/pascal_voc_deeplabv3_plus.yaml ${GPU_NUM} 2

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

2020-06-06 02:21:55,815 Segmentron INFO: Using 2 GPUs
2020-06-06 02:21:55,816 Segmentron INFO: Namespace(config_file='configs/pascal_voc_deeplabv3_plus.yaml', device='cuda', distributed=True, input_img='tools/demo_vis.png', local_rank=0, log_iter=10, no_cuda=False, num_gpus=2, opts=[], resume=None, skip_val=False, val_epoch=1)
2020-06-06 02:21:55,816 Segmentron INFO: {
"SEED": 1024,
"TIME_STAMP": "2020-06-06-02-21",
"ROOT_PATH": "/data1/mjq/SegmenTron",
"PHASE": "train",
"DATASET": {
"NAME": "pascal_voc",
"MEAN": [
0.5,
0.5,
0.5
],
"STD": [
0.5,
0.5,
0.5
],
"IGNORE_INDEX": -1,
"WORKERS": 4,
"MODE": "val"
},
"AUG": {
"MIRROR": true,
"BLUR_PROB": 0.0,
"BLUR_RADIUS": 0.0,
"COLOR_JITTER": null
},
"TRAIN": {
"EPOCHS": 50,
"BATCH_SIZE": 4,
"CROP_SIZE": 480,
"BASE_SIZE": 520,
"MODEL_SAVE_DIR": "runs/checkpoints/",
"LOG_SAVE_DIR": "runs/logs/",
"PRETRAINED_MODEL_PATH": "",
"BACKBONE_PRETRAINED": true,
"BACKBONE_PRETRAINED_PATH": "",
"RESUME_MODEL_PATH": "",
"SYNC_BATCH_NORM": true,
"SNAPSHOT_EPOCH": 10
},
"SOLVER": {
"LR": 0.0001,
"OPTIMIZER": "sgd",
"EPSILON": 1e-08,
"MOMENTUM": 0.9,
"WEIGHT_DECAY": 0.0001,
"DECODER_LR_FACTOR": 10.0,
"LR_SCHEDULER": "poly",
"POLY": {
"POWER": 0.9
},
"STEP": {
"GAMMA": 0.1,
"DECAY_EPOCH": [
10,
20
]
},
"WARMUP": {
"EPOCHS": 0.0,
"FACTOR": 0.3333333333333333,
"METHOD": "linear"
},
"OHEM": false,
"AUX": false,
"AUX_WEIGHT": 0.4,
"LOSS_NAME": ""
},
"TEST": {
"TEST_MODEL_PATH": "",
"BATCH_SIZE": 8,
"CROP_SIZE": null,
"SCALES": [
1.0
],
"FLIP": false
},
"VISUAL": {
"OUTPUT_DIR": "../runs/visual/"
},
"MODEL": {
"MODEL_NAME": "DeepLabV3_Plus",
"BACKBONE": "xception65",
"BACKBONE_SCALE": 1.0,
"MULTI_LOSS_WEIGHT": [
1.0
],
"DEFAULT_GROUP_NUMBER": 32,
"DEFAULT_EPSILON": 1e-05,
"BN_TYPE": "BN",
"BN_EPS_FOR_ENCODER": 0.001,
"BN_EPS_FOR_DECODER": null,
"OUTPUT_STRIDE": 16,
"BN_MOMENTUM": null,
"DEEPLABV3_PLUS": {
"USE_ASPP": true,
"ENABLE_DECODER": true,
"ASPP_WITH_SEP_CONV": true,
"DECODER_USE_SEP_CONV": true
},
"CCNET": {
"RECURRENCE": 2
}
}
}
Found 1464 images in the folder datasets/voc/VOC2012
Found 1464 images in the folder datasets/voc/VOC2012
Found 1449 images in the folder datasets/voc/VOC2012
Found 1449 images in the folder datasets/voc/VOC2012
2020-06-06 02:21:56,181 Segmentron INFO: load backbone pretrained model from url..
2020-06-06 02:21:56,480 Segmentron INFO:
Traceback (most recent call last):
File "./tools/train.py", line 223, in <module>
trainer = Trainer(args)
File "./tools/train.py", line 112, in __init__
find_unused_parameters=True)
File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 232, in __init__
).format(device_ids, output_device, {p.device for p in module.parameters()})
AssertionError: DistributedDataParallel device_ids and output_device arguments only work with single-device CUDA modules, but got device_ids [1], output_device 1, and module parameters {device(type='cuda', index=1), device(type='cpu')}.
2020-06-06 02:21:57,748 Segmentron INFO: DeepLabV3Plus flops: 413.257G input shape is [3, 1024, 2048], params: 41.055M
2020-06-06 02:21:57,776 Segmentron INFO: SyncBatchNorm is effective!
2020-06-06 02:21:57,776 Segmentron INFO: Set bn custom eps for bn in encoder: 0.001
Traceback (most recent call last):
File "./tools/train.py", line 223, in <module>
trainer = Trainer(args)
File "./tools/train.py", line 112, in __init__
find_unused_parameters=True)
File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 232, in __init__
).format(device_ids, output_device, {p.device for p in module.parameters()})
AssertionError: DistributedDataParallel device_ids and output_device arguments only work with single-device CUDA modules, but got device_ids [0], output_device 0, and module parameters {device(type='cuda', index=0), device(type='cpu')}.
Traceback (most recent call last):
File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in <module>
main()
File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/site-packages/torch/distributed/launch.py", line 259, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/data1/mjq/anaconda3/envs/faceparsing/bin/python', '-u', './tools/train.py', '--local_rank=1', '--config-file', 'configs/pascal_voc_deeplabv3_plus.yaml']' returned non-zero exit status 1.
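The AssertionError above reports parameters on both a CUDA device and the CPU. A small diagnostic sketch, assuming only standard PyTorch module APIs, can show which parameters were left behind; `params_by_device` is a hypothetical helper name, not something from the repository.

```python
from collections import defaultdict

import torch.nn as nn


def params_by_device(model: nn.Module) -> dict:
    """Group parameter names by the device they currently live on."""
    groups = defaultdict(list)
    for name, param in model.named_parameters():
        groups[str(param.device)].append(name)
    return dict(groups)


if __name__ == "__main__":
    # A freshly built module lives on the CPU, so every name lands under "cpu".
    print(params_by_device(nn.Sequential(nn.Linear(2, 2))))
```

Calling it on the model just before the DistributedDataParallel line should show more than one key in the cases that trigger the assertion.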

Thanks for your attention! @LikeLy-Journey


jiawenhao2015 · 2020-09-29T09:28:02Z

@leonmakise Hello, I hit the same error when I tried distributed training.
I saw your change, but I can't understand what it means:

It shall be self.model.cuda()

Original code:
self.model = nn.parallel.DistributedDataParallel(self.model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True)

Is your change the following?
self.model.cuda() =nn.parallel.DistributedDataParallel(self.model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True)
Thank you, looking forward to your reply.

leonmakise changed the title ~~Problem when I tried distributed training~~ Bug in tools/train.py Jun 6, 2020
