Bug in tools/train.py #59

Open
leonmakise opened this issue Jun 5, 2020 · 1 comment
leonmakise commented Jun 5, 2020

It worked with single-GPU training, but it failed no matter how many GPUs I assigned when I tried distributed training.

self.model = nn.parallel.DistributedDataParallel(self.model, device_ids=[args.local_rank],

It should be self.model.cuda() here.

It works after I change this line.
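
For reference, a rough sketch of what the corrected block in Trainer.__init__ would look like (self.model, args.local_rank and the DDP keyword arguments are taken from the traceback below; the torch.cuda.set_device call and the exact surrounding code are an assumption, not copied from the repository):

torch.cuda.set_device(args.local_rank)  # bind this process to its own GPU (may already happen elsewhere in train.py)
self.model = self.model.cuda()          # move every parameter to cuda:local_rank, so none is left on the CPU
self.model = nn.parallel.DistributedDataParallel(
    self.model,
    device_ids=[args.local_rank],
    output_device=args.local_rank,
    find_unused_parameters=True)

Wrapping a model whose parameters are still partly on the CPU is exactly what triggers the AssertionError shown below, since DistributedDataParallel with device_ids only accepts single-device CUDA modules.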

The following is the error message I got with the original code:
(faceparsing) mjq@amax:~/SegmenTron$ CUDA_VISIBLE_DEVICES=0,7 ./tools/dist_train.sh ${CONFIG_FILE} configs/pascal_voc_deeplabv3_plus.yaml ${GPU_NUM} 2


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


2020-06-06 02:21:55,815 Segmentron INFO: Using 2 GPUs
2020-06-06 02:21:55,816 Segmentron INFO: Namespace(config_file='configs/pascal_voc_deeplabv3_plus.yaml', device='cuda', distributed=True, input_img='tools/demo_vis.png', local_rank=0, log_iter=10, no_cuda=False, num_gpus=2, opts=[], resume=None, skip_val=False, val_epoch=1)
2020-06-06 02:21:55,816 Segmentron INFO: {
"SEED": 1024,
"TIME_STAMP": "2020-06-06-02-21",
"ROOT_PATH": "/data1/mjq/SegmenTron",
"PHASE": "train",
"DATASET": {
"NAME": "pascal_voc",
"MEAN": [
0.5,
0.5,
0.5
],
"STD": [
0.5,
0.5,
0.5
],
"IGNORE_INDEX": -1,
"WORKERS": 4,
"MODE": "val"
},
"AUG": {
"MIRROR": true,
"BLUR_PROB": 0.0,
"BLUR_RADIUS": 0.0,
"COLOR_JITTER": null
},
"TRAIN": {
"EPOCHS": 50,
"BATCH_SIZE": 4,
"CROP_SIZE": 480,
"BASE_SIZE": 520,
"MODEL_SAVE_DIR": "runs/checkpoints/",
"LOG_SAVE_DIR": "runs/logs/",
"PRETRAINED_MODEL_PATH": "",
"BACKBONE_PRETRAINED": true,
"BACKBONE_PRETRAINED_PATH": "",
"RESUME_MODEL_PATH": "",
"SYNC_BATCH_NORM": true,
"SNAPSHOT_EPOCH": 10
},
"SOLVER": {
"LR": 0.0001,
"OPTIMIZER": "sgd",
"EPSILON": 1e-08,
"MOMENTUM": 0.9,
"WEIGHT_DECAY": 0.0001,
"DECODER_LR_FACTOR": 10.0,
"LR_SCHEDULER": "poly",
"POLY": {
"POWER": 0.9
},
"STEP": {
"GAMMA": 0.1,
"DECAY_EPOCH": [
10,
20
]
},
"WARMUP": {
"EPOCHS": 0.0,
"FACTOR": 0.3333333333333333,
"METHOD": "linear"
},
"OHEM": false,
"AUX": false,
"AUX_WEIGHT": 0.4,
"LOSS_NAME": ""
},
"TEST": {
"TEST_MODEL_PATH": "",
"BATCH_SIZE": 8,
"CROP_SIZE": null,
"SCALES": [
1.0
],
"FLIP": false
},
"VISUAL": {
"OUTPUT_DIR": "../runs/visual/"
},
"MODEL": {
"MODEL_NAME": "DeepLabV3_Plus",
"BACKBONE": "xception65",
"BACKBONE_SCALE": 1.0,
"MULTI_LOSS_WEIGHT": [
1.0
],
"DEFAULT_GROUP_NUMBER": 32,
"DEFAULT_EPSILON": 1e-05,
"BN_TYPE": "BN",
"BN_EPS_FOR_ENCODER": 0.001,
"BN_EPS_FOR_DECODER": null,
"OUTPUT_STRIDE": 16,
"BN_MOMENTUM": null,
"DEEPLABV3_PLUS": {
"USE_ASPP": true,
"ENABLE_DECODER": true,
"ASPP_WITH_SEP_CONV": true,
"DECODER_USE_SEP_CONV": true
},
"CCNET": {
"RECURRENCE": 2
}
}
}
Found 1464 images in the folder datasets/voc/VOC2012
Found 1464 images in the folder datasets/voc/VOC2012
Found 1449 images in the folder datasets/voc/VOC2012
Found 1449 images in the folder datasets/voc/VOC2012
2020-06-06 02:21:56,181 Segmentron INFO: load backbone pretrained model from url..
2020-06-06 02:21:56,480 Segmentron INFO:
Traceback (most recent call last):
File "./tools/train.py", line 223, in
trainer = Trainer(args)
File "./tools/train.py", line 112, in init
find_unused_parameters=True)
File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 232, in init
).format(device_ids, output_device, {p.device for p in module.parameters()})
AssertionError: DistributedDataParallel device_ids and output_device arguments only work with single-device CUDA modules, but got device_ids [1], output_device 1, and module parameters {device(type='cuda', index=1), device(type='cpu')}.
2020-06-06 02:21:57,748 Segmentron INFO: DeepLabV3Plus flops: 413.257G input shape is [3, 1024, 2048], params: 41.055M
2020-06-06 02:21:57,776 Segmentron INFO: SyncBatchNorm is effective!
2020-06-06 02:21:57,776 Segmentron INFO: Set bn custom eps for bn in encoder: 0.001
Traceback (most recent call last):
File "./tools/train.py", line 223, in
trainer = Trainer(args)
File "./tools/train.py", line 112, in init
find_unused_parameters=True)
File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 232, in init
).format(device_ids, output_device, {p.device for p in module.parameters()})
AssertionError: DistributedDataParallel device_ids and output_device arguments only work with single-device CUDA modules, but got device_ids [0], output_device 0, and module parameters {device(type='cuda', index=0), device(type='cpu')}.
Traceback (most recent call last):
File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in
main()
File "/data1/mjq/anaconda3/envs/faceparsing/lib/python3.6/site-packages/torch/distributed/launch.py", line 259, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/data1/mjq/anaconda3/envs/faceparsing/bin/python', '-u', './tools/train.py', '--local_rank=1', '--config-file', 'configs/pascal_voc_deeplabv3_plus.yaml']' returned non-zero exit status 1.

Thanks for your attention! @LikeLy-Journey

@leonmakise leonmakise changed the title Problem when I tried distributed training Bug in tools/train.py Jun 6, 2020
@jiawenhao2015

@leonmakise Hello, I ran into the same error when I tried distributed training. I see your change, but I don't understand what it means.

It should be self.model.cuda()

Original code:
self.model = nn.parallel.DistributedDataParallel(self.model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True)

Is your change the following?
self.model.cuda() = nn.parallel.DistributedDataParallel(self.model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True)

Thank you! Looking forward to your reply.
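
(A call like self.model.cuda() cannot appear on the left-hand side of an assignment, so that exact line would be a SyntaxError. Presumably the intended change is to move the model onto the GPU before, or while, wrapping it; a minimal sketch, assuming the names from the traceback:

self.model = self.model.cuda()  # put all parameters on this rank's GPU first
self.model = nn.parallel.DistributedDataParallel(self.model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True)

or, equivalently, passing self.model.cuda() directly as the first argument of the DDP call.)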
