I've recently been working on model pruning in the YOLOv5 framework, and I modified train.py to use sparse training with torch-pruning.
When I tried to run distributed training on 2 GPUs, some errors occurred:
It seems that during dependency tracing in the pruner initialization, the tensors are not all on the same device, which causes the problem. How can I avoid this when using distributed training?
Here is my train.py after modification. The added code is marked with lines of "#".
# Ultralytics YOLOv3 🚀, AGPL-3.0 license
"""
Train a YOLOv3 model on a custom dataset. Models and datasets download automatically from the latest YOLOv3 release.

Usage - Single-GPU training:
    $ python train.py --data coco128.yaml --weights yolov5s.pt --img 640  # from pretrained (recommended)
    $ python train.py --data coco128.yaml --weights '' --cfg yolov5s.yaml --img 640  # from scratch

Usage - Multi-GPU DDP training:
    $ python -m torch.distributed.run --nproc_per_node 4 --master_port 1 train.py --data coco128.yaml --weights yolov5s.pt --img 640 --device 0,1,2,3

Models:     https://github.com/ultralytics/yolov5/tree/master/models
Datasets:   https://github.com/ultralytics/yolov5/tree/master/data
Tutorial:   https://docs.ultralytics.com/yolov5/tutorials/train_custom_data
"""

import argparse
import math
import os
import random
import subprocess
import sys
import time
from copy import deepcopy
from datetime import datetime
from pathlib import Path

import torch_pruning as tp
from models.common import Bottleneck3

try:
    import comet_ml  # must be imported before torch (if installed)
except ImportError:
    comet_ml = None

import numpy as np
import torch
import torch.distributed as dist
import torch.nn as nn
import yaml
from torch.optim import lr_scheduler
from tqdm import tqdm

FILE = Path(__file__).resolve()
ROOT = FILE.parents[0]  # YOLOv3 root directory
if str(ROOT) not in sys.path:
    sys.path.append(str(ROOT))  # add ROOT to PATH
ROOT = Path(os.path.relpath(ROOT, Path.cwd()))  # relative

import val as validate  # for end-of-epoch mAP
from models.experimental import attempt_load
from models.yolo import Model
from utils.autoanchor import check_anchors
from utils.autobatch import check_train_batch_size
from utils.callbacks import Callbacks
from utils.dataloaders import create_dataloader
from utils.downloads import attempt_download, is_url
from utils.general import (
    LOGGER,
    TQDM_BAR_FORMAT,
    check_amp,
    check_dataset,
    check_file,
    check_git_info,
    check_git_status,
    check_img_size,
    check_requirements,
    check_suffix,
    check_yaml,
    colorstr,
    get_latest_run,
    increment_path,
    init_seeds,
    intersect_dicts,
    labels_to_class_weights,
    labels_to_image_weights,
    methods,
    one_cycle,
    print_args,
    print_mutation,
    strip_optimizer,
    yaml_save,
)
from utils.loggers import Loggers
from utils.loggers.comet.comet_utils import check_comet_resume
from utils.loss import ComputeLoss
from utils.metrics import fitness
from utils.plots import plot_evolve
from utils.torch_utils import (
    EarlyStopping,
    ModelEMA,
    de_parallel,
    select_device,
    smart_DDP,
    smart_optimizer,
    smart_resume,
    torch_distributed_zero_first,
)

LOCAL_RANK = int(os.getenv("LOCAL_RANK", -1))  # https://pytorch.org/docs/stable/elastic/run.html
RANK = int(os.getenv("RANK", -1))
WORLD_SIZE = int(os.getenv("WORLD_SIZE", 1))
GIT_INFO = check_git_info()


def train(hyp, opt, device, callbacks):  # hyp is path/to/hyp.yaml or hyp dictionary
    """
    Train a YOLOv3 model on a custom dataset and manage the training process.

    Args:
        hyp (str | dict): Path to hyperparameters yaml file or hyperparameters dictionary.
        opt (argparse.Namespace): Parsed command line arguments containing training options.
        device (torch.device): Device to load and train the model on.
        callbacks (Callbacks): Callbacks to handle various stages of the training lifecycle.

    Returns:
        None

    Usage - Single-GPU training:
        $ python train.py --data coco128.yaml --weights yolov5s.pt --img 640  # from pretrained (recommended)
        $ python train.py --data coco128.yaml --weights '' --cfg yolov5s.yaml --img 640  # from scratch

    Usage - Multi-GPU DDP training:
        $ python -m torch.distributed.run --nproc_per_node 4 --master_port 1 train.py --data coco128.yaml --weights yolov5s.pt --img 640 --device 0,1,2,3

    Models: https://github.com/ultralytics/yolov5/tree/master/models
    Datasets: https://github.com/ultralytics/yolov5/tree/master/data
    Tutorial: https://docs.ultralytics.com/yolov5/tutorials/train_custom_data

    Examples:
        ```python
        from ultralytics import train
        import argparse
        import torch
        from utils.callbacks import Callbacks

        # Example usage
        args = argparse.Namespace(
            data='coco128.yaml',
            weights='yolov5s.pt',
            cfg='yolov5s.yaml',
            img_size=640,
            epochs=50,
            batch_size=16,
            device='0'
        )

        device = torch.device(f'cuda:{args.device}' if torch.cuda.is_available() else 'cpu')
        callbacks = Callbacks()

        train(hyp='hyp.scratch.yaml', opt=args, device=device, callbacks=callbacks)
        ```
    """
    save_dir, epochs, batch_size, weights, single_cls, evolve, data, cfg, resume, noval, nosave, workers, freeze = (
        Path(opt.save_dir),
        opt.epochs,
        opt.batch_size,
        opt.weights,
        opt.single_cls,
        opt.evolve,
        opt.data,
        opt.cfg,
        opt.resume,
        opt.noval,
        opt.nosave,
        opt.workers,
        opt.freeze,
    )
    callbacks.run("on_pretrain_routine_start")

    # Directories
    w = save_dir / "weights"  # weights dir
    (w.parent if evolve else w).mkdir(parents=True, exist_ok=True)  # make dir
    last, best = w / "last.pt", w / "best.pt"

    # Hyperparameters
    if isinstance(hyp, str):
        with open(hyp, errors="ignore") as f:
            hyp = yaml.safe_load(f)  # load hyps dict
    LOGGER.info(colorstr("hyperparameters: ") + ", ".join(f"{k}={v}" for k, v in hyp.items()))
    opt.hyp = hyp.copy()  # for saving hyps to checkpoints

    # Save run settings
    if not evolve:
        yaml_save(save_dir / "hyp.yaml", hyp)
        yaml_save(save_dir / "opt.yaml", vars(opt))

    # Loggers
    data_dict = None
    if RANK in {-1, 0}:
        loggers = Loggers(save_dir, weights, opt, hyp, LOGGER)  # loggers instance

        # Register actions
        for k in methods(loggers):
            callbacks.register_action(k, callback=getattr(loggers, k))

        # Process custom dataset artifact link
        data_dict = loggers.remote_dataset
        if resume:  # If resuming runs from remote artifact
            weights, epochs, hyp, batch_size = opt.weights, opt.epochs, opt.hyp, opt.batch_size

    # Config
    plots = not evolve and not opt.noplots  # create plots
    cuda = device.type != "cpu"
    init_seeds(opt.seed + 1 + RANK, deterministic=True)
    with torch_distributed_zero_first(LOCAL_RANK):
        data_dict = data_dict or check_dataset(data)  # check if None
    train_path, val_path = data_dict["train"], data_dict["val"]
    nc = 1 if single_cls else int(data_dict["nc"])  # number of classes
    names = {0: "item"} if single_cls and len(data_dict["names"]) != 1 else data_dict["names"]  # class names
    is_coco = isinstance(val_path, str) and val_path.endswith("coco/val2017.txt")  # COCO dataset

    # Model
    check_suffix(weights, ".pt")  # check weights
    pretrained = weights.endswith(".pt")
    if pretrained:
        with torch_distributed_zero_first(LOCAL_RANK):
            weights = attempt_download(weights)  # download if not found locally
        ckpt = torch.load(weights, map_location="cpu")  # load checkpoint to CPU to avoid CUDA memory leak
        model = Model(cfg or ckpt["model"].yaml, ch=3, nc=nc, anchors=hyp.get("anchors")).to(device)  # create
        exclude = ["anchor"] if (cfg or hyp.get("anchors")) and not resume else []  # exclude keys
        csd = ckpt["model"].float().state_dict()  # checkpoint state_dict as FP32
        csd = intersect_dicts(csd, model.state_dict(), exclude=exclude)  # intersect
        model.load_state_dict(csd, strict=False)  # load
        LOGGER.info(f"Transferred {len(csd)}/{len(model.state_dict())} items from {weights}")  # report
    else:
        model = Model(cfg, ch=3, nc=nc, anchors=hyp.get("anchors")).to(device)  # create
    amp = check_amp(model)  # check AMP

    # Freeze
    freeze = [f"model.{x}." for x in (freeze if len(freeze) > 1 else range(freeze[0]))]  # layers to freeze
    for k, v in model.named_parameters():
        v.requires_grad = True  # train all layers
        # v.register_hook(lambda x: torch.nan_to_num(x))  # NaN to 0 (commented for erratic training results)
        if any(x in k for x in freeze):
            LOGGER.info(f"freezing {k}")
            v.requires_grad = False

    # Image size
    gs = max(int(model.stride.max()), 32)  # grid size (max stride)
    imgsz = check_img_size(opt.imgsz, gs, floor=gs * 2)  # verify imgsz is gs-multiple

    # Batch size
    if RANK == -1 and batch_size == -1:  # single-GPU only, estimate best batch size
        batch_size = check_train_batch_size(model, imgsz, amp)
        loggers.on_params_update({"batch_size": batch_size})

    # Optimizer
    nbs = 64  # nominal batch size
    accumulate = max(round(nbs / batch_size), 1)  # accumulate loss before optimizing
    hyp["weight_decay"] *= batch_size * accumulate / nbs  # scale weight_decay
    optimizer = smart_optimizer(model, opt.optimizer, hyp["lr0"], hyp["momentum"], hyp["weight_decay"])

    # Scheduler
    if opt.cos_lr:
        lf = one_cycle(1, hyp["lrf"], epochs)  # cosine 1->hyp['lrf']
    else:

        def lf(x):
            """Linear learning rate scheduler function with decay calculated by epoch proportion."""
            return (1 - x / epochs) * (1.0 - hyp["lrf"]) + hyp["lrf"]  # linear

    scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lf)  # plot_lr_scheduler(optimizer, scheduler, epochs)

    # EMA
    ema = ModelEMA(model) if RANK in {-1, 0} else None

    # Resume
    best_fitness, start_epoch = 0.0, 0
    if pretrained:
        if resume:
            best_fitness, start_epoch, epochs = smart_resume(ckpt, optimizer, ema, weights, epochs, resume)
        del ckpt, csd

    # DP mode
    if cuda and RANK == -1 and torch.cuda.device_count() > 1:
        LOGGER.warning(
            "WARNING ⚠️ DP not recommended, use torch.distributed.run for best DDP Multi-GPU results.\n"
            "See Multi-GPU Tutorial at https://docs.ultralytics.com/yolov5/tutorials/multi_gpu_training to get started."
        )
        model = torch.nn.DataParallel(model)

    # SyncBatchNorm
    if opt.sync_bn and cuda and RANK != -1:
        model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).to(device)
        LOGGER.info("Using SyncBatchNorm()")

    # Trainloader
    train_loader, dataset = create_dataloader(
        train_path,
        imgsz,
        batch_size // WORLD_SIZE,
        gs,
        single_cls,
        hyp=hyp,
        augment=True,
        cache=None if opt.cache == "val" else opt.cache,
        rect=opt.rect,
        rank=LOCAL_RANK,
        workers=workers,
        image_weights=opt.image_weights,
        quad=opt.quad,
        prefix=colorstr("train: "),
        shuffle=True,
        seed=opt.seed,
    )
    labels = np.concatenate(dataset.labels, 0)
    mlc = int(labels[:, 0].max())  # max label class
    assert mlc < nc, f"Label class {mlc} exceeds nc={nc} in {data}. Possible class labels are 0-{nc - 1}"

    # Process 0
    if RANK in {-1, 0}:
        val_loader = create_dataloader(
            val_path,
            imgsz,
            batch_size // WORLD_SIZE * 2,
            gs,
            single_cls,
            hyp=hyp,
            cache=None if noval else opt.cache,
            rect=True,
            rank=-1,
            workers=workers * 2,
            pad=0.5,
            prefix=colorstr("val: "),
        )[0]

        if not resume:
            if not opt.noautoanchor:
                check_anchors(dataset, model=model, thr=hyp["anchor_t"], imgsz=imgsz)  # run AutoAnchor
            model.half().float()  # pre-reduce anchor precision

        callbacks.run("on_pretrain_routine_end", labels, names)

    # DDP mode
    if cuda and RANK != -1:
        model = smart_DDP(model)

    # Model attributes
    nl = de_parallel(model).model[-1].nl  # number of detection layers (to scale hyps)
    hyp["box"] *= 3 / nl  # scale to layers
    hyp["cls"] *= nc / 80 * 3 / nl  # scale to classes and layers
    hyp["obj"] *= (imgsz / 640) ** 2 * 3 / nl  # scale to image size and layers
    hyp["label_smoothing"] = opt.label_smoothing
    model.nc = nc  # attach number of classes to model
    model.hyp = hyp  # attach hyperparameters to model
    model.class_weights = labels_to_class_weights(dataset.labels, nc).to(device) * nc  # attach class weights
    model.names = names

    # Start training
    t0 = time.time()
    nb = len(train_loader)  # number of batches
    nw = max(round(hyp["warmup_epochs"] * nb), 100)  # number of warmup iterations, max(3 epochs, 100 iterations)
    # nw = min(nw, (epochs - start_epoch) / 2 * nb)  # limit warmup to < 1/2 of training
    last_opt_step = -1
    maps = np.zeros(nc)  # mAP per class
    results = (0, 0, 0, 0, 0, 0, 0)  # P, R, mAP@.5, mAP@.5-.95, val_loss(box, obj, cls)
    scheduler.last_epoch = start_epoch - 1  # do not move
    scaler = torch.cuda.amp.GradScaler(enabled=amp)
    stopper, stop = EarlyStopping(patience=opt.patience), False
    compute_loss = ComputeLoss(model)  # init loss class
    callbacks.run("on_train_start")
    LOGGER.info(
        f'Image sizes {imgsz} train, {imgsz} val\n'
        f'Using {train_loader.num_workers * WORLD_SIZE} dataloader workers\n'
        f"Logging results to {colorstr('bold', save_dir)}\n"
        f'Starting training for {epochs} epochs...'
    )
    ##################################################################
    if opt.sparse:
        imp = tp.importance.GroupNormImportance(p=2)
        example_inputs = torch.randn(opt.batch_size, 3, opt.imgsz, opt.imgsz).to(device)  # dummy input
        if torch.cuda.device_count() > 1:
            ignored_layers = [model.module.model[27], model.module.model[33], model.module.model[39], model.module.model[40]]
            for m in model.module.modules():
                if isinstance(m, Bottleneck3):
                    ignored_layers.append(m)
        else:
            ignored_layers = [model.model[27], model.model[33], model.model[39], model.model[40]]
            for m in model.modules():
                if isinstance(m, Bottleneck3):
                    ignored_layers.append(m)
        iterative_steps = 1
        pruner = tp.pruner.GroupNormPruner(
            model,
            example_inputs,
            imp,
            ignored_layers=ignored_layers,
            iterative_steps=iterative_steps,
            ch_sparsity=0.2,
            isomorphic=True,
            global_pruning=True
        )
    ###############################################################
    for epoch in range(start_epoch, epochs):  # epoch ------------------------------------------------------------------
        callbacks.run("on_train_epoch_start")
        model.train()
        # update regularizer
        ############
        if opt.sparse:
            pruner.update_regularizer()
        ############

        # Update image weights (optional, single-GPU only)
        if opt.image_weights:
            cw = model.class_weights.cpu().numpy() * (1 - maps) ** 2 / nc  # class weights
            iw = labels_to_image_weights(dataset.labels, nc=nc, class_weights=cw)  # image weights
            dataset.indices = random.choices(range(dataset.n), weights=iw, k=dataset.n)  # rand weighted idx

        # Update mosaic border (optional)
        # b = int(random.uniform(0.25 * imgsz, 0.75 * imgsz + gs) // gs * gs)
        # dataset.mosaic_border = [b - imgsz, -b]  # height, width borders

        mloss = torch.zeros(3, device=device)  # mean losses
        if RANK != -1:
            train_loader.sampler.set_epoch(epoch)
        pbar = enumerate(train_loader)
        LOGGER.info(("\n" + "%11s" * 7) % ("Epoch", "GPU_mem", "box_loss", "obj_loss", "cls_loss", "Instances", "Size"))
        if RANK in {-1, 0}:
            pbar = tqdm(pbar, total=nb, bar_format=TQDM_BAR_FORMAT)  # progress bar
        optimizer.zero_grad()
        for i, (imgs, targets, paths, _) in pbar:  # batch -------------------------------------------------------------
            callbacks.run("on_train_batch_start")
            ni = i + nb * epoch  # number integrated batches (since train start)
            imgs = imgs.to(device, non_blocking=True).float() / 255  # uint8 to float32, 0-255 to 0.0-1.0

            # Warmup
            if ni <= nw:
                xi = [0, nw]  # x interp
                # compute_loss.gr = np.interp(ni, xi, [0.0, 1.0])  # iou loss ratio (obj_loss = 1.0 or iou)
                accumulate = max(1, np.interp(ni, xi, [1, nbs / batch_size]).round())
                for j, x in enumerate(optimizer.param_groups):
                    # bias lr falls from 0.1 to lr0, all other lrs rise from 0.0 to lr0
                    x["lr"] = np.interp(ni, xi, [hyp["warmup_bias_lr"] if j == 0 else 0.0, x["initial_lr"] * lf(epoch)])
                    if "momentum" in x:
                        x["momentum"] = np.interp(ni, xi, [hyp["warmup_momentum"], hyp["momentum"]])

            # Multi-scale
            if opt.multi_scale:
                sz = random.randrange(int(imgsz * 0.5), int(imgsz * 1.5) + gs) // gs * gs  # size
                sf = sz / max(imgs.shape[2:])  # scale factor
                if sf != 1:
                    ns = [math.ceil(x * sf / gs) * gs for x in imgs.shape[2:]]  # new shape (stretched to gs-multiple)
                    imgs = nn.functional.interpolate(imgs, size=ns, mode="bilinear", align_corners=False)

            # Forward
            with torch.cuda.amp.autocast(amp):
                pred = model(imgs)  # forward
                loss, loss_items = compute_loss(pred, targets.to(device))  # loss scaled by batch_size
                if RANK != -1:
                    loss *= WORLD_SIZE  # gradient averaged between devices in DDP mode
                if opt.quad:
                    loss *= 4.0

            # Backward
            scaler.scale(loss).backward()

            # Optimize - https://pytorch.org/docs/master/notes/amp_examples.html
            if ni - last_opt_step >= accumulate:
                scaler.unscale_(optimizer)  # unscale gradients
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)  # clip gradients
                ###########
                if opt.sparse:
                    pruner.regularize(model)
                ###########
                scaler.step(optimizer)  # optimizer.step
                scaler.update()
                optimizer.zero_grad()
                if ema:
                    ema.update(model)
                last_opt_step = ni

            # Log
            if RANK in {-1, 0}:
                mloss = (mloss * i + loss_items) / (i + 1)  # update mean losses
                mem = f"{torch.cuda.memory_reserved() / 1E9 if torch.cuda.is_available() else 0:.3g}G"  # (GB)
                pbar.set_description(
                    ("%11s" * 2 + "%11.4g" * 5)
                    % (f"{epoch}/{epochs - 1}", mem, *mloss, targets.shape[0], imgs.shape[-1])
                )
                callbacks.run("on_train_batch_end", model, ni, imgs, targets, paths, list(mloss))
                if callbacks.stop_training:
                    return
            # end batch ------------------------------------------------------------------------------------------------

        # Scheduler
        lr = [x["lr"] for x in optimizer.param_groups]  # for loggers
        scheduler.step()

        if RANK in {-1, 0}:
            # mAP
            callbacks.run("on_train_epoch_end", epoch=epoch)
            ema.update_attr(model, include=["yaml", "nc", "hyp", "names", "stride", "class_weights"])
            final_epoch = (epoch + 1 == epochs) or stopper.possible_stop
            if not noval or final_epoch:  # Calculate mAP
                results, maps, _ = validate.run(
                    data_dict,
                    batch_size=batch_size // WORLD_SIZE * 2,
                    imgsz=imgsz,
                    half=amp,
                    model=ema.ema,
                    single_cls=single_cls,
                    dataloader=val_loader,
                    save_dir=save_dir,
                    plots=False,
                    callbacks=callbacks,
                    compute_loss=compute_loss,
                )

            # Update best mAP
            fi = fitness(np.array(results).reshape(1, -1))  # weighted combination of [P, R, mAP@.5, mAP@.5-.95]
            stop = stopper(epoch=epoch, fitness=fi)  # early stop check
            if fi > best_fitness:
                best_fitness = fi
            log_vals = list(mloss) + list(results) + lr
            callbacks.run("on_fit_epoch_end", log_vals, epoch, best_fitness, fi)

            # Save model
            if (not nosave) or (final_epoch and not evolve):  # if save
                ckpt = {
                    "epoch": epoch,
                    "best_fitness": best_fitness,
                    "model": deepcopy(de_parallel(model)).half(),
                    "ema": deepcopy(ema.ema).half(),
                    "updates": ema.updates,
                    "optimizer": optimizer.state_dict(),
                    "opt": vars(opt),
                    "git": GIT_INFO,  # {remote, branch, commit} if a git repo
                    "date": datetime.now().isoformat(),
                }

                # Save last, best and delete
                torch.save(ckpt, last)
                if best_fitness == fi:
                    torch.save(ckpt, best)
                if opt.save_period > 0 and epoch % opt.save_period == 0:
                    torch.save(ckpt, w / f"epoch{epoch}.pt")
                del ckpt
                callbacks.run("on_model_save", last, epoch, final_epoch, best_fitness, fi)

        # EarlyStopping
        if RANK != -1:  # if DDP training
            broadcast_list = [stop if RANK == 0 else None]
            dist.broadcast_object_list(broadcast_list, 0)  # broadcast 'stop' to all ranks
            if RANK != 0:
                stop = broadcast_list[0]
        if stop:
            break  # must break all DDP ranks

        # end epoch ----------------------------------------------------------------------------------------------------
    # end training -----------------------------------------------------------------------------------------------------
    if RANK in {-1, 0}:
        LOGGER.info(f"\n{epoch - start_epoch + 1} epochs completed in {(time.time() - t0) / 3600:.3f} hours.")
        for f in last, best:
            if f.exists():
                strip_optimizer(f)  # strip optimizers
                if f is best:
                    LOGGER.info(f"\nValidating {f}...")
                    results, _, _ = validate.run(
                        data_dict,
                        batch_size=batch_size // WORLD_SIZE * 2,
                        imgsz=imgsz,
                        model=attempt_load(f, device).half(),
                        iou_thres=0.65 if is_coco else 0.60,  # best pycocotools at iou 0.65
                        single_cls=single_cls,
                        dataloader=val_loader,
                        save_dir=save_dir,
                        save_json=is_coco,
                        verbose=True,
                        plots=plots,
                        callbacks=callbacks,
                        compute_loss=compute_loss,
                    )  # val best model with plots
                    if is_coco:
                        callbacks.run("on_fit_epoch_end", list(mloss) + list(results) + lr, epoch, best_fitness, fi)

        callbacks.run("on_train_end", last, best, epoch, results)

    torch.cuda.empty_cache()
    return results


def parse_opt(known=False):
    """
    Parse command line arguments for configuring the training of a YOLO model.

    Args:
        known (bool): Flag to parse known arguments only, defaults to False.

    Returns:
        (argparse.Namespace): Parsed command line arguments.

    Examples:
        ```python
        options = parse_opt()
        print(options.weights)
        ```

    Notes:
        * The default weights path is 'yolov3-tiny.pt'.
        * Set `known` to True for parsing only the known arguments, useful for partial arguments.

    References:
        * Models: https://github.com/ultralytics/yolov5/tree/master/models
        * Datasets: https://github.com/ultralytics/yolov5/tree/master/data
        * Training Tutorial: https://docs.ultralytics.com/yolov5/tutorials/train_custom_data
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--weights", type=str, default=ROOT / "yolov3-tiny.pt", help="initial weights path")
    parser.add_argument("--cfg", type=str, default="", help="model.yaml path")
    parser.add_argument("--data", type=str, default=ROOT / "data/coco128.yaml", help="dataset.yaml path")
    parser.add_argument("--hyp", type=str, default=ROOT / "data/hyps/hyp.scratch-low.yaml", help="hyperparameters path")
    parser.add_argument("--epochs", type=int, default=100, help="total training epochs")
    parser.add_argument("--batch-size", type=int, default=16, help="total batch size for all GPUs, -1 for autobatch")
    parser.add_argument("--imgsz", "--img", "--img-size", type=int, default=640, help="train, val image size (pixels)")
    parser.add_argument("--rect", action="store_true", help="rectangular training")
    parser.add_argument("--resume", nargs="?", const=True, default=False, help="resume most recent training")
    parser.add_argument("--nosave", action="store_true", help="only save final checkpoint")
    parser.add_argument("--noval", action="store_true", help="only validate final epoch")
    parser.add_argument("--noautoanchor", action="store_true", help="disable AutoAnchor")
    parser.add_argument("--noplots", action="store_true", help="save no plot files")
    parser.add_argument("--evolve", type=int, nargs="?", const=300, help="evolve hyperparameters for x generations")
    parser.add_argument("--bucket", type=str, default="", help="gsutil bucket")
    parser.add_argument("--cache", type=str, nargs="?", const="ram", help="image --cache ram/disk")
    parser.add_argument("--image-weights", action="store_true", help="use weighted image selection for training")
    parser.add_argument("--device", default="", help="cuda device, i.e. 0 or 0,1,2,3 or cpu")
    parser.add_argument("--multi-scale", action="store_true", help="vary img-size +/- 50%%")
    parser.add_argument("--single-cls", action="store_true", help="train multi-class data as single-class")
    parser.add_argument("--optimizer", type=str, choices=["SGD", "Adam", "AdamW"], default="SGD", help="optimizer")
    parser.add_argument("--sync-bn", action="store_true", help="use SyncBatchNorm, only available in DDP mode")
    parser.add_argument("--workers", type=int, default=8, help="max dataloader workers (per RANK in DDP mode)")
    parser.add_argument("--project", default=ROOT / "runs/train", help="save to project/name")
    parser.add_argument("--name", default="exp", help="save to project/name")
    parser.add_argument("--exist-ok", action="store_true", help="existing project/name ok, do not increment")
    parser.add_argument("--quad", action="store_true", help="quad dataloader")
    parser.add_argument("--cos-lr", action="store_true", help="cosine LR scheduler")
    parser.add_argument("--label-smoothing", type=float, default=0.0, help="Label smoothing epsilon")
    parser.add_argument("--patience", type=int, default=100, help="EarlyStopping patience (epochs without improvement)")
    parser.add_argument("--freeze", nargs="+", type=int, default=[0], help="Freeze layers: backbone=10, first3=0 1 2")
    parser.add_argument("--save-period", type=int, default=-1, help="Save checkpoint every x epochs (disabled if < 1)")
    parser.add_argument("--seed", type=int, default=0, help="Global training seed")
    parser.add_argument("--local_rank", type=int, default=-1, help="Automatic DDP Multi-GPU argument, do not modify")
    parser.add_argument("--sparse", action="store_true", help="Sparse weights for training")

    # Logger arguments
    parser.add_argument("--entity", default=None, help="Entity")
    parser.add_argument("--upload_dataset", nargs="?", const=True, default=False, help='Upload data, "val" option')
    parser.add_argument("--bbox_interval", type=int, default=-1, help="Set bounding-box image logging interval")
    parser.add_argument("--artifact_alias", type=str, default="latest", help="Version of dataset artifact to use")

    return parser.parse_known_args()[0] if known else parser.parse_args()


def main(opt, callbacks=Callbacks()):
    """
    Main training/evolution script handling model checks, DDP setup, training, and hyperparameter evolution.

    Args:
        opt (argparse.Namespace): Parsed command-line options.
        callbacks (Callbacks, optional): Callback object for handling training events. Defaults to Callbacks().

    Returns:
        None

    Raises:
        AssertionError: If certain constraints are violated (e.g., when specific options are incompatible with DDP training).

    Notes:
        - For a tutorial on using Multi-GPU with DDP: https://docs.ultralytics.com/yolov5/tutorials/multi_gpu_training

    Example:
        Single-GPU training:
        ```python
        $ python train.py --data coco128.yaml --weights yolov5s.pt --img 640  # from pretrained (recommended)
        $ python train.py --data coco128.yaml --weights '' --cfg yolov5s.yaml --img 640  # from scratch
        ```

        Multi-GPU DDP training:
        ```python
        $ python -m torch.distributed.run --nproc_per_node 4 --master_port 1 train.py --data coco128.yaml \
        --weights yolov5s.pt --img 640 --device 0,1,2,3
        ```

    Models: https://github.com/ultralytics/yolov5/tree/master/models
    Datasets: https://github.com/ultralytics/yolov5/tree/master/data
    Tutorial: https://docs.ultralytics.com/yolov5/tutorials/train_custom_data
    """
    if RANK in {-1, 0}:
        print_args(vars(opt))
        check_git_status()
        check_requirements(ROOT / "requirements.txt")

    # Resume (from specified or most recent last.pt)
    if opt.resume and not check_comet_resume(opt) and not opt.evolve:
        last = Path(check_file(opt.resume) if isinstance(opt.resume, str) else get_latest_run())
        opt_yaml = last.parent.parent / "opt.yaml"  # train options yaml
        opt_data = opt.data  # original dataset
        if opt_yaml.is_file():
            with open(opt_yaml, errors="ignore") as f:
                d = yaml.safe_load(f)
        else:
            d = torch.load(last, map_location="cpu")["opt"]
        opt = argparse.Namespace(**d)  # replace
        opt.cfg, opt.weights, opt.resume = "", str(last), True  # reinstate
        if is_url(opt_data):
            opt.data = check_file(opt_data)  # avoid HUB resume auth timeout
    else:
        opt.data, opt.cfg, opt.hyp, opt.weights, opt.project = (
            check_file(opt.data),
            check_yaml(opt.cfg),
            check_yaml(opt.hyp),
            str(opt.weights),
            str(opt.project),
        )  # checks
        assert len(opt.cfg) or len(opt.weights), "either --cfg or --weights must be specified"
        if opt.evolve:
            if opt.project == str(ROOT / "runs/train"):  # if default project name, rename to runs/evolve
                opt.project = str(ROOT / "runs/evolve")
            opt.exist_ok, opt.resume = opt.resume, False  # pass resume to exist_ok and disable resume
        if opt.name == "cfg":
            opt.name = Path(opt.cfg).stem  # use model.yaml as name
        opt.save_dir = str(increment_path(Path(opt.project) / opt.name, exist_ok=opt.exist_ok))

    # DDP mode
    device = select_device(opt.device, batch_size=opt.batch_size)
    if LOCAL_RANK != -1:
        msg = "is not compatible with YOLOv3 Multi-GPU DDP training"
        assert not opt.image_weights, f"--image-weights {msg}"
        assert not opt.evolve, f"--evolve {msg}"
        assert opt.batch_size != -1, f"AutoBatch with --batch-size -1 {msg}, please pass a valid --batch-size"
        assert opt.batch_size % WORLD_SIZE == 0, f"--batch-size {opt.batch_size} must be multiple of WORLD_SIZE"
        assert torch.cuda.device_count() > LOCAL_RANK, "insufficient CUDA devices for DDP command"
        torch.cuda.set_device(LOCAL_RANK)
        device = torch.device("cuda", LOCAL_RANK)
        dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo")

    # Train
    if not opt.evolve:
        train(opt.hyp, opt, device, callbacks)

    # Evolve hyperparameters (optional)
    else:
        # Hyperparameter evolution metadata (mutation scale 0-1, lower_limit, upper_limit)
        meta = {
            "lr0": (1, 1e-5, 1e-1),  # initial learning rate (SGD=1E-2, Adam=1E-3)
            "lrf": (1, 0.01, 1.0),  # final OneCycleLR learning rate (lr0 * lrf)
            "momentum": (0.3, 0.6, 0.98),  # SGD momentum/Adam beta1
            "weight_decay": (1, 0.0, 0.001),  # optimizer weight decay
            "warmup_epochs": (1, 0.0, 5.0),  # warmup epochs (fractions ok)
            "warmup_momentum": (1, 0.0, 0.95),  # warmup initial momentum
            "warmup_bias_lr": (1, 0.0, 0.2),  # warmup initial bias lr
            "box": (1, 0.02, 0.2),  # box loss gain
            "cls": (1, 0.2, 4.0),  # cls loss gain
            "cls_pw": (1, 0.5, 2.0),  # cls BCELoss positive_weight
            "obj": (1, 0.2, 4.0),  # obj loss gain (scale with pixels)
            "obj_pw": (1, 0.5, 2.0),  # obj BCELoss positive_weight
            "iou_t": (0, 0.1, 0.7),  # IoU training threshold
            "anchor_t": (1, 2.0, 8.0),  # anchor-multiple threshold
            "anchors": (2, 2.0, 10.0),  # anchors per output grid (0 to ignore)
            "fl_gamma": (0, 0.0, 2.0),  # focal loss gamma (efficientDet default gamma=1.5)
            "hsv_h": (1, 0.0, 0.1),  # image HSV-Hue augmentation (fraction)
            "hsv_s": (1, 0.0, 0.9),  # image HSV-Saturation augmentation (fraction)
            "hsv_v": (1, 0.0, 0.9),  # image HSV-Value augmentation (fraction)
            "degrees": (1, 0.0, 45.0),  # image rotation (+/- deg)
            "translate": (1, 0.0, 0.9),  # image translation (+/- fraction)
            "scale": (1, 0.0, 0.9),  # image scale (+/- gain)
            "shear": (1, 0.0, 10.0),  # image shear (+/- deg)
            "perspective": (0, 0.0, 0.001),  # image perspective (+/- fraction), range 0-0.001
            "flipud": (1, 0.0, 1.0),  # image flip up-down (probability)
            "fliplr": (0, 0.0, 1.0),  # image flip left-right (probability)
            "mosaic": (1, 0.0, 1.0),  # image mixup (probability)
            "mixup": (1, 0.0, 1.0),  # image mixup (probability)
            "copy_paste": (1, 0.0, 1.0),
        }  # segment copy-paste (probability)

        with open(opt.hyp, errors="ignore") as f:
            hyp = yaml.safe_load(f)  # load hyps dict
            if "anchors" not in hyp:  # anchors commented in hyp.yaml
                hyp["anchors"] = 3
        if opt.noautoanchor:
            del hyp["anchors"], meta["anchors"]
        opt.noval, opt.nosave, save_dir = True, True, Path(opt.save_dir)  # only val/save final epoch
        # ei = [isinstance(x, (int, float)) for x in hyp.values()]  # evolvable indices
        evolve_yaml, evolve_csv = save_dir / "hyp_evolve.yaml", save_dir / "evolve.csv"
        if opt.bucket:
            # download evolve.csv if exists
            subprocess.run(
                [
                    "gsutil",
                    "cp",
                    f"gs://{opt.bucket}/evolve.csv",
                    str(evolve_csv),
                ]
            )

        for _ in range(opt.evolve):  # generations to evolve
            if evolve_csv.exists():  # if evolve.csv exists: select best hyps and mutate
                # Select parent(s)
                parent = "single"  # parent selection method: 'single' or 'weighted'
                x = np.loadtxt(evolve_csv, ndmin=2, delimiter=",", skiprows=1)
                n = min(5, len(x))  # number of previous results to consider
                x = x[np.argsort(-fitness(x))][:n]  # top n mutations
                w = fitness(x) - fitness(x).min() + 1e-6  # weights (sum > 0)
                if parent == "single" or len(x) == 1:
                    # x = x[random.randint(0, n - 1)]  # random selection
                    x = x[random.choices(range(n), weights=w)[0]]  # weighted selection
                elif parent == "weighted":
                    x = (x * w.reshape(n, 1)).sum(0) / w.sum()  # weighted combination

                # Mutate
                mp, s = 0.8, 0.2  # mutation probability, sigma
                npr = np.random
                npr.seed(int(time.time()))
                g = np.array([meta[k][0] for k in hyp.keys()])  # gains 0-1
                ng = len(meta)
                v = np.ones(ng)
                while all(v == 1):  # mutate until a change occurs (prevent duplicates)
                    v = (g * (npr.random(ng) < mp) * npr.randn(ng) * npr.random() * s + 1).clip(0.3, 3.0)
                for i, k in enumerate(hyp.keys()):  # plt.hist(v.ravel(), 300)
                    hyp[k] = float(x[i + 7] * v[i])  # mutate

            # Constrain to limits
            for k, v in meta.items():
                hyp[k] = max(hyp[k], v[1])  # lower limit
                hyp[k] = min(hyp[k], v[2])  # upper limit
                hyp[k] = round(hyp[k], 5)  # significant digits

            # Train mutation
            results = train(hyp.copy(), opt, device, callbacks)
            callbacks = Callbacks()
            # Write mutation results
            keys = (
                "metrics/precision",
                "metrics/recall",
                "metrics/mAP_0.5",
                "metrics/mAP_0.5:0.95",
                "val/box_loss",
                "val/obj_loss",
                "val/cls_loss",
            )
            print_mutation(keys, results, hyp.copy(), save_dir, opt.bucket)

        # Plot results
        plot_evolve(evolve_csv)
        LOGGER.info(
            f'Hyperparameter evolution finished {opt.evolve} generations\n'
            f"Results saved to {colorstr('bold', save_dir)}\n"
            f'Usage example: $ python train.py --hyp {evolve_yaml}'
        )


def run(**kwargs):
    """
    Run the training process for a YOLOv3 model with the specified configurations.

    Args:
        data (str): Path to the dataset YAML file.
        weights (str): Path to the pre-trained weights file or '' to train from scratch.
        cfg (str): Path to the model configuration file.
        hyp (str): Path to the hyperparameters YAML file.
        epochs (int): Total number of training epochs.
        batch_size (int): Total batch size across all GPUs.
        imgsz (int): Image size for training and validation (in pixels).
        rect (bool): Use rectangular training for better aspect ratio preservation.
        resume (bool | str): Resume most recent training if True, or resume training from a specific checkpoint if a string.
        nosave (bool): Only save the final checkpoint and not the intermediate ones.
        noval (bool): Only validate model performance in the final epoch.
        noautoanchor (bool): Disable automatic anchor generation.
        noplots (bool): Do not save any plots.
        evolve (int): Number of generations for hyperparameters evolution.
        bucket (str): Google Cloud Storage bucket name for saving run artifacts.
        cache (str | None): Cache images for faster training ('ram' or 'disk').
        image_weights (bool): Use weighted image selection for training.
        device (str): Device to use for training, e.g., '0' for first GPU or 'cpu' for CPU.
        multi_scale (bool): Use multi-scale training.
        single_cls (bool): Train a multi-class dataset as a single-class.
        optimizer (str): Optimizer to use ('SGD', 'Adam', or 'AdamW').
        sync_bn (bool): Use synchronized batch normalization (only in DDP mode).
        workers (int): Maximum number of dataloader workers (per rank in DDP mode).
        project (str): Location of the output directory.
        name (str): Unique name for the run.
        exist_ok (bool): Allow existing output directory.
        quad (bool): Use quad dataloader.
        cos_lr (bool): Use cosine learning rate scheduler.
        label_smoothing (float): Label smoothing epsilon.
        patience (int): EarlyStopping patience (epochs without improvement).
        freeze (list[int]): List of layers to freeze, e.g., [0] to freeze only the first layer.
        save_period (int): Save checkpoint every 'save_period' epochs (disabled if less than 1).
        seed (int): Global training seed for reproducibility.
        local_rank (int): For automatic DDP Multi-GPU argument parsing, do not modify.

    Returns:
        None

    Example:
        ```python
        from ultralytics import run
        run(data='coco128.yaml', weights='yolov5m.pt', imgsz=320, epochs=100, batch_size=16)
        ```

    Notes:
        - Ensure the dataset YAML file and initial weights are accessible.
        - Refer to the [Ultralytics YOLOv5 repository](https://github.com/ultralytics/yolov5) for model and data configurations.
        - Use the [Training Tutorial](https://docs.ultralytics.com/yolov5/tutorials/train_custom_data) for custom dataset training.
    """
    opt = parse_opt(True)
    for k, v in kwargs.items():
        setattr(opt, k, v)
    main(opt)
    return opt


if __name__ == "__main__":
    opt = parse_opt()
    main(opt)
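
One direction that might be worth trying for the device mismatch described at the top of this issue, offered as a sketch and not a confirmed fix: in the script above the pruner receives the DP/DDP-wrapped `model` together with a full-batch dummy input. The snippet below instead unwraps the parallel wrapper and creates a single-image dummy input on whatever device the unwrapped model's parameters live on. The helper names `unwrap_model` and `make_example_inputs` are hypothetical, and the tiny `nn.Sequential` is only a stand-in for the YOLO model so the sketch runs on its own. Whether this removes the torch-pruning error has not been verified here.

```python
import torch
import torch.nn as nn


def unwrap_model(model: nn.Module) -> nn.Module:
    """Return the underlying module if `model` is wrapped in DataParallel or DDP."""
    wrappers = (nn.DataParallel, nn.parallel.DistributedDataParallel)
    return model.module if isinstance(model, wrappers) else model


def make_example_inputs(model: nn.Module, imgsz: int = 64) -> torch.Tensor:
    """Create a one-image dummy input on the same device as the model's parameters."""
    device = next(model.parameters()).device
    return torch.randn(1, 3, imgsz, imgsz, device=device)


# Hypothetical stand-in for the YOLO model, only to make the sketch runnable.
net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
wrapped = nn.DataParallel(net)

base = unwrap_model(wrapped)                 # plain module, no parallel wrapper
example_inputs = make_example_inputs(base)   # lives on the same device as `base`
```

In the training script, `base` and `example_inputs` would then take the place of `model` and the `opt.batch_size`-sized tensor in the `tp.pruner.GroupNormPruner(...)` call, and the `ignored_layers` lookup could index `base.model[...]` in both branches. The script already imports `de_parallel` from `utils.torch_utils`, which serves the same purpose as `unwrap_model` here.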
tobymuller233 changed the title from "Sparse training device problem" to "Distributed sparse training device problem" on Nov 21, 2024