We provide the specific commands and hyper-parameters for ViTs, ResNets and ConvNexts in this recipe.
This is a prevalent setting for training ResNets. To train ViT-small, you can use the following command.
python -m torch.distributed.launch --nproc_per_node=8 ./train.py \
--data-dir ${IMAGENET_DIR} \
--model deit_small_patch16_224 \
--sched cosine -j 10 \
--epochs ${EPOCH} --weight-decay 0.02 \
--opt Adan \
--lr 1.5e-2 --opt-betas 0.98 0.92 0.99 \
--opt-eps 1e-8 --max-grad-norm 0.0 \
--warmup-lr 1e-8 --min-lr 1.0e-08 \
-b 256 --amp \
--aug-repeats 0 \
--warmup-epochs 60 \
--aa rand-m7-mstd0.5-inc1 \
--smoothing 0.1 \
--remode pixel \
--reprob 0.0 \
--bce \
--drop 0.0 --drop-path 0.05 \
--mixup 0.2 --cutmix 1.0 \
--output ${OUT_DIR} \
--experiment ${EXP_DIR}
After training, this command should give the following results. Note that this setting does not appear to improve the results of ViT-Base over training setting II (see below).
| | 150 Epoch | 300 Epoch |
|---|---|---|
| ViT-Small | 80.1 | 81.1 |
| download | config/log/model | config/log/model |
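
The command above combines `--sched cosine` with `--warmup-epochs`, `--warmup-lr` and `--min-lr`. As a rough illustration of how these values interact, the following sketch computes a per-epoch learning rate with a linear warmup followed by cosine decay. It is a simplified model under stated assumptions (warmup counted as part of the total epochs, one update per epoch), not the exact scheduler used by `train.py`; the function name `lr_at_epoch` is hypothetical.

```python
import math


def lr_at_epoch(epoch, total_epochs, base_lr=1.5e-2, warmup_lr=1e-8,
                min_lr=1e-8, warmup_epochs=60):
    """Simplified per-epoch learning rate: linear warmup, then cosine decay.

    Defaults mirror --lr, --warmup-lr, --min-lr and --warmup-epochs from the
    command above; the schedule shape itself is an assumption of this sketch.
    """
    if epoch < warmup_epochs:
        # Linear ramp from warmup_lr up to base_lr.
        return warmup_lr + (base_lr - warmup_lr) * epoch / warmup_epochs
    # Cosine decay from base_lr down to min_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for e in (0, 30, 60, 150, 300):
        print(e, f"{lr_at_epoch(e, total_epochs=300):.3e}")
```

With a 60-epoch warmup, a 150-epoch run spends 40% of its budget ramping up, which is worth keeping in mind when comparing the 150- and 300-epoch results.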
This is the official setting used in DeiT. Note that, without distillation, DeiTs and ViTs are the same models. To train ViT-Small, you can use the following command.
python -m torch.distributed.launch --nproc_per_node=8 ./train.py \
--data-dir ${IMAGENET_DIR} \
--model ${MODEL_NAME} \
--sched cosine -j 10 \
--epochs ${EPOCH} --weight-decay .02 \
--opt Adan \
--lr 1.5e-2 --opt-betas 0.98 0.92 0.99 \
--opt-eps 1e-8 --max-grad-norm 5.0 \
--warmup-lr 1e-8 --min-lr 1e-5 \
-b 256 --amp \
--aug-repeats ${REP} \
--warmup-epochs 60 \
--aa ${AUG} \
--smoothing 0.1 \
--remode pixel \
--reprob 0.25 \
--drop 0.0 --drop-path ${Dp} \
--mixup 0.8 --cutmix 1.0 \
--output ${OUT_DIR} \
--experiment ${EXP_DIR}
There are some differences between the hyper-parameters for ViT-Base and ViT-Small. `--bce` means using the binary cross-entropy loss.
| | MODEL_NAME | REP | AUG | BCE | Bias-Decay | Drop-path |
|---|---|---|---|---|---|---|
| ViT-Small | deit_small_patch16_224 | 0 | rand-m7-mstd0.5-inc1 | True | False | 0.1 |
| ViT-Base | deit_base_patch16_224 | 3 | rand-m9-mstd0.5-inc1 | False | True | 0.2 |
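
The per-model values in the table above can be kept in a small lookup so that the command-line arguments are generated rather than edited by hand. The sketch below is only an illustration: the names `VIT_SETTINGS` and `model_specific_args` are hypothetical, and it assumes that the BCE and Bias-Decay columns correspond to the `--bce` and `--bias-decay` switches that appear in the other commands of this recipe.

```python
# Per-model settings copied from the table above.
VIT_SETTINGS = {
    "ViT-Small": {"model": "deit_small_patch16_224", "rep": 0,
                  "aug": "rand-m7-mstd0.5-inc1", "bce": True,
                  "bias_decay": False, "drop_path": 0.1},
    "ViT-Base": {"model": "deit_base_patch16_224", "rep": 3,
                 "aug": "rand-m9-mstd0.5-inc1", "bce": False,
                 "bias_decay": True, "drop_path": 0.2},
}


def model_specific_args(name):
    """Return the command-line arguments that differ between the two models."""
    s = VIT_SETTINGS[name]
    args = ["--model", s["model"],
            "--aug-repeats", str(s["rep"]),
            "--aa", s["aug"],
            "--drop-path", str(s["drop_path"])]
    # Assumption: boolean columns map to plain switches, present only when True.
    if s["bce"]:
        args.append("--bce")
    if s["bias_decay"]:
        args.append("--bias-decay")
    return args


if __name__ == "__main__":
    print(" ".join(model_specific_args("ViT-Base")))
```

The returned list can be appended to the shared part of the command, for example when launching through `subprocess.run`.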
After training, you should expect the following results. Note that ViT-Base (300 epochs) was trained with the faster version of Adan (`foreach=True`). For more details and settings, please refer to the corresponding configuration files.
| | 150 Epoch | 300 Epoch |
|---|---|---|
| ViT-Small | 79.6 | 80.9 |
| download | config/log/model | config/log/model |
| ViT-Base | 81.7 | 82.3/82.6 |
| download | config/log/model | config/log/model |
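
For readers unfamiliar with the `--bce` option mentioned for ViT-Small, the sketch below contrasts ordinary softmax cross-entropy with a per-class sigmoid binary cross-entropy on a single sample. It is written in plain Python for clarity and shows only the basic form of the loss; the actual loss class used by `train.py` may add further details (for example target smoothing), so treat the function names and behavior here as assumptions.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Standard softmax cross-entropy for one sample."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


def binary_cross_entropy(logits, targets):
    """Mean per-class sigmoid BCE; targets are per-class values in [0, 1]."""
    total = 0.0
    for z, t in zip(logits, targets):
        # Numerically stable form of
        # -[t * log(sigmoid(z)) + (1 - t) * log(1 - sigmoid(z))].
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


if __name__ == "__main__":
    logits = [2.0, -1.0, 0.5]
    print(softmax_cross_entropy(logits, 0))
    print(binary_cross_entropy(logits, [1.0, 0.0, 0.0]))
```

Because each class is scored independently, the BCE form accepts soft targets such as those produced by mixup or cutmix without modification.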
This is a default setting used to train ResNets. To train ResNet-50, you can use the following command.
python -m torch.distributed.launch --nproc_per_node=8 ./train.py \
--data-dir ${IMAGENET_DIR} \
--model resnet50 \
--sched cosine -j 8 \
--epochs ${EPOCH} --weight-decay .02 \
--opt Adan \
--lr ${LR} --opt-betas 0.98 0.92 0.99 \
--opt-eps 1e-8 --max-grad-norm 5.0 \
--warmup-lr 1e-9 --min-lr 1.0e-05 --bias-decay \
-b 256 --amp \
--aug-repeats 0 \
--warmup-epochs 60 \
--aa rand-m7-mstd0.5-inc1 \
--smoothing 0.0 \
--remode pixel \
--crop-pct 0.95 \
--reprob 0.0 \
--bce \
--drop 0.0 --drop-path 0.05 \
--mixup 0.1 --cutmix 1.0 \
--output ${OUT_DIR} \
--experiment ${EXP_DIR}
When training for different numbers of epochs, we use slightly different learning rates: `LR = 3e-2` for `EPOCH = 100`, and `LR = 1.5e-2` for `EPOCH = 200` and `300`. After training, you can get the following results:
| | 100 Epoch | 200 Epoch | 300 Epoch |
|---|---|---|---|
| ResNet-50 | 78.1 | 79.7 | 80.2 |
| download | config/log/model | config/log/model | config/log/model |
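
Since the learning rate for ResNet-50 depends on the epoch budget, it can help to derive `${EPOCH}` and `${LR}` from a single lookup rather than setting them independently. The following is a minimal sketch; the names `RESNET50_LR` and `resnet50_schedule_args` are hypothetical, and only the three epoch budgets reported above are covered.

```python
# Peak learning rates for ResNet-50 as stated above, keyed by epoch budget.
RESNET50_LR = {100: 3e-2, 200: 1.5e-2, 300: 1.5e-2}


def resnet50_schedule_args(epochs):
    """Return the --epochs/--lr arguments for a reported epoch budget."""
    if epochs not in RESNET50_LR:
        # No result is reported for other budgets, so refuse to guess.
        raise ValueError(f"no reported setting for {epochs} epochs")
    return ["--epochs", str(epochs), "--lr", str(RESNET50_LR[epochs])]


if __name__ == "__main__":
    print(" ".join(resnet50_schedule_args(100)))
```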
To train ResNet-101, you may use the following command.
python -m torch.distributed.launch --nproc_per_node=8 train.py \
--data-dir ${IMAGENET_DIR} \
--model resnet101 \
--sched cosine -j 8 \
--epochs 300 --weight-decay .02 \
--lr 1.5e-2 --warmup-lr 1e-9 --min-lr 1.0e-05 \
-b 256 --amp --opt adan --opt-betas 0.98 0.92 0.99 --opt-eps 1e-8 \
--max-grad-norm 5 \
--bias-decay \
--aug-repeats 0 \
--warmup-epochs 90 \
--aa rand-m7-mstd0.5-inc1 \
--smoothing 0.0 \
--remode pixel \
--bce-loss \
--crop-pct 0.95 \
--reprob 0.0 \
--drop 0.0 --drop-path 0.2 \
--mixup 0.1 --cutmix 1.0 \
--output ${OUT_DIR} \
--experiment ${EXP_DIR}
We use slightly different learning rates: `LR = 1e-2` for `EPOCH = 100`, and `LR = 1.5e-2` for `EPOCH = 200` and `300`. For more detailed training settings, please refer to the following configuration files. Note that the results for 100 and 300 epochs were obtained with the faster version of Adan (`foreach=True`).
| | 100 Epoch | 200 Epoch | 300 Epoch |
|---|---|---|---|
| ResNet-101 | 80.0 | 81.6 | 81.9 |
| download | config/log/model | config/log/model | config/log/model |
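
The ResNet commands pass `--max-grad-norm 5`, while some other commands in this recipe pass `0.0`. As a rough illustration of what clipping by global gradient norm does, here is a plain-Python sketch. It assumes that a value of `0` means clipping is disabled and uses nested lists in place of tensors; the function name `clip_by_global_norm` and the `eps` value are hypothetical, and the optimizer's own implementation may differ in detail.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients so their joint L2 norm is at most max_norm.

    grads is a list of lists of floats (one inner list per parameter).
    Assumption of this sketch: max_norm <= 0 means clipping is disabled.
    """
    if max_norm <= 0:
        return [list(g) for g in grads]
    # One norm over every gradient entry, not one norm per parameter.
    global_norm = math.sqrt(sum(x * x for g in grads for x in g))
    scale = min(1.0, max_norm / (global_norm + eps))
    return [[x * scale for x in g] for g in grads]


if __name__ == "__main__":
    grads = [[3.0], [4.0]]  # global norm is 5.0
    print(clip_by_global_norm(grads, max_norm=2.5))
    print(clip_by_global_norm(grads, max_norm=0.0))
```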
This is the default setting for training ConvNext-tiny. To train it, you can use the following command.
python -m torch.distributed.launch --nproc_per_node=8 ./train.py \
--data-dir ${IMAGENET_DIR} \
--model convnext_tiny_hnf \
--sched cosine -j 8 \
--epochs ${EPOCH} --weight-decay .02 \
--opt Adan \
--lr 1.6e-2 --opt-betas 0.98 0.92 0.90 \
--opt-eps 1e-8 --max-grad-norm 0.0 \
--warmup-lr 1e-9 --min-lr 1.0e-05 --bias-decay \
-b 256 --amp \
--aug-repeats 0 \
--warmup-epochs 150 \
--aa rand-m7-mstd0.5-inc1 \
--smoothing 0.1 \
--remode pixel \
--reprob 0.25 \
--drop 0.0 --drop-path 0.1 \
--mixup 0.8 --cutmix 1.0 \
--model-ema \
--train-interpolation random \
--output ${OUT_DIR} \
--experiment ${EXP_DIR}
For this training, the performance is NOT sensitive to some hyper-parameters, such as `warmup-epochs` and `lr`, but whether `model-ema` is used plays a key role.
You can use the following config to train ConvNext-tiny for 150 epochs, in which we do not utilize `model-ema`.
The results should be:
| | 150 Epoch | 300 Epoch |
|---|---|---|
| ConvNext-tiny | 81.7 | 82.4 |
| download | config/log/model | config/log/model |
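
Because `--model-ema` matters for the ConvNext-tiny recipe, a short illustration of the underlying idea may be useful. The sketch below keeps an exponential moving average of the weights alongside the trained weights, using plain lists instead of tensors. The function name `ema_update` and the default decay are illustrative assumptions, not values taken from this recipe.

```python
def ema_update(ema_weights, weights, decay=0.9999):
    """One EMA step: ema <- decay * ema + (1 - decay) * current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


if __name__ == "__main__":
    ema = [0.0, 0.0]
    # A small decay is used here only so the effect is visible in three steps.
    for step_weights in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
        ema = ema_update(ema, step_weights, decay=0.5)
    print(ema)  # [0.875, 1.75]
```

The averaged weights are typically the ones evaluated, which smooths out step-to-step noise in the trained weights.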