Training recipes

We provide the specific commands and hyper-parameters for training ViTs, ResNets, and ConvNexts in this recipe.

Training of ViT

1) Training with Setting I

This is a prevalent setting for training ResNets. To train ViT-small, you can use the following command.

python -m torch.distributed.launch --nproc_per_node=8 ./ 
    --data-dir ${IMAGENET_DIR}   \
    --model deit_small_patch16_224 \
    --sched cosine -j 10 \
    --epochs ${EPOCH} --weight-decay 0.02 \
    --opt Adan \
    --lr 1.5e-2  --opt-betas 0.98 0.92 0.99 \
    --opt-eps 1e-8 --max-grad-norm 0.0 \
    --warmup-lr 1e-8 --min-lr 1.0e-08 \
    -b 256 --amp \
    --aug-repeats 0 \
    --warmup-epochs 60 \
    --aa rand-m7-mstd0.5-inc1 \
    --smoothing 0.1 \
    --remode pixel \
    --reprob 0.0 \
    --bce \
    --drop 0.0 --drop-path 0.05 \
    --mixup 0.2 --cutmix 1.0 \
    --output ${OUT_DIR} \
    --experiment ${EXP_DIR}

After training, this command should give the following results. Note that this setting does not seem to improve on the results that ViT-Base obtains under training Setting II (see below).

| Model | 150 Epoch | 300 Epoch |
| --- | --- | --- |
| ViT-Small | 80.1 | 81.1 |
| download | config/log/model | config/log/model |
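
The launch command above starts 8 processes, each with -b 256. Assuming -b is the batch size per process, as is usual in timm-style training scripts (this is an assumption, not something stated in this recipe), the implied global batch size can be worked out as in the sketch below. The variable names are illustrative only.

```shell
# Sketch: global batch size implied by the launch command.
# Assumption: -b is the batch size per process (one process per GPU).
NPROC_PER_NODE=8     # from --nproc_per_node=8
BATCH_PER_PROC=256   # from -b 256
GLOBAL_BATCH=$((NPROC_PER_NODE * BATCH_PER_PROC))
echo "global batch size: ${GLOBAL_BATCH}"
```

If you launch fewer processes, this total shrinks accordingly, so the hyper-parameters listed here may need retuning.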

2) Training with Setting II

This is the official setting used in DeiT. Note that, without distillation, DeiTs and ViTs are the same models. To train a ViT, you can use the following command, filling in the variables as described below.

python -m torch.distributed.launch --nproc_per_node=8 ./ 
    --data-dir ${IMAGENET_DIR} \
    --model ${MODEL_NAME} \
    --sched cosine -j 10 \
    --epochs ${EPOCH} --weight-decay .02 \
    --opt Adan \
    --lr 1.5e-2  --opt-betas 0.98 0.92 0.99 \
    --opt-eps 1e-8 --max-grad-norm 5.0 \
    --warmup-lr 1e-8 --min-lr 1e-5 \
    -b 256 --amp \
    --aug-repeats ${REP} \
    --warmup-epochs 60 \
    --aa ${AUG}  \
    --smoothing 0.1 \
    --remode pixel \
    --reprob 0.25 \
    --drop 0.0 --drop-path ${Dp} \
    --mixup 0.8 --cutmix 1.0 \
    --output ${OUT_DIR} \
    --experiment ${EXP_DIR}

There are some differences between the hyper-parameters for ViT-Base and ViT-Small. --bce means using the Binary Cross Entropy loss; the BCE column below indicates whether that flag is passed.

| Model | MODEL_NAME | REP | AUG | BCE | Bias-Decay | Drop-path |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-Small | deit_small_patch16_224 | 0 | rand-m7-mstd0.5-inc1 | True | False | 0.1 |
| ViT-Base | deit_base_patch16_224 | 3 | rand-m9-mstd0.5-inc1 | False | True | 0.2 |
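
The per-model values in the table above can be collected in a small helper so that the Setting II command is filled in consistently. This is only a sketch: the function name select_vit_setting2 and the EXTRA_FLAGS variable are hypothetical, and mapping the Bias-Decay column to the --bias-decay flag is an assumption based on the flag used in the ResNet commands of this recipe.

```shell
# Sketch: pick the Setting II variables for a given ViT variant.
# Values come from the table; select_vit_setting2 and EXTRA_FLAGS are
# hypothetical helpers, not part of the training script.
select_vit_setting2() {
    case "$1" in
        small)
            MODEL_NAME=deit_small_patch16_224
            REP=0
            AUG=rand-m7-mstd0.5-inc1
            Dp=0.1
            EXTRA_FLAGS="--bce"          # BCE: True, Bias-Decay: False
            ;;
        base)
            MODEL_NAME=deit_base_patch16_224
            REP=3
            AUG=rand-m9-mstd0.5-inc1
            Dp=0.2
            EXTRA_FLAGS="--bias-decay"   # BCE: False, Bias-Decay: True (assumed flag)
            ;;
        *)
            echo "unknown variant: $1" >&2
            return 1
            ;;
    esac
}

select_vit_setting2 base
echo "${MODEL_NAME} REP=${REP} AUG=${AUG} Dp=${Dp} ${EXTRA_FLAGS}"
```

After calling the helper, ${MODEL_NAME}, ${REP}, ${AUG}, and ${Dp} match the placeholders in the command, and ${EXTRA_FLAGS} would be appended to it.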

After training, you should expect the following results. Note that ViT-Base (300 epochs) is trained with the faster version of Adan (foreach=True). For more details and settings, please refer to the corresponding configuration files.

| Model | 150 Epoch | 300 Epoch |
| --- | --- | --- |
| ViT-Small | 79.6 | 80.9 |
| download | config/log/model | config/log/model |
| ViT-Base | 81.7 | 82.3/82.6 |
| download | config/log/model | config/log/model |


This is a default setting used to train ResNets. To train ResNet-50, you can use the following command.

python -m torch.distributed.launch --nproc_per_node=8 ./ 
    --data-dir ${IMAGENET_DIR} \
    --model resnet50 \
    --sched cosine -j 8 \
    --epochs ${EPOCH} --weight-decay .02 \
    --opt Adan \
    --lr ${LR}  --opt-betas 0.98 0.92 0.99 \
    --opt-eps 1e-8 --max-grad-norm 5.0 \
    --warmup-lr 1e-9 --min-lr 1.0e-05 --bias-decay \
    -b 256 --amp \
    --aug-repeats 0 \
    --warmup-epochs 60 \
    --aa rand-m7-mstd0.5-inc1 \
    --smoothing 0.0 \
    --remode pixel \
    --crop-pct 0.95 \
    --reprob 0.0 \
    --bce \
    --drop 0.0 --drop-path 0.05 \
    --mixup 0.1 --cutmix 1.0 \
    --output ${OUT_DIR} \
    --experiment ${EXP_DIR}

When training for different numbers of epochs, we use slightly different learning rates: LR = 3e-2 for EPOCH = 100 and LR = 1.5e-2 for EPOCH = 200 and 300. After training, you can get the following results:

| Model | 100 Epoch | 200 Epoch | 300 Epoch |
| --- | --- | --- | --- |
| ResNet-50 | 78.1 | 79.7 | 80.2 |
| download | config/log/model | config/log/model | config/log/model |


To train ResNet-101, you may use the following command.

python -m torch.distributed.launch --nproc_per_node=8 \
    --data-dir ${IMAGENET_DIR} \
    --model resnet101 \
    --sched cosine -j 8 \
    --epochs 300 --weight-decay .02 \
    --lr 1.5e-2  --warmup-lr 1e-9 --min-lr 1.0e-05 \
    -b 256 --amp --opt adan --opt-betas 0.98 0.92 0.99 --opt-eps 1e-8 \
    --max-grad-norm 5 \
    --bias-decay \
    --aug-repeats 0 \
    --warmup-epochs 90 \
    --aa rand-m7-mstd0.5-inc1 \
    --smoothing 0.0 \
    --remode pixel \
    --bce-loss \
    --crop-pct 0.95 \
    --reprob 0.0 \
    --drop 0.0 --drop-path 0.2 \
    --mixup 0.1 --cutmix 1.0 \
    --output ${OUT_DIR} \
    --experiment ${EXP_DIR}

We use slightly different learning rates for different epoch budgets: LR = 1e-2 for EPOCH = 100 and LR = 1.5e-2 for EPOCH = 200 and 300. For more detailed training settings, please refer to the following configuration files. Note that the results for 100 and 300 epochs are obtained with the faster version of Adan (foreach=True).

| Model | 100 Epoch | 200 Epoch | 300 Epoch |
| --- | --- | --- | --- |
| ResNet-101 | 80.0 | 81.6 | 81.9 |
| download | config/log/model | config/log/model | config/log/model |
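
Both ResNet recipes tie the learning rate to the epoch budget. A minimal sketch of that lookup is shown below; resnet_lr is a hypothetical helper, and it covers only the model and epoch combinations stated in the text.

```shell
# Sketch: learning rate by model and epoch budget, as stated in the text.
# resnet_lr is a hypothetical helper; it prints the value to pass to --lr.
resnet_lr() {
    model=$1
    epochs=$2
    case "${model}:${epochs}" in
        resnet50:100)                 echo 3e-2 ;;
        resnet101:100)                echo 1e-2 ;;
        resnet50:200|resnet50:300)    echo 1.5e-2 ;;
        resnet101:200|resnet101:300)  echo 1.5e-2 ;;
        *)
            echo "no documented LR for ${model} at ${epochs} epochs" >&2
            return 1
            ;;
    esac
}

LR=$(resnet_lr resnet101 100)
echo "--lr ${LR}"
```

Returning an error for undocumented combinations, rather than guessing a value, keeps the helper limited to settings that were actually reported.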


This is a default setting for training ConvNext-tiny. You can use the following command.

python -m torch.distributed.launch --nproc_per_node=8 ./ 
    --data-dir ${IMAGENET_DIR} \
    --model convnext_tiny_hnf \
    --sched cosine -j 8 \
    --epochs ${EPOCH} --weight-decay .02 \
    --opt Adan \
    --lr 1.6e-2  --opt-betas 0.98 0.92 0.90 \
    --opt-eps 1e-8 --max-grad-norm 0.0 \
    --warmup-lr 1e-9 --min-lr 1.0e-05 --bias-decay \
    -b 256 --amp \
    --aug-repeats 0 \
    --warmup-epochs 150 \
    --aa rand-m7-mstd0.5-inc1 \
    --smoothing 0.1 \
    --remode pixel \
    --reprob 0.25 \
    --drop 0.0 --drop-path 0.1 \
    --mixup 0.8 --cutmix 1.0 \
    --model-ema \
    --train-interpolation random \
    --output ${OUT_DIR} \
    --experiment ${EXP_DIR}

For this recipe, performance is not sensitive to some hyper-parameters, such as warmup-epochs and lr. However, whether model-ema is used plays a key role.

You can use the following config to train ConvNext-tiny for 150 epochs, in which we do not use model-ema.

The results should be:

| Model | 150 Epoch | 300 Epoch |
| --- | --- | --- |
| ConvNext-tiny | 81.7 | 82.4 |
| download | config/log/model | config/log/model |
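
Since the 150-epoch config does not use model-ema while the command above passes --model-ema, the flag can be toggled by epoch budget. The sketch below assumes that the 300-epoch run keeps the flag as in the command above; convnext_ema_flag is a hypothetical helper.

```shell
# Sketch: choose whether to pass --model-ema for ConvNext-tiny.
# Assumption: 300 epochs uses the flag (as in the command above),
# 150 epochs omits it (as stated for the 150-epoch config).
convnext_ema_flag() {
    case "$1" in
        150) echo "" ;;
        300) echo "--model-ema" ;;
        *)
            echo "no documented setting for $1 epochs" >&2
            return 1
            ;;
    esac
}

EPOCH=300
EMA_FLAG=$(convnext_ema_flag "${EPOCH}")
echo "epochs=${EPOCH} ema_flag=${EMA_FLAG}"
```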