Training

We provide ImageNet-1K training commands here. Please check INSTALL.md for installation instructions first. Please refer to OpenMixup implementation for RSB A3 and ImageNet-21K training.

ImageNet-1K Training

Taking MogaNet-T as an example, you can use the following command to run this experiment on a single machine (8GPUs):

python -m torch.distributed.launch --nproc_per_node=8 train.py \
--model moganet_tiny --img_size 224 --drop_path 0.1 \
--epochs 300 --batch_size 128 --lr 1e-3 --weight_decay 0.04 \
--aa rand-m7-mstd0.5-inc1 --crop_pct 0.9 --mixup 0.1 \
--amp --native_amp \
--data_dir /path/to/imagenet-1k \
--experiment /path/to/save_results

Batch size scaling. The effective batch size is equal to --nproc_per_node * --batch_size. In the example above, the effective batch size is 8*128 = 1024. Running on one machine, we can reduce --batch_size and use --amp to avoid OOM issues while keeping the total batch size unchanged. As for fp16 training with Pytorch>=1.6.0, we recommend using --amp --native_amp instead of apex-amp.
Learning rate scaling. The default learning rate setting is lr=1e-3 / bs1024. We find that lr=2e-3 / bs1024 and lr=1e-3 / bs512 produce better performances and more stable training for MogaNet-XT/T and MogaNet-S/B/L/XL.
EMA evaluation. We adopt the EMA trick for MogaNet-S/B/L using --model_ema and --model_ema_decay 0.9999 for better performances.
The difference between this repo and OpenMixup's implementation. In OpenMixup, we adopt attn_force_fp32 to run the gating functions with fp32 to avoid inf or nan during fp16 training. We found that if we use attn_force_fp32=True during training, it should also keep attn_force_fp32=True during evaluation because the difference between the output results of using attn_force_fp32 or not. It will not affect performances of fully fine-tuning but the results of transfer learning (e.g., COCO Mask-RCNN freezes the parameters of the first stage). We set attn_force_fp32 to true in OpenMixup while turning it off in this repo (to facilitate code migration).

To train other MogaNet variants, --model and --drop_path need to be changed. Examples with single-machine commands are given below:

MogaNet-XT

Single-machine (8GPUs) with the input size of 224:

python -m torch.distributed.launch --nproc_per_node=8 train.py \
--model moganet_xtiny --img_size 224 --drop_path 0.05 \
--epochs 300 --batch_size 128 --lr 1e-3 --weight_decay 0.03 \
--aa rand-m7-mstd0.5-inc1 --crop_pct 0.9 --mixup 0.1 \
--amp --native_amp \
--data_dir /path/to/imagenet-1k \
--experiment /path/to/save_results

MogaNet-Tiny

Single-machine (8GPUs) with the input size of 224:

python -m torch.distributed.launch --nproc_per_node=8 train.py \
--model moganet_tiny --img_size 224 --drop_path 0.1 \
--epochs 300 --batch_size 128 --lr 1e-3 --weight_decay 0.04 \
--aa rand-m7-mstd0.5-inc1 --crop_pct 0.9 --mixup 0.1 \
--amp --native_amp \
--data_dir /path/to/imagenet-1k \
--experiment /path/to/save_results

Single-machine (8GPUs) with the input size of 256:

python -m torch.distributed.launch --nproc_per_node=8 train.py \
--model moganet_tiny --img_size 256 --drop_path 0.1 \
--epochs 300 --batch_size 128 --lr 1e-3 --weight_decay 0.04 \
--aa rand-m7-mstd0.5-inc1 --crop_pct 0.9 --mixup 0.1 \
--amp --native_amp \
--data_dir /path/to/imagenet-1k \
--experiment /path/to/save_results

MogaNet-Small

Single-machine (8GPUs) with the input size of 224 with EMA (you can evaluate it without EMA):

python -m torch.distributed.launch --nproc_per_node=8 train.py \
--model moganet_small --img_size 224 --drop_path 0.1 \
--epochs 300 --batch_size 128 --lr 1e-3 --weight_decay 0.05 \
--crop_pct 0.9 --min_lr 1e-5 \
--model_ema --model_ema_decay 0.9999 \
--data_dir /path/to/imagenet-1k \
--experiment /path/to/save_results

MogaNet-Base

Single-machine (8GPUs) with the input size of 224 with EMA:

python -m torch.distributed.launch --nproc_per_node=8 train.py \
--model moganet_base --img_size 224 --drop_path 0.2 \
--epochs 300 --batch_size 128 --lr 1e-3 --weight_decay 0.05 \
--crop_pct 0.9 --min_lr 1e-5 \
--model_ema --model_ema_decay 0.9999 \
--data_dir /path/to/imagenet-1k \
--experiment /path/to/save_results

MogaNet-Large

Single-machine (8GPUs) with the input size of 224 with EMA:

python -m torch.distributed.launch --nproc_per_node=8 train.py \
--model moganet_large --img_size 224 --drop_path 0.3 \
--epochs 300 --batch_size 128 --lr 1e-3 --weight_decay 0.05 \
--crop_pct 0.9 --min_lr 1e-5 \
--model_ema --model_ema_decay 0.9999 \
--data_dir /path/to/imagenet-1k \
--experiment /path/to/save_results

MogaNet-XLarge

Single-machine (8GPUs) with the input size of 224 and the batch size of 512 with EMA:

python -m torch.distributed.launch --nproc_per_node=8 train.py \
--model moganet_xlarge --img_size 224 --drop_path 0.4 \
--epochs 300 --batch_size 64 --lr 1e-3 --weight_decay 0.05 \
--crop_pct 0.9 --min_lr 1e-5 \
--model_ema --model_ema_decay 0.9999 \
--data_dir /path/to/imagenet-1k \
--experiment /path/to/save_results

ImageNet-1K Validation

Taking MogaNet-T as an example, you can use the following command to run the validation on ImageNet val set:

python validate.py \
--model moganet_tiny --img_size 224 --crop_pct 0.9 \
--data_dir /path/to/imagenet-1k \
--checkpoint /path/to/checkpoint.tar.gz

In the example above, we test the model in 224x224x3 resolutions (modified by --img_size 224 or --img_size 224 224 3) without using the EMA model. Please add --use_ema to enable EMA evaluation for MogaNet-Small, MogaNet-Base, and MogaNet-Large. Running on one machine, we can use --num_gpu and use --amp to avoid OOM issues.

To evaluate other MogaNet variants, --model and --use_ema need to be changed. Examples with single-machine commands are given below:

MogaNet-XT