⚠️ Disclaimer: All datasets hyperlinked from this page are not owned or
distributed by Google. The dataset is made available by third parties.
Please review the terms and conditions made available by the third parties
before using the data.
TF-Vision modeling library for computer vision provides a collection of
baselines and checkpoints for image classification, object detection, and
segmentation.
ResNet models trained with vanilla settings
Models are trained from scratch with batch size 4096 and 1.6 initial learning
rate.
Linear warmup is applied for the first 5 epochs.
Models trained with l2 weight regularization and ReLU activation.
Model
Resolution
Epochs
Top-1
Top-5
Download
ResNet-50
224x224
90
76.1
92.9
config
ResNet-50
224x224
200
77.1
93.5
config
ResNet-101
224x224
200
78.3
94.2
config
ResNet-152
224x224
200
78.7
94.3
config
ResNet-RS models trained with various settings
We support state-of-the-art ResNet-RS image
classification models with features:
ResNet-RS architectural changes and Swish activation. (Note that ResNet-RS
adopts ReLU activation in the paper.)
Regularization methods including Random Augment, 4e-5 weight decay, stochastic
depth, label smoothing and dropout.
New training methods including a 350-epoch schedule, cosine learning rate and
EMA.
Configs are in this directory .
Model
Resolution
Params (M)
Top-1
Top-5
Download
ResNet-RS-50
160x160
35.7
79.1
94.5
config | ckpt
ResNet-RS-101
160x160
63.7
80.2
94.9
config | ckpt
ResNet-RS-101
192x192
63.7
81.3
95.6
config | ckpt
ResNet-RS-152
192x192
86.8
81.9
95.8
config | ckpt
ResNet-RS-152
224x224
86.8
82.5
96.1
config | ckpt
ResNet-RS-152
256x256
86.8
83.1
96.3
config | ckpt
ResNet-RS-200
256x256
93.4
83.5
96.6
config | ckpt
ResNet-RS-270
256x256
130.1
83.6
96.6
config | ckpt
ResNet-RS-350
256x256
164.3
83.7
96.7
config | ckpt
ResNet-RS-350
320x320
164.3
84.2
96.9
config | ckpt
We support ViT and DEIT implementations.
ViT models trained under the DEIT settings:
model
resolution
Top-1
Top-5
Download
ViT-ti16
224x224
73.4
91.9
ckpt
ViT-s16
224x224
79.4
94.7
ckpt
ViT-b16
224x224
81.8
95.8
ckpt
ViT-l16
224x224
82.2
95.8
ckpt
Object Detection and Instance Segmentation
Common Settings and Notes
We provide models adopting ResNet-FPN and
SpineNet backbones based on detection frameworks:
Models are all trained on COCO train2017 and
evaluated on COCO val2017.
Training details:
Models finetuned from ImageNet pretrained
checkpoints adopt the 12 or 36 epochs schedule. Models trained from scratch
adopt the 350 epochs schedule.
The default training data augmentation implements horizontal flipping and
scale jittering with a random scale between [0.5, 2.0].
Unless noted, all models are trained with l2 weight regularization and ReLU
activation.
We use batch size 256 and stepwise learning rate that decays at the last 30
and 10 epoch.
We use square image as input by resizing the long side of an image to the
target size then padding the short side with zeros.
COCO Object Detection Baselines
RetinaNet (ImageNet pretrained)
Backbone
Resolution
Epochs
FLOPs (B)
Params (M)
Box AP
Download
R50-FPN
640x640
12
97.0
34.0
34.3
config
R50-FPN
640x640
72
97.0
34.0
36.8
config | ckpt
RetinaNet (Trained from scratch) with training features including:
Stochastic depth with drop rate 0.2.
Swish activation.
Backbone
Resolution
Epochs
FLOPs (B)
Params (M)
Box AP
Download
SpineNet-49
640x640
500
85.4
28.5
44.2
config | TB.dev
SpineNet-96
1024x1024
500
265.4
43.0
48.5
config | TB.dev
SpineNet-143
1280x1280
500
524.0
67.0
50.0
config | TB.dev
Mobile-size RetinaNet (Trained from scratch):
Backbone
Resolution
Epochs
FLOPs (B)
Params (M)
Box AP
Download
MobileNetv2
256x256
600
-
2.27
23.5
config
Mobile SpineNet-49
384x384
600
1.0
2.32
28.1
config | ckpt
Instance Segmentation Baselines
Mask R-CNN (Trained from scratch)
Backbone
Resolution
Epochs
FLOPs (B)
Params (M)
Box AP
Mask AP
Download
ResNet50-FPN
640x640
350
227.7
46.3
42.3
37.6
config
SpineNet-49
640x640
350
215.7
40.8
42.6
37.9
config
SpineNet-96
1024x1024
500
315.0
55.2
48.1
42.4
config
SpineNet-143
1280x1280
500
498.8
79.2
49.3
43.4
config
Cascade RCNN-RS (Trained from scratch)
Backbone
Resolution
Epochs
Params (M)
Box AP
Mask AP
Download
SpineNet-49
640x640
500
56.4
46.4
40.0
config
SpineNet-96
1024x1024
500
70.8
50.9
43.8
config
SpineNet-143
1280x1280
500
94.9
51.9
45.0
config
We support DeepLabV3 and
DeepLabV3+ architectures, with
Dilated ResNet backbones.
Backbones are pre-trained on ImageNet.
Model
Backbone
Resolution
Steps
mIoU
Download
DeepLabV3
Dilated Resnet-101
512x512
30k
78.7
DeepLabV3+
Dilated Resnet-101
512x512
30k
79.2
Model
Backbone
Resolution
Steps
mIoU
Download
DeepLabV3+
Dilated Resnet-101
1024x2048
90k
78.79
Common Settings and Notes
Kinetics-400 Action Recognition Baselines
Model
Input (frame x stride)
Top-1
Top-5
Download
SlowOnly
8 x 8
74.1
91.4
config
SlowOnly
16 x 4
75.6
92.1
config
R3D-50
32 x 2
77.0
93.0
config
R3D-RS-50
32 x 2
78.2
93.7
config
R3D-RS-101
32 x 2
79.5
94.2
-
R3D-RS-152
32 x 2
79.9
94.3
-
R3D-RS-200
32 x 2
80.4
94.4
-
R3D-RS-200
48 x 2
81.0
-
-
MoViNet-A0-Base
50 x 5
69.40
89.18
-
MoViNet-A1-Base
50 x 5
74.57
92.03
-
MoViNet-A2-Base
50 x 5
75.91
92.63
-
MoViNet-A3-Base
120 x 2
79.34
94.52
-
MoViNet-A4-Base
80 x 3
80.64
94.93
-
MoViNet-A5-Base
120 x 2
81.39
95.06
-
Kinetics-600 Action Recognition Baselines
Model
Input (frame x stride)
Top-1
Top-5
Download
SlowOnly
8 x 8
77.3
93.6
config
R3D-50
32 x 2
79.5
94.8
config
R3D-RS-200
32 x 2
83.1
-
-
R3D-RS-200
48 x 2
83.8
-
-
MoViNet-A0-Base
50 x 5
72.05
90.92
config
MoViNet-A1-Base
50 x 5
76.69
93.40
config
MoViNet-A2-Base
50 x 5
78.62
94.17
config
MoViNet-A3-Base
120 x 2
81.79
95.67
config
MoViNet-A4-Base
80 x 3
83.48
96.16
config
MoViNet-A5-Base
120 x 2
84.27
96.39
config