By Tingting Liang*, Xiaojie Chu*, Yudong Liu*, Yongtao Wang, Zhi Tang, Wei Chu, Jingdong Chen, Haibin Ling.
This repo is the official implementation of CBNetV2. It is based on mmdetection and Swin Transformer for Object Detection.
Contact us with [email protected], [email protected], [email protected].
- 2024/10/21: Update code for CB-EVA-02-L. We achieve new SOTA instance segmentation results (55.5 -> 56.1 mask AP) on COCO!
CBNetV2 achieves strong single-model performance on COCO object detection (60.1 box AP and 52.3 mask AP on test-dev) without extra training data.
More results and models can be found in the model zoo.
Backbone | Lr Schd | box mAP (minival) | #params | FLOPs | config | log | model |
---|---|---|---|---|---|---|---|
DB-ResNet50 | 1x | 40.8 | 69M | 284G | config | github | github |
Backbone | Lr Schd | box mAP (minival) | mask mAP (minival) | #params | FLOPs | config | log | model |
---|---|---|---|---|---|---|---|---|
DB-Swin-T | 3x | 50.2 | 44.5 | 76M | 357G | config | github | github |
Backbone | Lr Schd | box mAP (minival/test-dev) | mask mAP (minival/test-dev) | #params | FLOPs | config | model |
---|---|---|---|---|---|---|---|
DB-Swin-S | 3x | 56.3/56.9 | 48.6/49.1 | 156M | 1016G | config | github |
We use ImageNet-22k pretrained checkpoints of Swin-B and Swin-L. Unlike the regular HTC, our HTC uses a 4conv1fc bbox head.
Backbone | Lr Schd | box mAP (minival/test-dev) | mask mAP (minival/test-dev) | #params | FLOPs | config | model |
---|---|---|---|---|---|---|---|
DB-Swin-B | 20e | 58.4/58.7 | 50.7/51.1 | 235M | 1348G | config | github |
DB-Swin-L | 1x | 59.1/59.4 | 51.0/51.6 | 453M | 2162G | config (test only) | github |
DB-Swin-L (TTA) | 1x | 59.6/60.1 | 51.8/52.3 | 453M | - | config (test only) | github |
TTA denotes test time augmentation.
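As a rough illustration of one common TTA ingredient, the sketch below maps boxes predicted on a horizontally flipped image back to the original image coordinates. It is a generic example rather than the TTA pipeline used for the numbers above; the function name `unflip_boxes` and the sample values are assumptions made for illustration.

```python
def unflip_boxes(boxes, image_width):
    """Map [x1, y1, x2, y2] boxes from a horizontally flipped image
    back to the coordinates of the original image."""
    # After a horizontal flip, x becomes (width - x), so the left and
    # right edges swap roles; y coordinates are unchanged.
    return [[image_width - x2, y1, image_width - x1, y2]
            for x1, y1, x2, y2 in boxes]


# Hypothetical prediction on the flipped copy of a 640-pixel-wide image.
flipped_preds = [[10.0, 20.0, 110.0, 220.0]]
restored = unflip_boxes(flipped_preds, image_width=640)
print(restored)  # [[530.0, 20.0, 630.0, 220.0]]
```

The restored boxes can then be merged with the predictions from the unflipped image, for example with NMS.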
Backbone | Lr Schd | mask mAP (test-dev) | #params | config | model |
---|---|---|---|---|---|
DB-EVA02-L | 1x | 56.1 | 674M | config | HF |
Notes:
- Pre-trained models of Swin Transformer can be downloaded from Swin Transformer for ImageNet Classification.
- Pre-trained models of EVA02 can be downloaded from EVA02 pretrain.
Please refer to get_started.md for installation and dataset preparation.
# single-gpu testing (w/o segm result)
python tools/test.py <CONFIG_FILE> <DET_CHECKPOINT_FILE> --eval bbox
# multi-gpu testing (w/ segm result)
tools/dist_test.sh <CONFIG_FILE> <DET_CHECKPOINT_FILE> <GPU_NUM> --eval bbox segm
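When evaluating several checkpoints, it can help to build these command lines programmatically. The following is a minimal sketch that only assembles the two commands shown above; the helper name `build_test_command` and the file names passed to it are hypothetical.

```python
import shlex


def build_test_command(config, checkpoint, gpu_num=1, metrics=("bbox",)):
    """Assemble the single-GPU or multi-GPU test command as an argument list."""
    if gpu_num == 1:
        cmd = ["python", "tools/test.py", config, checkpoint]
    else:
        cmd = ["tools/dist_test.sh", config, checkpoint, str(gpu_num)]
    # --eval takes one or more metric names, e.g. bbox and segm.
    return cmd + ["--eval", *metrics]


# Hypothetical file names; substitute your own config and checkpoint.
cmd = build_test_command("my_config.py", "my_checkpoint.pth",
                         gpu_num=8, metrics=("bbox", "segm"))
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run` from the repository root.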
To train a detector with pre-trained models, run:
# multi-gpu training
tools/dist_train.sh <CONFIG_FILE> <GPU_NUM>
For example, to train a Faster R-CNN model with a Dual-ResNet50 backbone on 8 GPUs, run:
# path of pre-training model (resnet50) is already in config
tools/dist_train.sh configs/cbnet/faster_rcnn_cbv2d1_r50_fpn_1x_coco.py 8
As another example, to train a Mask R-CNN model with a Dual-Swin-T backbone on 8 GPUs, run:
tools/dist_train.sh configs/cbnet/mask_rcnn_cbv2_swin_tiny_patch4_window7_mstrain_480-800_adamw_3x_coco.py 8 --cfg-options model.pretrained=<PRETRAIN_MODEL>
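The `--cfg-options model.pretrained=<PRETRAIN_MODEL>` argument overrides a nested config entry addressed by a dotted key. The sketch below shows the idea with plain dictionaries; it is not the actual MMDetection/mmcv implementation, and the helper `apply_dotted_override`, the sample config, and the checkpoint path are all hypothetical.

```python
from copy import deepcopy


def apply_dotted_override(cfg, dotted_key, value):
    """Return a copy of cfg with the entry addressed by 'a.b.c' set to value."""
    out = deepcopy(cfg)
    node = out
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        # Walk down the nesting, creating intermediate dicts if missing.
        node = node.setdefault(key, {})
    node[leaf] = value
    return out


# Hypothetical stand-in for a config whose pretrained path is not yet set.
base_cfg = {"model": {"type": "MaskRCNN", "pretrained": None}}
new_cfg = apply_dotted_override(base_cfg, "model.pretrained",
                                "path/to/pretrained.pth")
print(new_cfg["model"]["pretrained"])  # path/to/pretrained.pth
```

The original config is left unchanged, so the same base file can be reused with different pre-trained weights.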
Following Swin Transformer for Object Detection, we use apex for mixed precision training by default. To install apex, run:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
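After installation, a quick check that the package is importable can save a failed training launch. This is a minimal sketch using only the Python standard library; the helper name `apex_available` is an assumption, and the check confirms only that a top-level `apex` package is found, not that its CUDA extensions were built.

```python
import importlib.util


def apex_available():
    """Return True if a top-level 'apex' package can be found."""
    return importlib.util.find_spec("apex") is not None


print("apex available:", apex_available())
```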
We list some documents and tutorials from MMDetection, which may be helpful to you.
If you use our code or models, please consider citing our paper CBNet: A Composite Backbone Network Architecture for Object Detection.
@ARTICLE{9932281,
author={Liang, Tingting and Chu, Xiaojie and Liu, Yudong and Wang, Yongtao and Tang, Zhi and Chu, Wei and Chen, Jingdong and Ling, Haibin},
journal={IEEE Transactions on Image Processing},
title={CBNet: A Composite Backbone Network Architecture for Object Detection},
year={2022},
volume={31},
pages={6893-6906},
doi={10.1109/TIP.2022.3216771}}
This project is free for academic research use only; commercial use requires authorization. For commercial licensing, please contact [email protected].
Original CBNet: See CBNet: A Novel Composite Backbone Network Architecture for Object Detection.