Yonghao He1,*,†, Hu Su2,*,📧, Haiyong Yu1,*, Cong Yang3, Wei Sui1, Cong Wang1, Song Liu4,📧
* Equal contribution, † Project lead, 📧 Corresponding author
1 D-Robotics,
2 State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences,
3 BeeLab, School of Future Science and Engineering, Soochow University,
4 School of Information Science and Technology, ShanghaiTech University
[2024-12-27]: Decoupled Open-Set Object Detector (DOSOD), with ultra real-time speed and superior accuracy, is released.
Since YOLO-World established a new state of the art in open-vocabulary object detection, open-vocabulary detectors have been applied in a wide range of scenarios, and real-time open-vocabulary detection has attracted significant attention. In our paper, the Decoupled Open-Set Object Detector (DOSOD) is proposed as a practical and highly efficient solution for real-time open-set object detection (OSOD) tasks in robotic systems. Specifically, DOSOD builds on the YOLO-World pipeline by integrating a vision-language model (VLM) with a detector. A Multilayer Perceptron (MLP) adaptor is developed to project text embeddings extracted by the VLM into a joint space, within which the detector learns region representations of class-agnostic proposals. Cross-modality features are aligned directly in the joint space, avoiding complex cross-modality feature interactions and thereby improving computational efficiency. During testing, DOSOD behaves like a traditional closed-set detector, effectively bridging the gap between closed-set and open-set detection.
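The decoupled design above can be summarized with a minimal sketch. This is not the actual DOSOD implementation: the class names, embedding dimensions, MLP depth, and the use of cosine similarity as the classification score are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPAdaptor(nn.Module):
    """Projects frozen VLM text embeddings into the joint space (illustrative)."""
    def __init__(self, text_dim=512, joint_dim=512, hidden_dim=512, num_layers=3):
        super().__init__()
        layers, dim = [], text_dim
        for _ in range(num_layers - 1):
            layers += [nn.Linear(dim, hidden_dim), nn.ReLU(inplace=True)]
            dim = hidden_dim
        layers.append(nn.Linear(dim, joint_dim))
        self.mlp = nn.Sequential(*layers)

    def forward(self, text_embeds):          # (num_classes, text_dim)
        return self.mlp(text_embeds)         # (num_classes, joint_dim)

def classify_proposals(region_feats, joint_text_embeds):
    """Score class-agnostic proposals against classes by cosine similarity in the joint space."""
    region = F.normalize(region_feats, dim=-1)        # (num_proposals, joint_dim)
    text = F.normalize(joint_text_embeds, dim=-1)     # (num_classes, joint_dim)
    return region @ text.t()                          # (num_proposals, num_classes)

# Toy usage: 100 proposals, 80 classes, 512-d embeddings (all stand-in tensors).
adaptor = MLPAdaptor()
text_embeds = torch.randn(80, 512)      # stand-in for VLM text embeddings
region_feats = torch.randn(100, 512)    # stand-in for detector region features
scores = classify_proposals(region_feats, adaptor(text_embeds))
print(scores.shape)                     # torch.Size([100, 80])
```

Because the projected text embeddings become a fixed matrix once training is done, the text branch can be folded into the detection head for deployment, which is essentially what the re-parameterization step described below relies on.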
Our implementation is based on YOLO-World; the newly added code can be found in the following scripts:
- yolo_world/models/detectors/dosod.py, yolo_world/models/dense_heads/dosod_head.py
  These two scripts contain the core code of DOSOD.
- configs/dosod
  This folder contains all DOSOD configs for training, evaluation and inference.
- tools/generate_text_prompts_dosod.py
  Generates text embeddings for DOSOD.
- tools/reparameterize_dosod.py
  Re-parameterizes the original weights with the generated text embeddings.
- tools/count_num_parameters.py
  Simple script for counting the number of model parameters.
- tools/evaluate_latency.sh
  Shell script for latency evaluation on an NVIDIA GPU.
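As a side note on tools/count_num_parameters.py: counting the parameters of a PyTorch model is typically a one-liner. The snippet below is a generic sketch, not the repository script itself, and the toy model is made up.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of parameters in a PyTorch model."""
    return sum(p.numel() for p in model.parameters())

# Example with a stand-in model; for reference, DOSOD-S is reported at ~11.48M parameters.
toy = nn.Sequential(nn.Conv2d(3, 64, 3), nn.ReLU(), nn.Conv2d(64, 64, 3))
print(f"{count_parameters(toy) / 1e6:.2f}M parameters")
```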
Following YOLO-World, we also pre-trained DOSOD-S/M/L from scratch on public datasets and conducted zero-shot evaluation on LVIS minival and COCO val2017. All pre-trained models are released.
Zero-shot evaluation on LVIS minival:

model | Pre-train Data | Size | APmini | APr | APc | APf | weights |
---|---|---|---|---|---|---|---|
YOLO-Worldv1-S (repo) | O365+GoldG | 640 | 24.3 | 16.6 | 22.1 | 27.7 | HF Checkpoints 🤗 |
YOLO-Worldv1-M (repo) | O365+GoldG | 640 | 28.6 | 19.7 | 26.6 | 31.9 | HF Checkpoints 🤗 |
YOLO-Worldv1-L (repo) | O365+GoldG | 640 | 32.5 | 22.3 | 30.6 | 36.1 | HF Checkpoints 🤗 |
YOLO-Worldv1-S (paper) | O365+GoldG | 640 | 26.2 | 19.1 | 23.6 | 29.8 | HF Checkpoints 🤗 |
YOLO-Worldv1-M (paper) | O365+GoldG | 640 | 31.0 | 23.8 | 29.2 | 33.9 | HF Checkpoints 🤗 |
YOLO-Worldv1-L (paper) | O365+GoldG | 640 | 35.0 | 27.1 | 32.8 | 38.3 | HF Checkpoints 🤗 |
YOLO-Worldv2-S | O365+GoldG | 640 | 22.7 | 16.3 | 20.8 | 25.5 | HF Checkpoints 🤗 |
YOLO-Worldv2-M | O365+GoldG | 640 | 30.0 | 25.0 | 27.2 | 33.4 | HF Checkpoints 🤗 |
YOLO-Worldv2-L | O365+GoldG | 640 | 33.0 | 22.6 | 32.0 | 35.8 | HF Checkpoints 🤗 |
DOSOD-S | O365+GoldG | 640 | 26.7 | 19.9 | 25.1 | 29.3 | HF Checkpoints 🤗 |
DOSOD-M | O365+GoldG | 640 | 31.3 | 25.7 | 29.6 | 33.7 | HF Checkpoints 🤗 |
DOSOD-L | O365+GoldG | 640 | 34.4 | 29.1 | 32.6 | 36.6 | HF Checkpoints 🤗 |
NOTE: The results of YOLO-Worldv1 reported in the repo and in the paper are different.
Zero-shot evaluation on COCO val2017:

model | Pre-train Data | Size | AP | AP50 | AP75 |
---|---|---|---|---|---|
YOLO-Worldv1-S (paper) | O365+GoldG | 640 | 37.6 | 52.3 | 40.7 |
YOLO-Worldv1-M (paper) | O365+GoldG | 640 | 42.8 | 58.3 | 46.4 |
YOLO-Worldv1-L (paper) | O365+GoldG | 640 | 44.4 | 59.8 | 48.3 |
YOLO-Worldv2-S | O365+GoldG | 640 | 37.5 | 52.0 | 40.7 |
YOLO-Worldv2-M | O365+GoldG | 640 | 42.8 | 58.2 | 46.7 |
YOLO-Worldv2-L | O365+GoldG | 640 | 45.4 | 61.0 | 49.4 |
DOSOD-S | O365+GoldG | 640 | 36.1 | 51.0 | 39.1 |
DOSOD-M | O365+GoldG | 640 | 41.7 | 57.1 | 45.2 |
DOSOD-L | O365+GoldG | 640 | 44.6 | 60.5 | 48.4 |
We use the trtexec tool from TensorRT 8.6.1.6 to measure latency in FP16 mode. All models are re-parameterized with the 80 categories from COCO. Detailed logs can be found by clicking the FPS values.
model | Params | FPS |
---|---|---|
YOLO-Worldv1-S | 13.32M | 1007 |
YOLO-Worldv1-M | 28.93M | 702 |
YOLO-Worldv1-L | 47.38M | 494 |
YOLO-Worldv2-S | 12.66M | 1221 |
YOLO-Worldv2-M | 28.20M | 771 |
YOLO-Worldv2-L | 46.62M | 553 |
DOSOD-S | 11.48M | 1582 |
DOSOD-M | 26.31M | 922 |
DOSOD-L | 44.19M | 632 |
NOTE: FPS = 1000 / mean GPU Compute Time (ms)
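The note above is a direct conversion from the mean GPU compute time reported by trtexec (in milliseconds) to frames per second; a tiny sketch:

```python
def fps_from_mean_latency(mean_gpu_compute_time_ms: float) -> float:
    """FPS = 1000 / mean GPU compute time (ms), as used for the table above."""
    return 1000.0 / mean_gpu_compute_time_ms

# Example: a mean GPU compute time of ~0.632 ms corresponds to ~1582 FPS,
# consistent with the DOSOD-S row in the table above.
print(round(fps_from_mean_latency(0.632)))  # 1582
```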
We evaluate the real-time performance of the YOLO-World-v2 models and our DOSOD models on the D-Robotics RDK X5 development kit. The models are re-parameterized with the 1203 categories defined in LVIS. We run them on the RDK X5 using either 1 thread or 8 threads, in INT8 or INT16 quantization mode.
model | FPS (1 thread) | FPS (8 threads) |
---|---|---|
YOLO-Worldv2-S (INT16/INT8) | 5.962/11.044 | 6.386/12.590 |
YOLO-Worldv2-M (INT16/INT8) | 4.136/7.290 | 4.340/7.930 |
YOLO-Worldv2-L (INT16/INT8) | 2.958/5.377 | 3.060/5.720 |
DOSOD-S (INT16/INT8) | 12.527/31.020 | 14.657/47.328 |
DOSOD-M (INT16/INT8) | 8.531/20.238 | 9.471/26.36 |
DOSOD-L (INT16/INT8) | 5.663/12.799 | 6.069/14.939 |
Most of the steps are consistent with those in the YOLO-World README.md file. Some extra points that need attention are as follows:
- clone the project:
  git clone https://github.com/D-Robotics-AI-Lab/DOSOD.git
- latency evaluation: we provide a script (tools/evaluate_latency.sh) to evaluate the latency on an NVIDIA GPU
- note: we pre-train DOSOD on 8 NVIDIA RTX 4090 GPUs with a batch size of 128, while YOLO-World uses 32 NVIDIA V100 GPUs with a batch size of 512.
- Step 1: generate text embeddings
  python tools/generate_text_prompts_dosod.py path_to_config_file path_to_model_file --text path_to_texts_json_file --out-dir dir_to_save_embedding_npy_file
  - path_to_config_file is the config used for training
  - path_to_model_file is the pth model file corresponding to path_to_config_file
  - path_to_texts_json_file contains the vocabulary, for example data/texts/coco_class_texts.json (a sketch of the assumed file format is given after Step 4)
- Step 2: re-parameterize the model weights
  python tools/reparameterize_dosod.py --model path_to_model_file --out-dir dir_to_save_rep_model_file --text-embed path_to_embedding_npy_file
  - path_to_embedding_npy_file is the output from Step 1
- Step 3: export ONNX using the rep-style config
  python deploy/export_onnx.py path_to_rep_config_file path_to_rep_model_file --without-nms --work-dir dir_to_save_rep_onnx_file
  - path_to_rep_config_file is the modified config for re-parameterization, for example configs/dosod/rep_dosod_mlp3x_s_100e_1x8gpus_obj365v1_goldg_train_lvis_minival.py
  - path_to_rep_model_file is the output from Step 2
- Step 4: run the ONNX demo
  python deploy/onnx_demo.py path_to_rep_onnx_file path_to_test_image path_to_texts_json_file --output-dir dir_to_save_result_image --onnx-nms
  - path_to_rep_onnx_file is the output from Step 3
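For reference, the vocabulary file passed via --text is assumed to follow the YOLO-World class-texts format, i.e. a JSON list in which each entry is a list of text prompts (synonyms) for one category. The snippet below is only a sketch under that assumption; the output path and the three example categories are made up.

```python
import json

# Assumed format (mirroring YOLO-World's data/texts/*.json files):
# a list of categories, each category being a list of one or more text prompts.
custom_vocabulary = [
    ["person"],
    ["delivery box", "parcel"],
    ["charging station"],
]

# Hypothetical output path; use it as path_to_texts_json_file in Steps 1 and 4.
with open("data/texts/custom_class_texts.json", "w") as f:
    json.dump(custom_vocabulary, f)
```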
To make the model available for the RDK X5, we need to use another config file in Step 3:
- path_to_rep_config_file should be a file with the suffix _d-robotics.py, for example configs/dosod/rep_dosod_mlp3x_s_d-robotics.py
For more details, you can refer to the code.
To run the ONNX model on the RDK X5, you can refer to the website for more help.
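If you just want to sanity-check an exported ONNX model on a desktop CPU/GPU before moving to the board, a minimal onnxruntime sketch might look like the following. The preprocessing (plain resize to 640x640, RGB, [0, 1] scaling), the model path, and the output handling are assumptions for illustration; deploy/onnx_demo.py is the authoritative reference.

```python
import cv2
import numpy as np
import onnxruntime as ort

def preprocess(image_path, size=640):
    """Assumed preprocessing: resize to 640x640, BGR -> RGB, scale to [0, 1], NCHW."""
    img = cv2.imread(image_path)
    img = cv2.resize(img, (size, size))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    return img.transpose(2, 0, 1)[None]  # (1, 3, 640, 640)

# Hypothetical ONNX path produced by Step 3.
session = ort.InferenceSession("dosod_s_rep.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
print("inputs:", [i.name for i in session.get_inputs()])
print("outputs:", [o.name for o in session.get_outputs()])

# Run on a hypothetical test image and print output shapes.
outputs = session.run(None, {input_name: preprocess("demo.jpg")})
for meta, out in zip(session.get_outputs(), outputs):
    print(meta.name, np.asarray(out).shape)
```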
We sincerely thank YOLO-World, mmyolo, mmdetection, GLIP, and transformers for providing their wonderful code to the community!
If you find DOSOD is useful in your research or applications, please consider giving us a star 🌟 and citing it.
@article{He2024DOSOD,
  title={A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space},
  author={He, Yonghao and Su, Hu and Yu, Haiyong and Yang, Cong and Sui, Wei and Wang, Cong and Liu, Song},
  journal={arXiv preprint arXiv:2412.14680},
  year={2024}
}
DOSOD is released under the GPL-3.0 license, and commercial usage is supported. If you need a commercial license for DOSOD, please feel free to contact us.