A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space

Yonghao He1,*,🌟, Hu Su2,*,📧, Haiyong Yu1,*, Cong Yang3, Wei Sui1, Cong Wang1, Song Liu4,📧

* Equal contribution, 🌟 Project lead, 📧 Corresponding author

1 D-Robotics,
2 State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences,
3 BeeLab, School of Future Science and Engineering, Soochow University,
4 School of Information Science and Technology, ShanghaiTech University


🔥 Updates

[2024-12-27]: Decoupled Open-Set Object Detection (DOSOD), with ultra real-time speed and superior accuracy, is released.

1. Introduction

1.1 Brief Introduction of DOSOD

With YOLO-World establishing a new state of the art in open-vocabulary object detection, real-time open-vocabulary detection has attracted significant attention and is being applied in an increasing range of scenarios. In our paper, Decoupled Open-Set Object Detection (DOSOD) is proposed as a practical and highly efficient solution for real-time OSOD tasks in robotic systems. Specifically, DOSOD builds on the YOLO-World pipeline by integrating a vision-language model (VLM) with a detector. A multilayer perceptron (MLP) adaptor is developed to transform the text embeddings extracted by the VLM into a joint space, within which the detector learns region representations of class-agnostic proposals. Cross-modality features are aligned directly in the joint space, avoiding complex feature interactions and thereby improving computational efficiency. During testing, DOSOD behaves like a traditional closed-set detector, effectively bridging the gap between closed-set and open-set detection.
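To make the decoupling concrete, here is a minimal, illustrative sketch of the joint-space alignment. This is our own toy code rather than the repository's implementation; the layer sizes, dimensions, and tensor names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextMLPAdaptor(nn.Module):
    """Illustrative MLP adaptor that maps frozen VLM text embeddings
    into the joint space shared with the detector's region features."""
    def __init__(self, text_dim: int = 512, joint_dim: int = 512, hidden_dim: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, joint_dim),
        )

    def forward(self, text_embeds: torch.Tensor) -> torch.Tensor:
        return self.mlp(text_embeds)

# Conceptual usage: classification scores are similarities between
# class-agnostic region features and adapted text embeddings.
text_embeds = torch.randn(80, 512)      # e.g. VLM text embeddings for 80 classes
region_feats = torch.randn(100, 512)    # e.g. features of 100 class-agnostic proposals

adaptor = TextMLPAdaptor()
joint_text = F.normalize(adaptor(text_embeds), dim=-1)
joint_region = F.normalize(region_feats, dim=-1)
logits = joint_region @ joint_text.t()  # (100, 80) similarity scores, no cross-modal fusion
```

Because the adapted text embeddings are fixed at test time, they can later be folded into the head weights by the re-parameterization step described in Section 4.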

1.2 Repo Structure

Our implementation is based on YOLO-World; the newly added code can be found in the following scripts:

2. Model Overview

Following YOLO-World, we also pre-trained DOSOD-S/M/L from scratch on public datasets and conducted zero-shot evaluations on LVIS minival and COCO val2017. All pre-trained models are released.

2.1 Zero-shot Evaluation on LVIS minival

| model | Pre-train Data | Size | APmini | APr | APc | APf | weights |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  | O365+GoldG | 640 | 24.3 | 16.6 | 22.1 | 27.7 | HF Checkpoints 🤗 |
|  | O365+GoldG | 640 | 28.6 | 19.7 | 26.6 | 31.9 | HF Checkpoints 🤗 |
|  | O365+GoldG | 640 | 32.5 | 22.3 | 30.6 | 36.1 | HF Checkpoints 🤗 |
|  | O365+GoldG | 640 | 26.2 | 19.1 | 23.6 | 29.8 | HF Checkpoints 🤗 |
|  | O365+GoldG | 640 | 31.0 | 23.8 | 29.2 | 33.9 | HF Checkpoints 🤗 |
|  | O365+GoldG | 640 | 35.0 | 27.1 | 32.8 | 38.3 | HF Checkpoints 🤗 |
| YOLO-Worldv2-S | O365+GoldG | 640 | 22.7 | 16.3 | 20.8 | 25.5 | HF Checkpoints 🤗 |
| YOLO-Worldv2-M | O365+GoldG | 640 | 30.0 | 25.0 | 27.2 | 33.4 | HF Checkpoints 🤗 |
| YOLO-Worldv2-L | O365+GoldG | 640 | 33.0 | 22.6 | 32.0 | 35.8 | HF Checkpoints 🤗 |
| DOSOD-S | O365+GoldG | 640 | 26.7 | 19.9 | 25.1 | 29.3 | HF Checkpoints 🤗 |
| DOSOD-M | O365+GoldG | 640 | 31.3 | 25.7 | 29.6 | 33.7 | HF Checkpoints 🤗 |
| DOSOD-L | O365+GoldG | 640 | 34.4 | 29.1 | 32.6 | 36.6 | HF Checkpoints 🤗 |

NOTE: The YOLO-Worldv1 results reported in the official repo differ from those in the paper.

2.2 Zero-shot Inference on COCO dataset

| model | Pre-train Data | Size | AP | AP50 | AP75 |
| --- | --- | --- | --- | --- | --- |
|  | O365+GoldG | 640 | 37.6 | 52.3 | 40.7 |
|  | O365+GoldG | 640 | 42.8 | 58.3 | 46.4 |
|  | O365+GoldG | 640 | 44.4 | 59.8 | 48.3 |
| YOLO-Worldv2-S | O365+GoldG | 640 | 37.5 | 52.0 | 40.7 |
| YOLO-Worldv2-M | O365+GoldG | 640 | 42.8 | 58.2 | 46.7 |
| YOLO-Worldv2-L | O365+GoldG | 640 | 45.4 | 61.0 | 49.4 |
| DOSOD-S | O365+GoldG | 640 | 36.1 | 51.0 | 39.1 |
| DOSOD-M | O365+GoldG | 640 | 41.7 | 57.1 | 45.2 |
| DOSOD-L | O365+GoldG | 640 | 44.6 | 60.5 | 48.4 |

2.3 Latency On RTX 4090

We use the trtexec tool from TensorRT 8.6.1.6 to measure latency in FP16 mode. All models are re-parameterized with the 80 COCO categories. The latency logs can be found by clicking the FPS values.

| model | Params | FPS |
| --- | --- | --- |
| YOLO-Worldv1-S | 13.32M | 1007 |
| YOLO-Worldv1-M | 28.93M | 702 |
| YOLO-Worldv1-L | 47.38M | 494 |
| YOLO-Worldv2-S | 12.66M | 1221 |
| YOLO-Worldv2-M | 28.20M | 771 |
| YOLO-Worldv2-L | 46.62M | 553 |
| DOSOD-S | 11.48M | 1582 |
| DOSOD-M | 26.31M | 922 |
| DOSOD-L | 44.19M | 632 |

NOTE: FPS = 1000 / GPU Compute Time [mean], where the mean GPU compute time is reported by trtexec in milliseconds.
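As a rough sketch of how the numbers above can be reproduced, the snippet below runs trtexec on an exported ONNX model and derives FPS from the reported mean GPU compute time. The ONNX filename is a hypothetical placeholder, and the regular expression assumes trtexec's usual "GPU Compute Time: ... mean = X ms ..." summary line.

```python
import re
import subprocess

# Hypothetical ONNX file produced by the export step in Section 4; FP16 as in the table.
cmd = ["trtexec", "--onnx=dosod_s_rep.onnx", "--fp16"]
out = subprocess.run(cmd, capture_output=True, text=True).stdout

# trtexec prints a summary line such as:
#   GPU Compute Time: min = ..., max = ..., mean = 0.632 ms, median = ...
match = re.search(r"GPU Compute Time:.*?mean = ([\d.]+) ms", out)
if match:
    mean_ms = float(match.group(1))
    print(f"FPS = 1000 / {mean_ms:.3f} ms = {1000.0 / mean_ms:.0f}")
```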

2.4 Latency On RDK X5

We evaluate the real-time performance of the YOLO-Worldv2 models and our DOSOD models on the D-Robotics RDK X5 development kit. The models are re-parameterized with the 1203 categories defined in LVIS and run on the RDK X5 using either 1 thread or 8 threads, under INT8 or INT16 quantization.

| model | FPS (1 thread) | FPS (8 threads) |
| --- | --- | --- |
| YOLO-Worldv2-S (INT16/INT8) | 5.962 / 11.044 | 6.386 / 12.590 |
| YOLO-Worldv2-M (INT16/INT8) | 4.136 / 7.290 | 4.340 / 7.930 |
| YOLO-Worldv2-L (INT16/INT8) | 2.958 / 5.377 | 3.060 / 5.720 |
| DOSOD-S (INT16/INT8) | 12.527 / 31.020 | 14.657 / 47.328 |
| DOSOD-M (INT16/INT8) | 8.531 / 20.238 | 9.471 / 26.36 |
| DOSOD-L (INT16/INT8) | 5.663 / 12.799 | 6.069 / 14.939 |

3. Getting Started

Most of the steps are consistent with those in the YOLO-World README.md. A few extra points need attention:

  • clone the project: git clone https://github.com/D-Robotics-AI-Lab/DOSOD.git
  • latency evaluation: we provide a script to evaluate latency on NVIDIA GPUs
  • note: we pre-train DOSOD on 8 NVIDIA RTX 4090 GPUs with a batch size of 128, while YOLO-World uses 32 NVIDIA V100 GPUs with a batch size of 512.

4. Reparameterization and Inference

4.1 On NVIDIA RTX 4090

  • Step 1: generate text embeddings
python tools/generate_text_prompts_dosod.py path_to_config_file path_to_model_file --text path_to_texts_json_file --out-dir dir_to_save_embedding_npy_file

path_to_config_file is the config used for training
path_to_model_file is the .pth model file corresponding to path_to_config_file
path_to_texts_json_file contains the vocabulary, for example data/texts/coco_class_texts.json
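If you need a custom vocabulary, the sketch below shows one way to write such a file. It assumes the texts JSON uses the same list-of-per-class-name-lists layout as YOLO-World's text files; please verify against data/texts/coco_class_texts.json in the repo.

```python
import json

# Illustrative only: we assume each entry is a list of names for one class,
# as in YOLO-World's text JSON files. Check the repo's files for the exact layout.
texts = [["person"], ["bicycle"], ["car"]]
with open("my_class_texts.json", "w") as f:
    json.dump(texts, f)
```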

  • Step 2: reparameterize model weights
python tools/reparameterize_dosod.py --model path_to_model_file --out-dir dir_to_save_rep_model_file --text-embed path_to_embedding_npy_file

path_to_embedding_npy_file is the embedding .npy file produced in Step 1
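Conceptually, re-parameterization folds the adapted text embeddings into the detection head, so the exported model scores proposals with a fixed linear layer, just like a closed-set detector. The snippet below is a stand-alone toy illustration of that idea (our own sketch with assumed dimensions, not the logic of tools/reparameterize_dosod.py).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# In practice you would load the .npy embedding file saved by Step 1;
# here we use random values so the sketch runs stand-alone (80 classes, 512-d joint space).
text_embeds = torch.randn(80, 512)

# Illustrative: bake the normalized embeddings into a fixed classification layer,
# so scoring a region feature against all classes becomes a single matrix multiply,
# which is how a closed-set detector head behaves after re-parameterization.
cls_layer = nn.Linear(text_embeds.shape[1], text_embeds.shape[0], bias=False)
with torch.no_grad():
    cls_layer.weight.copy_(F.normalize(text_embeds, dim=-1))

region_feat = torch.randn(1, 512)                      # one proposal's joint-space feature
scores = cls_layer(F.normalize(region_feat, dim=-1))   # (1, 80) per-class scores
print(scores.shape)
```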

  • Step 3: export onnx using rep-style config
python deploy/export_onnx.py path_to_rep_config_file path_to_rep_model_file --without-nms --work-dir dir_to_save_rep_onnx_file

path_to_rep_config_file is the modified config for re-parameterization, for example configs/dosod/rep_dosod_mlp3x_s_100e_1x8gpus_obj365v1_goldg_train_lvis_minival.py
path_to_rep_model_file is the output from Step 2

  • Step 4: run onnx demo
python deploy/onnx_demo.py path_to_rep_onnx_file path_to_test_image path_to_texts_json_file --output-dir dir_to_save_result_image --onnx-nms

path_to_rep_onnx_file is the ONNX file produced in Step 3
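Putting the four steps together, an end-to-end run for DOSOD-S might look like the sketch below. Only data/texts/coco_class_texts.json and the rep config name come from this README; every other path (checkpoint names, work_dirs locations, the training config name, and the .npy filename) is a hypothetical placeholder to replace with your own.

```bash
# Hypothetical paths throughout; adjust to your local setup.

# Step 1: generate text embeddings for the 80 COCO classes
# (use your actual training config and checkpoint here)
python tools/generate_text_prompts_dosod.py \
    path_to_training_config.py dosod_s.pth \
    --text data/texts/coco_class_texts.json --out-dir work_dirs/embeddings

# Step 2: fold the embeddings into the detector weights
python tools/reparameterize_dosod.py --model dosod_s.pth \
    --out-dir work_dirs/rep --text-embed work_dirs/embeddings/coco_class_texts.npy

# Step 3: export ONNX with the rep-style config (config name from this README)
python deploy/export_onnx.py \
    configs/dosod/rep_dosod_mlp3x_s_100e_1x8gpus_obj365v1_goldg_train_lvis_minival.py \
    work_dirs/rep/dosod_s_rep.pth --without-nms --work-dir work_dirs/onnx

# Step 4: run the ONNX demo on a test image
python deploy/onnx_demo.py work_dirs/onnx/dosod_s_rep.onnx demo.jpg \
    data/texts/coco_class_texts.json --output-dir work_dirs/results --onnx-nms
```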

4.2 On RDK X5

To make the model deployable on the RDK X5, we need to use a different config file in Step 3:
path_to_rep_config_file should be a config whose filename ends with _d-robotics.py, for example configs/dosod/rep_dosod_mlp3x_s_d-robotics.py. For more details, you can refer to the code file.
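For example, exporting the small model for the RDK X5 reuses the Step 3 command with the D-Robotics config; in the sketch below, the checkpoint and output paths are hypothetical placeholders.

```bash
python deploy/export_onnx.py configs/dosod/rep_dosod_mlp3x_s_d-robotics.py \
    work_dirs/rep/dosod_s_rep.pth --without-nms --work-dir work_dirs/onnx_rdkx5
```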

To run the ONNX model on the RDK X5, you can refer to the website for more help.

Acknowledgement

We sincerely thank YOLO-World, mmyolo, mmdetection, GLIP, and transformers for providing their wonderful code to the community!

Citations

If you find DOSOD useful in your research or applications, please consider giving us a star 🌟 and citing it.

@article{He2024DOSOD,
  title={A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space},
  author={He, Yonghao and Su, Hu and Yu, Haiyong and Yang, Cong and Sui, Wei and Wang, Cong and Liu, Song},
  journal={arXiv preprint arXiv:2412.14680},
  year={2024}
}

License

DOSOD is released under the GPL-v3 license, and commercial usage is supported. If you need a commercial license for DOSOD, please feel free to contact us.
