Revisiting Few-Shot Object Detection with Vision-Language Models

NOTE: Switch to the MQDET branch to run the MQDet experiments.

IMPORTANT: Use the test_set.json file when evaluating performance.

We are releasing a Foundational FSOD challenge as part of the Workshop on Visual Perception and Learning in an Open World at CVPR 2024. We are accepting submissions until June 7, 2024!

Abstract

Few-shot object detection (FSOD) benchmarks have advanced techniques for detecting new categories with limited annotations. Existing benchmarks repurpose well-established datasets like COCO by partitioning categories into base and novel classes for pre-training and fine-tuning, respectively. However, these benchmarks do not reflect how FSOD is deployed in practice. Rather than only pre-training on a small number of base categories, we argue that it is more practical to fine-tune a foundation model (e.g., a vision-language model (VLM) pre-trained on web-scale data) for a target domain. Surprisingly, we find that zero-shot inference from VLMs like GroundingDINO significantly outperforms the state-of-the-art (48.3 vs. 33.1 AP) on COCO. However, such zero-shot models can still be misaligned to target concepts of interest. For example, trailers on the web may be different from trailers in the context of autonomous vehicles. In this work, we propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external datasets and fine-tuned on K-shots per target class. Further, we note that current FSOD benchmarks are actually federated datasets containing exhaustive annotations for each category on a subset of the data. We leverage this insight to propose simple strategies for fine-tuning VLMs with federated losses. We demonstrate the effectiveness of our approach on LVIS and nuImages, improving over prior work by 5.9 AP.

Installation

See installation instructions.

Data

See datasets/README.md

Models

Create a models/ directory in the repository root and download the pre-trained model here.

Training

python train_net.py --num-gpus 1 --config-file <config_path>  --pred_all_class  OUTPUT_DIR_PREFIX <root_output_dir>

Config Details

Key Config Fields

  • DATASETS.TRAIN: Specify training split according to registered datasets.
  • DATASETS.TEST: Specify testing split according to registered datasets.
  • ROI_BOX_HEAD.FED_LOSS_NUM_CAT: Number of categories to be sampled for FedLoss.
  • ROI_BOX_HEAD.USE_FED_LOSS: Flag to enable federated loss.
  • ROI_BOX_HEAD.INVERSE_WEIGHTS: Flag to enable inverse frequency weights with federated loss.
  • ROI_BOX_HEAD.ALL_ANN_FILE: Annotation file used by the FedLoss sampling strategy. When training with Pseudo-Negatives, set this to the predictions from the teacher model.
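
To make the role of FED_LOSS_NUM_CAT more concrete, the sketch below shows one common way categories are sampled for a federated loss: categories present in the ground truth are always kept, and the remaining slots are filled by sampling other categories with probability proportional to a power of their frequency. This is an illustrative, standard-library-only sketch and not this repository's implementation; the function name `sample_fed_loss_classes`, the `power=0.5` default, and the toy frequencies are all assumptions.

```python
import random


def sample_fed_loss_classes(gt_classes, num_fed_loss_classes, class_freq,
                            power=0.5, seed=None):
    """Pick the categories that contribute to the loss for one batch.

    Hypothetical helper: positives are always kept, the rest are sampled
    without replacement with weight class_freq[c] ** power.
    """
    rng = random.Random(seed)
    positives = sorted(set(gt_classes))
    positive_set = set(positives)

    # Candidate negatives are all categories that are not positives.
    candidates = [c for c in range(len(class_freq)) if c not in positive_set]
    weights = [class_freq[c] ** power for c in candidates]

    # Number of extra categories needed to reach the requested budget.
    num_extra = max(0, min(num_fed_loss_classes - len(positives),
                           len(candidates)))

    sampled = []
    for _ in range(num_extra):
        if sum(weights) <= 0:
            break  # only zero-frequency categories remain
        idx = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        sampled.append(candidates.pop(idx))
        weights.pop(idx)

    return positives + sampled


# Toy example: 6 categories, ground truth contains categories 2 and 5.
classes = sample_fed_loss_classes([2, 2, 5], 4, [10, 1, 3, 7, 0, 2], seed=0)
print(classes[:2])  # the positives [2, 5] always come first
```

The classification loss would then be computed only over the returned categories, so categories that were neither annotated nor sampled are not penalized as negatives.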

Pseudo-Negatives Training

  1. Train (or use an off-the-shelf pre-trained) teacher model.
  2. Create a new config to run inference on the FSOD training set (e.g., set DATASETS.TEST to nuimages_fsod_train_seed_0_shots_10).
  3. Convert the generated predictions .pth file to COCO format using tools/convert_preds_to_ann.py. Be sure to specify the confidence threshold used to filter pseudo-labels. The predictions are saved in the directory corresponding to the trainset evaluation. See the sample command below:
python tools/convert_preds_to_ann.py --pred_path_train <path_trainset_eval_pth_file> --dataset_name nuimages_fsod_train_seed_0_shots_10 --conf_thresh 0.2
  4. Set ROI_BOX_HEAD.ALL_ANN_FILE to the generated predictions.
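
The confidence filtering in step 3 can be pictured with the following minimal sketch. It is not the code in tools/convert_preds_to_ann.py; it assumes the predictions have already been loaded into a list of per-image dicts, each with an `image_id` and an `instances` list whose entries hold `category_id`, `bbox` (in `[x, y, width, height]` form), and `score`. The function name `predictions_to_coco_annotations` and the toy data are hypothetical.

```python
def predictions_to_coco_annotations(predictions, conf_thresh=0.2):
    """Turn teacher predictions into COCO-style annotation dicts.

    Hypothetical helper: predictions scoring below conf_thresh are
    dropped, the rest are kept as pseudo-labels.
    """
    annotations = []
    next_id = 1
    for image_pred in predictions:
        for inst in image_pred["instances"]:
            if inst["score"] < conf_thresh:
                continue  # filter out low-confidence pseudo-labels
            x, y, w, h = inst["bbox"]
            annotations.append({
                "id": next_id,
                "image_id": image_pred["image_id"],
                "category_id": inst["category_id"],
                "bbox": [x, y, w, h],
                "area": w * h,
                "iscrowd": 0,
                "score": inst["score"],
            })
            next_id += 1
    return annotations


# Toy example: one image with a confident and an unconfident detection.
toy_predictions = [{
    "image_id": 7,
    "instances": [
        {"category_id": 1, "bbox": [0.0, 0.0, 10.0, 20.0], "score": 0.9},
        {"category_id": 3, "bbox": [5.0, 5.0, 2.0, 2.0], "score": 0.1},
    ],
}]
pseudo_labels = predictions_to_coco_annotations(toy_predictions, conf_thresh=0.2)
print(len(pseudo_labels))  # 1
```

A lower threshold keeps more pseudo-labels at the cost of more noise, so the value passed to --conf_thresh is worth tuning per dataset.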

Inference

python train_net.py --num-gpus 8 --config-file <config_path>  --pred_all_class --eval-only  MODEL.WEIGHTS <model_path> OUTPUT_DIR_PREFIX <root_output_dir>

TODO

  • Code cleanup
  • Release FSOD training files
  • FIOD support and config
  • FSOD data split creation: nuImages, along with the new split
  • Release trained model

  • LVIS support in data and training models

Acknowledgment

We thank the authors of the following repositories for their open-source implementations which were used in building the current codebase:

  1. Detic: Detecting Twenty-thousand Classes using Image-level Supervision
  2. Detectron2

Citation

If you find our paper and code repository useful, please cite us:

@article{madan2023revisiting,
  title={Revisiting Few-Shot Object Detection with Vision-Language Models},
  author={Madan, Anish and Peri, Neehar and Kong, Shu and Ramanan, Deva},
  journal={arXiv preprint arXiv:2312.14494},
  year={2023}
}