Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Sense-GVT/DeCLIP

DeCLIP is an open-source project that welcomes contributions and feedback. We hope the toolbox and benchmark serve the growing research community by providing a flexible, standardized toolkit for reimplementing existing methods and developing new Contrastive Language-Image Pretraining methods. You can find the following things in this repo:

  • Pre-trained models and training code to reproduce various Contrastive Language-Image Pretraining methods (e.g., CLIP, DeCLIP, SLIP, FILIP).
  • Various benchmark datasets for the large-scale Contrastive Language-Image Pretraining task.
  • Zero-shot transfer and linear classification evaluation scripts for downstream datasets.

We aim to democratize large-scale CLIP and build a fair and reproducible CLIP community. Our papers are available at:

DeCLIP: Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm.

CLIP-Benchmark: Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision.

Call for Papers & Participation

📢 Call for Papers & Participation: ECCV Workshop and Challenge on Computer Vision in the Wild (CVinW)

CVinW [Workshop] ICinW [IC Challenge] ODinW [OD Challenge]

Introduction

Recently, large-scale Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks. However, CLIP is quite data-hungry and requires 400M image-text pairs for pre-training, which restricts its adoption. This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate this limitation. We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, DeCLIP can learn generic visual features more efficiently. Instead of using only the single image-text contrastive supervision, we fully exploit the data's potential through (1) self-supervision within each modality; (2) multi-view supervision across modalities; and (3) nearest-neighbor supervision from other similar pairs. Benefiting from these intrinsic supervision signals, our DeCLIP-ResNet50 achieves 60.4% zero-shot top-1 accuracy on ImageNet, which is 0.8% above CLIP-ResNet50 while using 7.1× less data. Our DeCLIP-ResNet50 outperforms its counterpart on 8 out of 11 visual datasets when transferred to downstream tasks. Moreover, scaling up the model and compute also works well in our framework.
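To make the multi-view supervision idea concrete, here is a minimal sketch in plain Python. It is not the code from this repo: the function names, the temperature value of 0.07, and the choice to average a symmetric contrastive (InfoNCE) loss over every pairing of image views and text views are all assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the target of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric image-text contrastive loss over one batch.

    temperature=0.07 is an assumed value for this sketch.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


def multi_view_loss(image_views, text_views, temperature=0.07):
    """Average the contrastive loss over every (image view, text view) pairing.

    Each element of image_views / text_views is a batch of embeddings
    produced from one augmented view (hypothetical inputs).
    """
    losses = [
        info_nce(iv, tv, temperature)
        for iv in image_views
        for tv in text_views
    ]
    return sum(losses) / len(losses)
```

With two image views and two text views, `multi_view_loss` averages four contrastive terms instead of the single term used when only the original image-text pair is contrasted.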

[Figure: DeCLIP framework]

Updates

2022-09-19 📢 Call for Papers & Participation: ECCV Workshop and Challenge on Computer Vision in the Wild (CVinW)

2022-06-25 We released the checkpoints of each model for the benchmark.

2022-03-10 We updated the results of CLIP-Benchmark and released our YFCC15M dataset.

2022-02-22 We released our training code, benchmark, and model zoo! We will release the checkpoints of each model once the results are aligned. We hope this project serves the growing Contrastive Language-Image Pretraining research community by providing a flexible, standardized toolkit.

2021-11-06 First commit. Our code, dataset, and models will be released soon.

Installation

Please refer to get_started.md for installation and dataset_prepare.md for dataset preparation.

Get Started

Install PyTorch. The code has been tested with CUDA 11.2/cuDNN 8.1.0 and PyTorch 1.8.1.

First, prepare pre-training datasets and downstream classification datasets through get_started.md.

We organize models trained on different data into separate experiment directories (experiments/); check that directory for details.

1. Pre-training

You can run run.sh directly to train the corresponding model. We trained most of our models on 4 nodes with 8 GPUs each (4x8-gpu). Check the config in the experiment directory of the corresponding model for details.

2. Zero-shot Evaluation

Add the --evaluate argument to the run script to perform zero-shot evaluation.
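For readers unfamiliar with zero-shot evaluation, the sketch below shows the general idea in plain Python: each class is represented by a text embedding, and an image is assigned to the class whose text embedding is most similar. This is only an illustration under stated assumptions. The function names and the toy embeddings are hypothetical, and the repo's own evaluation code may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_predict(image_emb, class_text_embs):
    """Return the index of the class whose text embedding is most similar."""
    scores = [cosine(image_emb, t) for t in class_text_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def top1_accuracy(image_embs, labels, class_text_embs):
    """Zero-shot top-1 accuracy in percent."""
    correct = sum(
        zero_shot_predict(emb, class_text_embs) == label
        for emb, label in zip(image_embs, labels)
    )
    return 100.0 * correct / len(labels)


# Toy usage with made-up 2-D embeddings (hypothetical values).
class_text_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]
accuracy = top1_accuracy(image_embs, labels, class_text_embs)
```

In practice the image and text embeddings would come from the pre-trained image and text encoders. No classifier is trained on the downstream dataset, which is what makes the evaluation zero-shot.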

DeCLIP Model-Zoo

Our pre-trained visual backbone models (w/o text encoder)

| Method | Dataset | Model | Epochs | 0-shot | Config | Paper | Weights |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DeCLIP | Declip-88M | ResNet50 | 32 | 62.5 | config | paper | Google Drive |
| DeCLIP | Declip-88M | ViT-B32 | 32 | 66.2 | config | paper | Google Drive |

Our pre-trained DeCLIP models (w/ text encoder)

| Method | Dataset | Model | Epochs | 0-shot | Config | Paper | Weights |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DeCLIP | Declip-88M | ResNet50 | 32 | 62.5 | config | paper | Google Drive |
| DeCLIP | Declip-88M | ViT-B32 | 32 | 66.2 | config | paper | Google Drive |

CLIP-Benchmark

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision. Our paper is available on arXiv.

Witnessing the great success of CLIP, researchers continue to push its frontier. For instance, SLIP, DeCLIP, and FILIP achieve considerable improvements by embracing different kinds of supervision within the image-text pairs. However, it remains challenging to make a fair comparison between these methods, because they do not use consistent training recipes and even use different data. We propose CLIP-benchmark, a first attempt to evaluate, analyze, and benchmark CLIP and its variants. Moreover, we combine DeCLIP with FILIP, yielding the strongest variant, DeFILIP.

[Figure: DeCLIP framework]

Supported Models:

The following models are pre-trained on YFCC15M and evaluated on ImageNet-1K (ILSVRC2012).

| Method | Dataset | Model | Epochs | 0-shot | Config | Paper | Weights |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CLIP | YFCC-15M | ViT-B32 | 32 | 32.8 | config | paper | Google Drive |
| DeCLIP | YFCC-15M | ViT-B32 | 32 | 43.2 | config | paper | Google Drive |
| SLIP | YFCC-15M | ViT-B32 | 32 | 34.3 | config | paper | Google Drive |
| FILIP | YFCC-15M | ViT-B32 | 32 | 39.5 | config | paper | Google Drive |
| DeFILIP | YFCC-15M | ViT-B32 | 32 | 45.0 | config | paper | Google Drive |

| Method | Dataset | Model | Epochs | 0-shot | Config | Paper | Weights |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CLIP | YFCC-15M | ResNet50 | 32 | 37.2 | config | paper | Google Drive |
| DeCLIP | YFCC-15M | ResNet50 | 32 | 44.4 | config | paper | Google Drive |
| SLIP | YFCC-15M | ResNet50 | 32 | 28.5 | config | paper | -- |
| FILIP | YFCC-15M | ResNet50 | 32 | 21.3 | config | paper | -- |

Supported datasets:

| Dataset | Samples | Download | Paper |
| --- | --- | --- | --- |
| YFCC-15M | 15,388,848 | Google Drive | url |

Changelog

2022-02-22 Release our training code

2021-11-06 First commit

Citation

@inproceedings{li2022supervision,
      title={Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image  Pre-training Paradigm},
      author={Yangguang Li and Feng Liang and Lichen Zhao and Yufeng Cui and Wanli Ouyang and Jing Shao and Fengwei Yu and Junjie Yan},
      booktitle={International Conference on Learning Representations},
      year={2022},
      url={https://openreview.net/forum?id=zq1iJkNk3uN}
}

@misc{cui2022democratizing,
      title={Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision}, 
      author={Yufeng Cui and Lichen Zhao and Feng Liang and Yangguang Li and Jing Shao},
      year={2022},
      eprint={2203.05796},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License

For academic use, this project is licensed under the 2-clause BSD License. For commercial use, please contact the authors.

Acknowledgement

DeCLIP is an open-source project that welcomes contributions and feedback. We hope the toolbox and benchmark serve the growing research community by providing a flexible, standardized toolkit for reimplementing existing methods and developing new Contrastive Language-Image Pretraining methods.

Our framework is based on prototype.