S3A: Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment (AAAI'24 Oral)
Sheng Zhang, Muzammal Naseer, Guangyi Chen, Zhiqiang Shen, Salman Khan, Kun Zhang, Fahad Khan
This is the official repository for our paper: S3A: Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment.
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification. Despite this success, most traditional VLM-based methods are restricted by assumptions of partial source supervision or ideal target vocabularies, which rarely hold in open-world scenarios. In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary. To address this new problem, we propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning. Our S3A framework adopts a unique Cluster-Vote-Prompt-Realign (CVPR) algorithm, which iteratively groups unlabeled data to derive structural semantics for pseudo-supervision. The CVPR algorithm comprises iterative clustering on images, voting within each cluster to identify initial class candidates from the vocabulary, generating discriminative prompts with large language models to discern confusing candidates, and realigning images with the vocabulary as structural semantic alignment. Finally, we propose to self-train the CLIP image encoder with both individual and structural semantic alignment through a teacher-student learning strategy. Our comprehensive experiments across various generic and fine-grained benchmarks demonstrate that S3A substantially improves over existing VLM-based approaches, achieving a more than 15% accuracy improvement over CLIP on average.
We propose a Self Structural Semantic Alignment (S3A) framework to address Realistic Zero-Shot Classification, which assumes no annotations but only a broad target vocabulary.
We propose a Cluster-Vote-Prompt-Realign (CVPR) algorithm to reliably derive structural semantic alignments between images and the large vocabulary. Clustering unearths inherent grouping structures within image embeddings, producing meaningful image semantics. Voting associates each cluster with initial category candidates from the vocabulary, representing potential pseudo-alignments; these two steps can be executed iteratively to obtain more reliable candidates. Prompting leverages large language models (LLMs) to discern nuanced candidates by augmenting prompts with discriminative attributes. Re-alignment then calibrates the cluster-vocabulary alignment with the LLM-augmented prompts, yielding pseudo structural semantic alignment labels.
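For intuition, here is a minimal sketch of one Cluster-Vote-Realign round on precomputed, L2-normalized CLIP features (the LLM prompting step is only indicated by a comment). The function and variable names are illustrative assumptions and do not mirror the repo's actual implementation.

```python
# Minimal sketch of one Cluster-Vote-(Prompt)-Realign round; illustrative only.
import numpy as np
from sklearn.cluster import KMeans

def cluster_vote_realign(vfeatures, tclassifier, n_clusters, top_vote=3):
    """vfeatures: (N, D) normalized image features; tclassifier: (V, D) normalized text embeddings."""
    # 1) Cluster: group image embeddings to expose structural semantics.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(vfeatures)
    labels = kmeans.labels_

    # 2) Vote: within each cluster, keep the vocabulary entries that images
    #    most often rank first, as candidate category names.
    sims = vfeatures @ tclassifier.T              # (N, V) cosine similarities
    per_image_best = sims.argmax(axis=1)
    candidates = {}
    for c in range(n_clusters):
        votes = per_image_best[labels == c]
        counts = np.bincount(votes, minlength=tclassifier.shape[0])
        candidates[c] = counts.argsort()[::-1][:top_vote]

    # 3) Prompt (omitted here): LLM-generated discriminative prompts would refine
    #    the text embeddings of the shortlisted candidates before re-alignment.

    # 4) Realign: assign each cluster to its best candidate via the cluster
    #    centroid, giving pseudo structural semantic alignment labels per image.
    centroids = kmeans.cluster_centers_
    centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    pseudo_labels = np.full(len(vfeatures), -1, dtype=int)
    for c, cand in candidates.items():
        best = cand[int((centroids[c] @ tclassifier[cand].T).argmax())]
        pseudo_labels[labels == c] = best
    return pseudo_labels
```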
All the incorporated datasets are prepared under the `{HOME}/dataset` directory, and `config.py` contains the global path information. Change the global file paths in `config.py`, and the dataset paths in `data/imagenet_datasets.py`, if necessary.
For environment setup, please follow the instructions below:
### create environment
conda create -n sssa
conda activate sssa
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch -y
conda install scikit-learn tensorboard scipy ipykernel setuptools==58.0.4 -y
pip install -r requirements.txt
conda install -c conda-forge torchmetrics
### install dependent packages
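### (the following keeps only the inner robustness package and removes the outer robustness folder)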
mv robustness ./robustness_out
cd ./robustness_out
mv robustness ../
cd ..
rm -rf robustness_out
### (optional) if using soft-score
conda install -c conda-forge sentence-transformers
pip uninstall clip
### add notebook environment
python -m ipykernel install --user --name sssa
jupyter kernelspec list
nohup jupyter lab --allow-root >> ~/.jupyter.log 2>&1 &
Download `cache.zip` from the link and unzip it at `ipynb/`. All clusters and prompts are included.
Specifically, we release here our checkpoints/prompts for each benchmark in our paper:
| Setting | Dataset | Prompt | Model |
|---|---|---|---|
| RZSC | StanfordDogs | link | link |
| RZSC | ImageNet-100 | link | link |
| RZSC | ImageNet-1K | link | link |
| RZSC | BREEDS-Living17 | link | link |
| RZSC | BREEDS-Nonliving26 | link | link |
| RZSC | BREEDS-Entity13 | link | link |
| RZSC | BREEDS-Entity30 | link | link |
| RZSC-OOV | Caltech-101 | link | |
| RZSC-OOV | CIFAR100 | link | |
| RZSC-OOV | Oxford-IIIT Pet | link | |
Checkpoints coming soon...
The preparation includes three simple steps. Given a target dataset, we simply:
- (Optional) Run `Build-Vocabulary.ipynb` to build our vocabulary from all ImageNet (-21K and -1K) classes (synsets listed in `imagenet21k_wordnet_ids.txt`) based on WordNet synsets (listed in `wordnet_nouns.pkl`). Since our text classifier is frozen, the extracted classifier is saved at `ipynb/cache/wordnet_classifier.pth` for CLIP ViT-B/16 and at `ipynb/cache/wordnet_classifier_L.pth` for CLIP ViT-L/14 (a minimal illustrative sketch of this step is given further below).
- (Optional) Run `Clustering.ipynb` to compute the initial KMeans clusters on CLIP visual features. The resulting clusterings are saved in `ipynb/cache/cluster/`.
- (Optional) Run the CVPR algorithm with `CVPR-algo.ipynb`; its results are saved at `cache/training/cvpr_result-data={args.dataset}-clip(-L).pth`. These results provide the structural and instance semantic alignment labels for self-training. Here, the CVPR algorithm only provides the initial structural semantic alignment labels; during online training (Step 2), it is re-executed every epoch.
The above steps are optional since we provide these files. Download `cache.zip` from the link and unzip it at `ipynb/`. The specific instructions are introduced in each notebook.
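For reference, here is a rough sketch of what the vocabulary-building step does conceptually: encode the vocabulary names with a frozen CLIP text encoder and save the normalized embeddings as a classifier. The prompt template, example vocabulary, and output path below are placeholders (and the snippet assumes the OpenAI `clip` package), not the notebook's exact recipe.

```python
# Conceptual sketch of building a frozen vocabulary text classifier with CLIP.
# Vocabulary, prompt template, and output path are placeholders.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

vocabulary = ["golden retriever", "tabby cat", "sports car"]  # placeholder names
with torch.no_grad():
    tokens = clip.tokenize([f"a photo of a {name}" for name in vocabulary]).to(device)
    weights = model.encode_text(tokens)
    weights = weights / weights.norm(dim=-1, keepdim=True)   # (V, D) classifier rows

torch.save(weights.cpu(), "wordnet_classifier_demo.pth")      # illustrative path
```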
If you do not assume the ground-truth cluster number K is known, you may also need to execute the cluster-number estimation step explained below.
After the structural semantic alignment labels are computed, we conduct self-training with per-epoch structural semantic alignment and per-iteration instance-wise semantic alignment updates.
Here, we provide the training script `train_sssa.sh` to train our S3A model.
The detailed hyperparameters are explained below:
--vocab_name 'in21k' ### vocabulary name, choices=['in21k', 'in21k-L']
--n_iter_cluster_vote 3 ### iteration number of our CVPR algorithm per epoch
--w_ins 1.0 ### weight of instance semantic alignment
--uk 'False' ### whether K is unknown (i.e., the cluster number is not assumed), only for LV17, NL26, ET13, ET30
--epoch_init_warmup 2 ### number of epoch using initial structural semantic alignment
--w_str 0.25 ### weight of structural semantic alignment
--oov_dataset 'False' ### True for out-of-vocabulary datasets
--n_sampled_classes 100 ### set 100 for ImageNet-100 dataset
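For intuition on how the two weights interact, below is a simplified, illustrative sketch of a self-training objective that combines an instance-level alignment term (weighted by `--w_ins`) with a structural alignment term from the CVPR pseudo-labels (weighted by `--w_str`) under a teacher-student scheme. It is a sketch under our own simplifying assumptions, not the repo's exact loss.

```python
# Simplified sketch: an (EMA) teacher provides instance-level pseudo-labels,
# while the CVPR algorithm provides structural pseudo-labels; both supervise
# the student's logits over the vocabulary. Illustrative only.
import torch
import torch.nn.functional as F

def sssa_style_loss(student_logits, teacher_logits, structural_labels,
                    w_ins=1.0, w_str=0.25):
    # Instance semantic alignment: follow the teacher's per-image prediction.
    ins_targets = teacher_logits.argmax(dim=1)
    loss_ins = F.cross_entropy(student_logits, ins_targets)
    # Structural semantic alignment: follow the CVPR-derived pseudo-labels.
    loss_str = F.cross_entropy(student_logits, structural_labels)
    return w_ins * loss_ins + w_str * loss_str

# Hypothetical usage with random tensors (batch of 8, vocabulary of 1000):
student_logits = torch.randn(8, 1000, requires_grad=True)
teacher_logits = torch.randn(8, 1000)
structural_labels = torch.randint(0, 1000, (8,))
sssa_style_loss(student_logits, teacher_logits, structural_labels).backward()
```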
We can iteratively estimate the cluster number with the `cluster-estimation/estimate_k_save.sh` script. At each iteration, we specify the lower-bound cluster number with `--min_classes` and the upper-bound cluster number with `--max_classes`. To alleviate the computation overhead, we conduct cluster estimation on only a random subset of the data: a fraction `--ratio_rand_sample 0.5` of the data points, with at least `--min_rand_sample 30000` data points.
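As a rough illustration of a single estimation round, the sketch below sub-samples the features and scores candidate cluster numbers within the given bounds; the silhouette score is used purely as a stand-in criterion, and the actual `a_estimate_k` script may search and score differently.

```python
# Rough sketch of one cluster-number estimation round on a feature subset.
# The silhouette criterion is a stand-in, not the repo's actual scoring rule.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def estimate_k(vfeatures, min_classes, max_classes,
               ratio_rand_sample=0.5, min_rand_sample=30000, n_candidates=8):
    # Sub-sample the data to keep the search affordable.
    n = len(vfeatures)
    n_sub = min(n, max(int(n * ratio_rand_sample), min_rand_sample))
    sub = vfeatures[np.random.choice(n, n_sub, replace=False)]

    # Score a coarse grid of candidate K values within [min_classes, max_classes].
    best_k, best_score = min_classes, -1.0
    for k in np.linspace(min_classes, max_classes, n_candidates, dtype=int):
        labels = KMeans(n_clusters=int(k), n_init=4).fit_predict(sub)
        score = silhouette_score(sub, labels, sample_size=min(5000, n_sub))
        if score > best_score:
            best_k, best_score = int(k), score
    return best_k
```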
In our paper, we conduct three iterations in total: (1) estimate between [50, 2000] and get
The estimation is based on extracted CLIP visual features. We provide our pre-extracted files in `ipynb/cache/features/<.npy file>` for the `--vfeatures_fpath` argument. You can also extract your custom features using our `ipynb/Clustering.ipynb` notebook.
The inference script is explained below:

### the cluster number is inferred with the following command
### --max_classes / --min_classes: upper / lower bounds on the cluster number
### --vfeatures_fpath: extracted, normalized visual features of the specified dataset
### --search_mode: our iterative method is denoted as 'other'
### --ratio_rand_sample / --min_rand_sample: run the estimation on a random subset (at least 30000 data points) to speed up
### --save_prediction: name of the saved prediction file
python -m a_estimate_k \
    --max_classes 476 \
    --min_classes 50 \
    --vfeatures_fpath <PARENT>/ipynb/cache/features/vfeatures-${dataset_name}.npy \
    --search_mode other \
    --ratio_rand_sample 0.5 \
    --min_rand_sample 30000 \
    --method_kmeans 'kmeans' \
    --save_prediction ${dataset_name}
The estimated cluster number can then be used for clustering in `Clustering.ipynb` and for our CVPR algorithm in `CVPR-algo.ipynb`.
If you find our work interesting or useful, please consider ⭐ starring ⭐ our repo or citing our papers. ❤️ Thanks ~
@article{zhang2023towards,
title={Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment},
author={Zhang, Sheng and Naseer, Muzammal and Chen, Guangyi and Shen, Zhiqiang and Khan, Salman and Zhang, Kun and Khan, Fahad},
journal={arXiv preprint arXiv:2308.12960},
year={2023}
}
Our code is built upon Masked Unsupervised Self-training for Zero-shot Image Classification, and we gratefully acknowledge their efforts.