Pranjal Aggarwal, Ameet Deshpande, Karthik Narasimhan
Extreme classification (XC) involves classifying over large numbers of classes (thousands to millions), with real-world applications like news article classification and e-commerce product tagging. The zero-shot version of this task requires generalization to novel classes without additional supervision, like a new class "fidget spinner" for e-commerce product tagging. In this paper, we develop SemSup-XC, a model that achieves state-of-the-art zero-shot (ZS) and few-shot (FS) performance on three XC datasets spanning the domains of law, e-commerce, and Wikipedia. SemSup-XC uses automatically collected semantic class descriptions to represent classes ("fidget spinner" can be described as "A spinning toy for stress relief") and enables better generalization through our proposed hybrid matching module (Relaxed-COIL), which matches input instances to class descriptions using both (1) semantic similarity and (2) lexical similarity over contextual representations of similar tokens. Trained with contrastive learning, SemSup-XC significantly outperforms baselines and establishes state-of-the-art performance on all three datasets, by 5-12 precision@1 points on zero-shot and >10 precision@1 points on few-shot (K = 1), with similar gains for recall@10. Our ablation studies highlight the relative importance of our hybrid matching module (up to 2 P@1 improvement on AmazonCat) and automatically collected class descriptions (up to 5 P@1 improvement on AmazonCat).
First clone the repository and install the dependencies:

```shell
git clone https://github.com/princeton-nlp/semsup-xc.git
cd semsup-xc
pip install -r requirements.txt
```
Inside the `semsup-xc` folder, download the pre-processed datasets and scraped class descriptions from here, then unzip the downloaded file into the `datasets` folder.
To train both zero-shot and few-shot models on the datasets, run:

```shell
python main.py <config_file> <output_dir>
```

See the `configs` folder for a list of all relevant config files.
For all datasets, you can run the `main.py` script directly for evaluation by updating the config file: change the `pretrained_model` parameter and set `do_train` to `False`. You can also adjust the `random_sample` parameter to control the number of samples to evaluate on.
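For illustration, an evaluation config might contain a fragment like the one below. The checkpoint path and the `random_sample` value are hypothetical placeholders; check the files in the `configs` folder for the actual schema and keys:

```yaml
# Hypothetical evaluation-config fragment; only pretrained_model, do_train,
# and random_sample are parameters named in this README, the values are placeholders.
pretrained_model: checkpoints/model_best.ckpt  # path to a trained checkpoint
do_train: False                                # skip training, evaluate only
random_sample: 1000                            # number of samples to evaluate on
```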
For ensembling results with TF-IDF, use the `Evaluator.ipynb` notebook.
The previous method is slow and memory-hungry. For faster inference on the Amazon and Wikipedia datasets, use:

```shell
bash scripts/fastEval{DSET}.sh <config_file> <checkpoint_path>
```
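For example, `{DSET}` in the script name stands for the dataset. The sketch below just prints the command you would run; the config and checkpoint paths are hypothetical placeholders:

```shell
# {DSET} is substituted with the dataset name, e.g. Amazon.
# The config and checkpoint paths below are hypothetical placeholders.
DSET=Amazon
echo "bash scripts/fastEval${DSET}.sh configs/amazon_eval.yml checkpoints/amazon_best.ckpt"
```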
Pre-trained models can be downloaded from here.
@article{aggarwal2023semsupxc,
  title   = {SemSup-XC: Semantic Supervision for Zero and Few-shot Extreme Classification},
  author  = {Pranjal Aggarwal and Ameet Deshpande and Karthik Narasimhan},
  year    = {2023},
  journal = {arXiv preprint arXiv:2301.11309}
}
SemSup-XC is MIT licensed, as found in the LICENSE file.