Morphologically-Aware Vocabulary Reduction of Word Embeddings

This repository contains the code for training SubText, as well as scripts for evaluating its performance on various tasks.

We also provide a pre-trained version of SubText, as well as the Word2Vec (Mikolov et al., 2013) embeddings used for training.

Via Git LFS

git lfs pull -I "pretrained/pretrained.tar.gz"
cd pretrained/
tar -xzvf pretrained.tar.gz

Via Dropbox

cd pretrained/
wget https://www.dropbox.com/s/ha45ck0hjefdbme/pretrained.tar.gz
tar -xzvf pretrained.tar.gz
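
Once extracted, the wordpiece embeddings can be loaded with gensim. A minimal loading sketch, assuming the archive contains files in Word2Vec text format; the filename below is a placeholder for whichever file the archive actually contains:

from gensim.models import KeyedVectors

# Load wordpiece embeddings (filename is hypothetical; check the extracted archive).
pieces = KeyedVectors.load_word2vec_format("pretrained/REPLACE_WITH_EXTRACTED_FILE", binary=False)
print(len(pieces.index_to_key), pieces.vector_size)  # vocabulary size and dimensionality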

Usage

We recommend creating a conda environment named subtext with the provided environment.yml:

conda env create -f environment.yml

All scripts should be run from the src directory:

cd src/

Training SubText

The following command trains SubText (reducing to 1,000 wordpieces) on embeddings in a Word2Vec-format file (REPLACE_WITH_WV) and saves the output in the given directory (REPLACE_WITH_DIR):

python subtext/train_subtext.py --embeddings REPLACE_WITH_WV --out_dir REPLACE_WITH_DIR --wp_size 1000 
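
REPLACE_WITH_WV should be a file in Word2Vec format. If you need to produce one yourself, here is a minimal gensim sketch (the toy corpus and hyperparameters are placeholders, not the settings used in the paper):

from gensim.models import Word2Vec

# Toy corpus; substitute your own tokenized sentences.
sentences = [["morphologically", "aware", "vocabulary", "reduction"], ["word", "embeddings"]]
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1)
model.wv.save_word2vec_format("vectors.vec", binary=False)  # pass this path as REPLACE_WITH_WV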

SubText vocabularies of arbitrary sizes can then be generated from the training record file (*.rec) with the following command:

python subtext/recon_records.py --records REPLACE_WITH_RECORD 
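
The argument for the target vocabulary size is not shown above; assuming the script uses argparse like the experiment scripts, its options can be listed with -h:

python subtext/recon_records.py -h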

Experiments

The evaluation code can be found in the experiments directory, and can be run in a similar fashion:

python experiments/1_eval_word_reconstruction.py --piece_embs REPLACE_WITH_WV

Be sure to update the arguments with the appropriate values (use -h to check arguments for each experiment):

python experiments/1_eval_word_reconstruction.py -h

To run other experiments, replace 1_eval_word_reconstruction.py with the desired experiment.
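
For intuition, word reconstruction scores how well a word's original vector is recovered from its wordpiece vectors. A minimal sketch, assuming mean-pooling over a given segmentation and cosine similarity as the score; the experiment script's exact procedure may differ:

import numpy as np
from gensim.models import KeyedVectors

# Piece and word embeddings in Word2Vec format (paths follow the README's placeholder convention).
pieces = KeyedVectors.load_word2vec_format("REPLACE_WITH_PIECE_WV", binary=False)
words = KeyedVectors.load_word2vec_format("REPLACE_WITH_WORD_WV", binary=False)

def reconstruct(segmentation):
    # Mean-pool the piece vectors for one segmentation of a word.
    return np.mean([pieces[p] for p in segmentation], axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical segmentation of "unhappiness" into three pieces.
print(cosine(reconstruct(["un", "happi", "ness"]), words["unhappiness"]))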

Resources

Scripts for downloading and processing evaluation datasets can be found in data/.

Word2Vec (English)

Word2Vec was trained on English Wikipedia, using the CirrusSearch dumps provided by Wikimedia. The pretrained Word2Vec embeddings were trained on enwiki-20211011-cirrussearch-content_masked.json.
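
CirrusSearch dumps are newline-delimited JSON in Elasticsearch bulk format: metadata lines alternate with content lines, and the article text sits in the content line's "text" field. A minimal extraction sketch under that assumption (check the dump you download):

import json

# Stream article text out of a (decompressed) CirrusSearch dump.
with open("REPLACE_WITH_DUMP_JSON") as f:
    for _meta, doc in zip(f, f):  # consecutive (metadata, content) line pairs
        record = json.loads(doc)
        if record.get("text"):
            print(record.get("title"), len(record["text"]))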

Polyglot (Multilingual Word2Vec)

Pretrained language-specific embeddings are available from the Polyglot project page.

20News

We use the 20news-bydate.tar.gz dataset from the 20Newsgroups project page.

arXiv

We use the arxivData.json dataset. Details about the dataset can be found on the Kaggle dataset page.

BioASQ

We use the Task A Training (v.2019) dataset, which is only accessible after registering as a participant on the BioASQ Challenge website.

CLINC150

We use the data_full.json dataset from the project GitHub repository.
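
data_full.json stores each split as a list of [utterance, intent] pairs under keys such as "train", "val", and "test" (plus out-of-scope splits). A minimal loading sketch under that assumption:

import json

with open("data_full.json") as f:
    data = json.load(f)

utterance, intent = data["train"][0]  # each entry is an [utterance, intent] pair
print(intent, "->", utterance)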

Citation

Our paper can be cited in the following formats:

APA

Chia, C. C., Tkachenko, M., & Lauw, H. W. (2022). Morphologically-Aware Vocabulary Reduction of Word Embeddings. In 21st IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology.

BibTeX

@inproceedings{chia2022morphological,
    title={Morphologically-Aware Vocabulary Reduction of Word Embeddings},
    author={Chia, Chong Cher and Tkachenko, Maksim and Lauw, Hady W},
    booktitle={21st IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology},
    year={2022},
    organization={IEEE}
}
