This repository contains the code for training SubText, as well as scripts for evaluating its performance on various tasks.
We also provide a pre-trained version of SubText, as well as the Word2Vec (Mikolov et al., 2013) embeddings used for training.
Via Git LFS
git lfs pull -I "pretrained/pretrained.tar.gz"
cd pretrained/
tar -xzvf pretrained.tar.gz
Via Dropbox
cd pretrained/
wget https://www.dropbox.com/s/ha45ck0hjefdbme/pretrained.tar.gz
tar -xzvf pretrained.tar.gz
We recommend creating a conda environment named subtext with the provided environment.yml:
conda env create -f environment.yml
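Once the environment has been created, activate it before running any of the scripts:
conda activate subtext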
All scripts should be run from the src directory:
cd src/
The following command trains SubText (to 1000 wordpieces) on embeddings in a Word2Vec-format file (REPLACE_WITH_WV) and saves the output in the given directory (REPLACE_WITH_DIR):
python subtext/train_subtext.py --embeddings REPLACE_WITH_WV --out_dir REPLACE_WITH_DIR --wp_size 1000
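To sanity-check that your embeddings file is in the expected Word2Vec format before training, you can load it with gensim. This is only an illustrative sketch (it is not part of the repository, and the path is a placeholder); set binary=True if your embeddings are stored in the binary Word2Vec format:

from gensim.models import KeyedVectors

# Load the Word2Vec-format embeddings used as input to train_subtext.py.
wv = KeyedVectors.load_word2vec_format("REPLACE_WITH_WV", binary=False)

# Report vocabulary size and embedding dimensionality.
print(len(wv.index_to_key), wv.vector_size)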
SubText vocabularies of arbitrary sizes can then be generated from the training record file (*.rec) with the following command:
python subtext/recon_records.py --records REPLACE_WITH_RECORD
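The record-file path and the remaining arguments (for example, the target number of wordpieces) can be listed with -h, assuming the same argparse-style interface as train_subtext.py:
python subtext/recon_records.py -h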
The evaluation code can be found in the experiments directory and can be run in a similar fashion:
python experiments/1_eval_word_reconstruction.py --piece_embs REPLACE_WITH_WV
Be sure to update the arguments with the appropriate values (use -h to check the arguments for each experiment):
python experiments/1_eval_word_reconstruction.py -h
To run other experiments, replace 1_eval_word_reconstruction.py with the desired experiment script.
Scripts for downloading and processing the evaluation datasets can be found in data/.
Word2Vec was trained on English Wikipedia, using the CirrusSearch dumps provided by Wikimedia. The pretrained Word2Vec embeddings were trained on enwiki-20211011-cirrussearch-content_masked.json.
Pretrained language-specific embeddings are available from the Polyglot project page.
We use the 20news-bydate.tar.gz dataset from the 20Newsgroups project page.
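For a quick look at the corpus itself (the preprocessing scripts in data/ should still be used for the actual experiments), the same 20news-bydate split can also be fetched through scikit-learn. This snippet is only an illustrative sketch and is not how the repository loads the data:

from sklearn.datasets import fetch_20newsgroups

# Fetches the 20news-bydate training split; headers, footers and quotes are stripped.
train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
print(len(train.data), "training documents across", len(train.target_names), "newsgroups")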
We use the arxivData.json dataset. Details about the dataset can be found on the Kaggle dataset page.
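Assuming arxivData.json is a single JSON array of paper records (the field names vary, so none are assumed here), it can be loaded with the standard library; this is only an illustrative sketch:

import json

# Load the arXiv metadata records from the Kaggle dump.
with open("arxivData.json") as f:
    papers = json.load(f)
print(len(papers), "papers loaded")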
We use the Task A Training (v.2019) dataset, which is only accessible after registering as a participant on the BioASQ Challenge website.
We use the data_full.json dataset from the project's GitHub repository.
Our paper can be cited in the following formats:
Chia, C., Tkachenko, M., & Lauw, H. (2022). Morphologically-Aware Vocabulary Reduction of Word Embeddings. In 21st IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology.
@inproceedings{chia2022morphological,
title={Morphologically-Aware Vocabulary Reduction of Word Embeddings},
author={Chia, Chong Cher and Tkachenko, Maksim and Lauw, Hady W},
booktitle={21st IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology},
year={2022},
organization={IEEE}
}