This repository contains the code for training SubText, as well as scripts for evaluating its performance on various tasks.
We also provide a pre-trained version of SubText, as well as the Word2Vec (Mikolov et al., 2013) embeddings used for training.
Via Git LFS
git lfs pull -I "pretrained/pretrained.tar.gz"
cd pretrained/
tar -xzvf pretrained.tar.gz
Via Dropbox
cd pretrained/
wget https://www.dropbox.com/s/ha45ck0hjefdbme/pretrained.tar.gz
tar -xzvf pretrained.tar.gz
We recommend creating a conda environment named subtext with the provided environment.yml:
conda env create -f environment.yml
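Once the environment has been created, activate it before running any of the scripts:
conda activate subtext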
All scripts should be run from the src directory:
cd src/
The following command trains SubText (to 1000 wordpieces) on embeddings in a Word2Vec-format file (REPLACE_WITH_WV) and saves the output in the given directory (REPLACE_WITH_DIR):
python subtext/train_subtext.py --embeddings REPLACE_WITH_WV --out_dir REPLACE_WITH_DIR --wp_size 1000
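To sanity-check that your embeddings file is in the expected Word2Vec format before training, you can load it with gensim. This is only an illustrative sketch (it is not part of the repository, and the path is a placeholder); set binary=True if your embeddings are stored in the binary Word2Vec format:

from gensim.models import KeyedVectors

# Load the Word2Vec-format embeddings used as input to train_subtext.py.
wv = KeyedVectors.load_word2vec_format("REPLACE_WITH_WV", binary=False)

# Report vocabulary size and embedding dimensionality.
print(len(wv.index_to_key), wv.vector_size)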
SubText vocabularies of arbitrary sizes can then be generated from the training record file (*.rec) with the following command:
python subtext/recon_records.py --records REPLACE_WITH_RECORD
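The record-file path and the remaining arguments (for example, the target number of wordpieces) can be listed with -h, assuming the same argparse-style interface as train_subtext.py:
python subtext/recon_records.py -h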
The evaluation code can be found in the experiments directory and can be run in a similar fashion:
python experiments/1_eval_word_reconstruction.py --piece_embs REPLACE_WITH_WV
Be sure to update the arguments with the appropriate values (use -h to check the arguments for each experiment):
python experiments/1_eval_word_reconstruction.py -h
To run other experiments, replace 1_eval_word_reconstruction.py with the desired experiment script.
Scripts for downloading and processing the evaluation datasets can be found in data/.
Word2Vec was trained on English Wikipedia, using the CirrusSearch dumps provided by Wikimedia. The pretrained Word2Vec embeddings were trained on enwiki-20211011-cirrussearch-content_masked.json.
Pretrained language-specific embeddings are available from the Polyglot project page.
We use the 20news-bydate.tar.gz dataset from the 20Newsgroups project page.
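For a quick look at the corpus itself (the preprocessing scripts in data/ should still be used for the actual experiments), the same 20news-bydate split can also be fetched through scikit-learn. This snippet is only an illustrative sketch and is not how the repository loads the data:

from sklearn.datasets import fetch_20newsgroups

# Fetches the 20news-bydate training split; headers, footers and quotes are stripped.
train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
print(len(train.data), "training documents across", len(train.target_names), "newsgroups")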
We use the arxivData.json dataset. Details about the dataset can be found on the Kaggle dataset page.
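Assuming arxivData.json is a single JSON array of paper records (the field names vary, so none are assumed here), it can be loaded with the standard library; this is only an illustrative sketch:

import json

# Load the arXiv metadata records from the Kaggle dump.
with open("arxivData.json") as f:
    papers = json.load(f)
print(len(papers), "papers loaded")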
We use the Task A Training (v.2019) dataset, which is only accessible after registering as a participant on the BioASQ Challenge website.
We use the data_full.json dataset from the project's GitHub repository.
Our paper can be cited in the following formats:
Chia, C., Tkachenko, M., & Lauw, H. (2022). Morphologically-Aware Vocabulary Reduction of Word Embeddings. In 21st IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology.
@inproceedings{chia2022morphological,
title={Morphologically-Aware Vocabulary Reduction of Word Embeddings},
author={Chia, Chong Cher and Tkachenko, Maksim and Lauw, Hady W},
booktitle={21st IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology},
year={2022},
organization={IEEE}
}