Text Summarization using Pointer Attention Models
This is a PyTorch implementation of *Get To The Point: Summarization with Pointer-Generator Networks* by See et al.
You will need:
- PyTorch v0.2 with CUDA support
- the Visdom visualization package for easy monitoring of training progress in a web browser
- tqdm for terminal-level progress updates
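If they are not already installed, Visdom and tqdm can typically be installed with `pip install visdom tqdm`.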
Download the fully preprocessed data splits here and save yourself the trouble of downloading CoreNLP and running the tokenization yourself. This implementation only uses the `Finished_Files` directory containing the `train.bin`, `val.bin`, and `test.bin` splits. Move these files to the `finished_files` directory. The only other required file, `vocabulary.bin`, has already been provided in `finished_files` in this repo.
Alternatively, follow the instructions here to download and preprocess the dataset. Use the `preprocess.py` script given here instead to generate the data splits, as this version does not have TensorFlow dependencies.
The original hyperparameters described in the paper are memory intensive, so this implementation uses a smaller RNN `hidden_size` as the default setting. All other hyperparameters are kept the same. You can trade off `vocabulary_size`, `batch_size`, `hidden_size`, `max_abstract_size`, and `max_article_size` to fit your memory budget.
- Fire up a Visdom server: `python -m visdom.server`
- To train using the default settings, run from the repo's root directory: `CUDA_VISIBLE_DEVICES=0 python main.py`
- Monitor the training progress by going to `127.0.0.1:8097` in your web browser, or at the remote URL if you're running the code over SSH.
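If you're training on a remote machine over SSH, one common option (a general SSH tip, not something specific to this repo) is to forward the Visdom port to your local machine with something like `ssh -N -L 8097:localhost:8097 user@remote-host` (where `user@remote-host` is a placeholder), and then open `127.0.0.1:8097` locally.

For reference, the minimal sketch below shows one way a training loop can stream a loss curve to the Visdom server; it is purely illustrative, and the window name and helper function are assumptions rather than code from this repo.

```python
# Illustrative only: stream a scalar loss to a running Visdom server.
import numpy as np
import visdom

vis = visdom.Visdom()  # connects to http://127.0.0.1:8097 by default

def plot_loss(step, loss, win='training-loss'):
    # Append to the window if it exists, otherwise create it.
    vis.line(X=np.array([step]), Y=np.array([loss]), win=win,
             update='append' if vis.win_exists(win) else None,
             opts=dict(title='Training loss', xlabel='step', ylabel='loss'))
```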
Configurations can be changed using command-line options; run `python main.py --help` for a list of all options.
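As an illustration, a lower-memory run might look like `CUDA_VISIBLE_DEVICES=0 python main.py --hidden_size 256 --batch_size 8`; the flag names here are an assumption mirroring the hyperparameter names above, so confirm the exact options against the `--help` output.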
The model is evaluated periodically during training on a sample from `test.bin`, and decoding is done using beam search. The model is also saved after every epoch in `Saved-Models`.
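For readers unfamiliar with beam search, the sketch below is a generic, self-contained illustration of the idea; it is not the decoder implemented in this repo, and `step_fn`, `start_id`, and `end_id` are hypothetical stand-ins for the model's next-token scorer and its special start/end tokens.

```python
# Generic beam search over a hypothetical next-token scorer (illustrative only).
def beam_search(step_fn, start_id, end_id, beam_width=4, max_len=100):
    beams = [([start_id], 0.0)]  # (token ids, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            # step_fn returns a list of log-probabilities over the next token.
            for tok, log_p in enumerate(step_fn(tokens)):
                candidates.append((tokens + [tok], score + log_p))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == end_id else beams).append((tokens, score))
        if not beams:  # every surviving hypothesis has ended
            break
    pool = finished or beams
    # Length-normalize so longer hypotheses are not unfairly penalized.
    return max(pool, key=lambda c: c[1] / len(c[0]))[0]
```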
To bootstrap with pre-trained embeddings, you will need to obtain pre-trained GloVe/Word2Vec embeddings for the words in your vocabulary; OOV words can be assigned random values. Save these as a PyTorch tensor in `embeds.pkl`, and make sure the vocabulary size matches the size of the tensor. The default setting is to initialize with random word embeddings, since that has been reported to perform better.
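If you do want to try pre-trained embeddings, a rough sketch of building `embeds.pkl` could look like the following; the `vocab` list (word at index `i` maps to embedding row `i`), the GloVe file path, and the helper name are illustrative assumptions, not code from this repo.

```python
# Illustrative sketch: build an embedding tensor from GloVe and pickle it.
import pickle
import torch

def build_embeddings(vocab, glove_path, dim=100):
    glove = {}
    with open(glove_path, encoding='utf-8') as f:
        for line in f:
            word, *values = line.rstrip().split(' ')
            glove[word] = torch.tensor([float(v) for v in values])
    embeds = torch.randn(len(vocab), dim) * 0.1  # OOV words stay random
    for i, word in enumerate(vocab):
        if word in glove:
            embeds[i] = glove[word]
    return embeds

# embeds = build_embeddings(vocab, 'glove.6B.100d.txt')
# with open('embeds.pkl', 'wb') as f:
#     pickle.dump(embeds, f)
```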