vTLDR provides a seq2seq model, built on the pretrained PhoBERT checkpoint, for a small downstream task: Vietnamese news summarization. A GUI lets users enter plain text or a news URL (only the tuoitre.vn domain is supported at the moment) and returns the corresponding summary.
Notebook guide (train_test_infer): Colab. App demo video: YouTube.
Clone the repo
git clone https://github.com/ngfuong/vTLDR
cd vTLDR
Create a virtual environment and install the requirements
python -m venv ./venv
source ./venv/bin/activate
pip install -r requirements.txt
Download the checkpoint from Drive and put it into the training folder.
Make sure the vncorenlp folder contains the .jar file and its word segmentation component. You can download them as follows:
mkdir -p vncorenlp/models/wordsegmenter
wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/VnCoreNLP-1.1.1.jar
wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/models/wordsegmenter/vi-vocab
wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/models/wordsegmenter/wordsegmenter.rdr
mv VnCoreNLP-1.1.1.jar vncorenlp/
mv vi-vocab vncorenlp/models/wordsegmenter/
mv wordsegmenter.rdr vncorenlp/models/wordsegmenter/
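Before launching the app, it can help to confirm that the files ended up where the commands above put them. The following is a minimal sketch, not part of the repo: the helper name `missing_vncorenlp_files` is hypothetical, and the file list is taken only from the download commands above.

```python
from pathlib import Path

# Relative paths inside the vncorenlp folder, as produced by the
# mkdir/wget/mv commands above.
REQUIRED_FILES = (
    "VnCoreNLP-1.1.1.jar",
    "models/wordsegmenter/vi-vocab",
    "models/wordsegmenter/wordsegmenter.rdr",
)


def missing_vncorenlp_files(root="vncorenlp"):
    """Return the required files that are absent under `root`.

    Hypothetical helper for illustration; the repo does not ship it.
    """
    root_path = Path(root)
    return [rel for rel in REQUIRED_FILES if not (root_path / rel).is_file()]


if __name__ == "__main__":
    missing = missing_vncorenlp_files()
    if missing:
        print("Missing VnCoreNLP files:", ", ".join(missing))
    else:
        print("VnCoreNLP word segmenter files are in place.")
```

Run it from the repo root; an empty result means all three files are present.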
To use the app, launch
python app.py
Both plain text and news URLs (tuoitre.vn domain only) are supported as input.
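One way to tell the two input kinds apart is sketched below. This is an illustration under stated assumptions, not the logic in app.py: the function name `classify_input` is hypothetical, and accepting subdomains such as www.tuoitre.vn is an assumption.

```python
from urllib.parse import urlparse

SUPPORTED_DOMAIN = "tuoitre.vn"  # the only news domain supported at the moment


def classify_input(user_input):
    """Return "url" for a tuoitre.vn link and "text" for plain text.

    Raises ValueError for empty input or for a URL on another domain.
    Hypothetical helper for illustration; app.py may work differently.
    """
    candidate = user_input.strip()
    if not candidate:
        raise ValueError("input is empty")
    # Anything containing whitespace is treated as running text, not a link.
    if any(ch.isspace() for ch in candidate):
        return "text"
    try:
        parsed = urlparse(candidate)
        host = (parsed.hostname or "").lower()
    except ValueError:
        # urlparse rejects some malformed strings; treat them as plain text.
        return "text"
    if parsed.scheme not in ("http", "https") or not host:
        return "text"
    # Assumption: subdomains of tuoitre.vn are accepted as well.
    if host == SUPPORTED_DOMAIN or host.endswith("." + SUPPORTED_DOMAIN):
        return "url"
    raise ValueError(f"unsupported news domain: {host}")
```

Rejecting other domains explicitly, rather than silently summarizing the URL string as text, gives the user a clearer error.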
For training, download the Vietnews dataset
wget https://github.com/ThanhChinhBK/vietnews/archive/master.zip
unzip master.zip
Make sure the dataset is preprocessed; refer to data_preprocess.py.
Train with the default config.yaml file. Edit the file to change the training configuration.
python train.py
To test
python test.py
- Nguyen, Van-Hau & Nguyen, Thanh-Chinh & Nguyen, Minh-Tien & Hoai, Nguyen. (2019). VNDS: A Vietnamese Dataset for Summarization. 375-380. 10.1109/NICS48868.2019.9023886.
- Rothe, Sascha & Narayan, Shashi & Severyn, Aliaksei. (2020). Leveraging Pre-trained Checkpoints for Sequence Generation Tasks. Transactions of the Association for Computational Linguistics. 8. 264-280. 10.1162/tacl_a_00313.
- Nguyen, Dat Quoc & Nguyen, Anh. (2020). PhoBERT: Pre-trained language models for Vietnamese. 1037-1042. 10.18653/v1/2020.findings-emnlp.92.
- ngockhanh5110, nlp-vietnamese-text-summarization, (2021), GitHub repository, https://github.com/ngockhanh5110/nlp-vietnamese-text-summarization