This project provides pre-trained XLNet models for Chinese, aiming to enrich Chinese natural language processing resources and broaden the choice of Chinese pre-trained models. We welcome researchers and practitioners to download and use these models.
This project is based on the official CMU/Google XLNet: https://github.com/zihangdai/xlnet
Chinese LERT | Chinese/English PERT | Chinese MacBERT | Chinese ELECTRA | Chinese XLNet | Chinese BERT | TextBrewer | TextPruner
More resources by HFL: https://github.com/ymcui/HFL-Anthology
2023/3/28 We open-sourced the Chinese LLaMA & Alpaca LLMs, which can be quickly deployed on a PC. Check: https://github.com/ymcui/Chinese-LLaMA-Alpaca
2022/10/29 We release a new pre-trained model called LERT, check https://github.com/ymcui/LERT/
2022/3/30 We release a new pre-trained model called PERT, check https://github.com/ymcui/PERT
2021/12/17 We release a model pruning toolkit - TextPruner, check https://github.com/airaria/TextPruner
2021/1/27 All models now support TensorFlow 2. Please use the transformers library to access them, or download them from https://huggingface.co/hfl
2020/9/15 Our paper "Revisiting Pre-Trained Models for Chinese Natural Language Processing" is accepted to Findings of EMNLP as a long paper.
2020/8/27 We are happy to announce that our model is on top of the GLUE benchmark; see the leaderboard.
Past News
2020/2/26 We release a knowledge distillation toolkit [TextBrewer](https://github.com/airaria/TextBrewer)
2019/12/19 The models in this repository can now be easily accessed through Huggingface-Transformers; check Quick Load
2019/9/5 XLNet-base has been released. Check Download
2019/8/19 We provide the pre-trained Chinese XLNet-mid model, which was trained on large-scale data. Check Download
Section | Description |
---|---|
Download | Download links for Chinese XLNet |
Baselines | Baseline results for several Chinese NLP datasets (partial) |
Pre-training Details | Details for pre-training |
Fine-tuning Details | Details for fine-tuning |
FAQ | Frequently Asked Questions |
Citation | Citation |
XLNet-mid: 24-layer, 768-hidden, 12-heads, 209M parameters
XLNet-base: 12-layer, 768-hidden, 12-heads, 117M parameters
Model | Data | Google Drive | Baidu Disk |
---|---|---|---|
XLNet-mid, Chinese | Wikipedia+Extended data[1] | TensorFlow PyTorch | TensorFlow(pw:2jv2) |
XLNet-base, Chinese | Wikipedia+Extended data[1] | TensorFlow PyTorch | TensorFlow(pw:ge7w) |
[1] Extended data includes baike (encyclopedia), news, and QA data, with 5.4B words in total, exactly the same data used for BERT-wwm-ext.
If you need these models in PyTorch, you can either:
- Convert the TensorFlow checkpoint into PyTorch using 🤗Transformers, or
- Download them from https://huggingface.co/hfl (a programmatic download sketch is shown below)
Steps for manual download: select one of the models on the page above → click "list all files in model" at the end of the model page → download the bin/json files from the pop-up window.
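If you prefer to script the download, here is a minimal sketch using the huggingface_hub package (our assumption of one convenient route; the manual steps above work just as well):

from huggingface_hub import snapshot_download

# Download all files of the chosen model into a local cache directory
local_dir = snapshot_download(repo_id="hfl/chinese-xlnet-mid")
print(local_dir)  # directory containing the model and tokenizer files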
The whole ZIP package takes roughly 800MB for the XLNet-mid model and includes the following files:
chinese_xlnet_mid_L-24_H-768_A-12.zip
|- xlnet_model.ckpt # Model Weights
|- xlnet_model.meta # Meta info
|- xlnet_model.index # Index info
|- xlnet_config.json # Config file
|- spiece.model # Vocabulary
With Huggingface-Transformers, the models above can be easily loaded with the following code.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("MODEL_NAME")
model = AutoModel.from_pretrained("MODEL_NAME")
The actual models and their MODEL_NAME values are listed below.
Original Model | MODEL_NAME |
---|---|
XLNet-mid | hfl/chinese-xlnet-mid |
XLNet-base | hfl/chinese-xlnet-base |
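For example, a minimal usage sketch (assuming the transformers and torch packages are installed; the example sentence is arbitrary) that loads XLNet-base and extracts the final hidden states:

import torch
from transformers import AutoTokenizer, AutoModel

# Load the Chinese XLNet-base tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-xlnet-base")
model = AutoModel.from_pretrained("hfl/chinese-xlnet-base")

# Encode a sample sentence and run a forward pass
inputs = tokenizer("哈工大讯飞联合实验室发布中文XLNet预训练模型", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)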
We conduct experiments on several Chinese NLP datasets and compare the performance of BERT, BERT-wwm, BERT-wwm-ext, XLNet-base, and XLNet-mid. The results for BERT/BERT-wwm/BERT-wwm-ext were taken from Chinese BERT-wwm.
Note: To ensure the stability of the results, we run each experiment 10 times and report both the maximum and the average scores; average scores are shown in brackets and maximum scores outside the brackets.
The CMRC 2018 dataset is released by the Joint Laboratory of HIT and iFLYTEK Research. The model should answer questions based on the given passage, in the same format as SQuAD. Evaluation Metrics: EM / F1
Model | Development | Test | Challenge |
---|---|---|---|
BERT | 65.5 (64.4) / 84.5 (84.0) | 70.0 (68.7) / 87.0 (86.3) | 18.6 (17.0) / 43.3 (41.3) |
BERT-wwm | 66.3 (65.0) / 85.6 (84.7) | 70.5 (69.1) / 87.4 (86.7) | 21.0 (19.3) / 47.0 (43.9) |
BERT-wwm-ext | 67.1 (65.6) / 85.7 (85.0) | 71.4 (70.0) / 87.7 (87.0) | 24.0 (20.0) / 47.3 (44.6) |
XLNet-base | 65.2 (63.0) / 86.9 (85.9) | 67.0 (65.8) / 87.2 (86.8) | 25.0 (22.7) / 51.3 (49.5) |
XLNet-mid | 66.8 (66.3) / 88.4 (88.1) | 69.3 (68.5) / 89.2 (88.8) | 29.1 (27.1) / 55.8 (54.9) |
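For reference, the EM / F1 metrics used above can be approximated with the simplified sketch below (the official CMRC 2018 evaluation script additionally handles multiple reference answers and punctuation/whitespace normalization):

from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    # EM: 1.0 if the predicted span equals the reference exactly, else 0.0
    return float(prediction == reference)

def char_f1(prediction: str, reference: str) -> float:
    # Character-level F1 between the predicted and reference answer spans
    common = Counter(prediction) & Counter(reference)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(prediction)
    recall = num_same / len(reference)
    return 2 * precision * recall / (precision + recall)

print(exact_match("北京大学", "北京大学"))  # 1.0
print(char_f1("北京大学", "北京"))          # ≈ 0.67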
DRCD is also a span-extraction machine reading comprehension dataset, released by Delta Research Center. The text is written in Traditional Chinese. Evaluation Metrics: EM / F1
Model | Development | Test |
---|---|---|
BERT | 83.1 (82.7) / 89.9 (89.6) | 82.2 (81.6) / 89.2 (88.8) |
BERT-wwm | 84.3 (83.4) / 90.5 (90.2) | 82.8 (81.8) / 89.7 (89.0) |
BERT-wwm-ext | 85.0 (84.5) / 91.2 (90.9) | 83.6 (83.0) / 90.4 (89.9) |
XLNet-base | 83.8 (83.2) / 92.3 (92.0) | 83.5 (82.8) / 92.2 (91.8) |
XLNet-mid | 85.3 (84.9) / 93.5 (93.3) | 85.5 (84.8) / 93.6 (93.2) |
We use the ChnSentiCorp dataset for sentiment classification, which is a binary classification task. Evaluation Metrics: Accuracy
Model | Development | Test |
---|---|---|
BERT | 94.7 (94.3) | 95.0 (94.7) |
BERT-wwm | 95.1 (94.5) | 95.4 (95.0) |
XLNet-base | ||
XLNet-mid | 95.8 (95.2) | 95.4 (94.9) |
We take XLNet-mid as an example to demonstrate the pre-training details.
Following the official XLNet tutorial, we first generate a vocabulary using SentencePiece. In this project, we use a vocabulary of 32,000 tokens. The rest of the parameters are identical to the default settings.
spm_train \
--input=wiki.zh.txt \
--model_prefix=sp10m.cased.v3 \
--vocab_size=32000 \
--character_coverage=0.99995 \
--model_type=unigram \
--control_symbols=\<cls\>,\<sep\>,\<pad\>,\<mask\>,\<eod\> \
--user_defined_symbols=\<eop\>,.,\(,\),\",-,–,£,€ \
--shuffle_input_sentence \
--input_sentence_size=10000000
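After training, the resulting model (sp10m.cased.v3.model, from the prefix above) can be sanity-checked from Python; a minimal sketch assuming the sentencepiece package is installed:

import sentencepiece as spm

# Load the model produced by spm_train (model_prefix=sp10m.cased.v3)
sp = spm.SentencePieceProcessor()
sp.Load("sp10m.cased.v3.model")

print(sp.GetPieceSize())                 # vocabulary size, should be 32000
print(sp.EncodeAsPieces("自然语言处理"))   # subword pieces
print(sp.EncodeAsIds("自然语言处理"))      # corresponding vocabulary ids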
We use raw text files to generate tf_records.
SAVE_DIR=./output_b32
INPUT=./data/*.proc.txt
python data_utils.py \
--bsz_per_host=32 \
--num_core_per_host=8 \
--seq_len=512 \
--reuse_len=256 \
--input_glob=${INPUT} \
--save_dir=${SAVE_DIR} \
--num_passes=20 \
--bi_data=True \
--sp_path=spiece.model \
--mask_alpha=6 \
--mask_beta=1 \
--num_predict=85 \
--uncased=False \
--num_task=10 \
--task=1
Now we can pre-train our Chinese XLNet.
Note that XLNet-mid is so named because it only increases the number of Transformer layers (from 12 to 24) compared with XLNet-base, keeping the other hyperparameters unchanged.
DATA=YOUR_GS_BUCKET_PATH_TO_TFRECORDS
MODEL_DIR=YOUR_OUTPUT_MODEL_PATH
TPU_NAME=v3-xlnet
TPU_ZONE=us-central1-b
python train.py \
--record_info_dir=$DATA \
--model_dir=$MODEL_DIR \
--train_batch_size=32 \
--seq_len=512 \
--reuse_len=256 \
--mem_len=384 \
--perm_size=256 \
--n_layer=24 \
--d_model=768 \
--d_embed=768 \
--n_head=12 \
--d_head=64 \
--d_inner=3072 \
--untie_r=True \
--mask_alpha=6 \
--mask_beta=1 \
--num_predict=85 \
--uncased=False \
--train_steps=2000000 \
--save_steps=20000 \
--warmup_steps=20000 \
--max_save=20 \
--weight_decay=0.01 \
--adam_epsilon=1e-6 \
--learning_rate=1e-4 \
--dropout=0.1 \
--dropatt=0.1 \
--tpu=$TPU_NAME \
--tpu_zone=$TPU_ZONE \
--use_tpu=True
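As a rough sanity check, the parameter counts quoted earlier (~209M for XLNet-mid, ~117M for XLNet-base) can be approximately reproduced from these hyperparameters; the back-of-envelope sketch below ignores biases, layer norms, and the relative-attention bias terms:

def approx_xlnet_params(n_layer, d_model=768, n_head=12, d_head=64, d_inner=3072, vocab_size=32000):
    # Token embedding plus, per layer, the attention projections (q, k, v, r, o)
    # and the two feed-forward weight matrices
    embedding = vocab_size * d_model
    attention = 5 * d_model * n_head * d_head
    ffn = 2 * d_model * d_inner
    return embedding + n_layer * (attention + ffn)

print(f"XLNet-mid : ~{approx_xlnet_params(24) / 1e6:.0f}M parameters")  # ~209M
print(f"XLNet-base: ~{approx_xlnet_params(12) / 1e6:.0f}M parameters")  # ~117M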
We use Google Cloud TPU v2 (64G HBM) for fine-tuning.
For reading comprehension tasks, we first need to generate tf_records data. Please refer to the official XLNet tutorial on SQuAD 2.0.
XLNET_DIR=YOUR_GS_BUCKET_PATH_TO_XLNET
MODEL_DIR=YOUR_OUTPUT_MODEL_PATH
DATA_DIR=YOUR_DATA_DIR_TO_TFRECORDS
RAW_DIR=YOUR_RAW_DATA_DIR
TPU_NAME=v2-xlnet
TPU_ZONE=us-central1-b
python -u run_cmrc_drcd.py \
--spiece_model_file=./spiece.model \
--model_config_path=${XLNET_DIR}/xlnet_config.json \
--init_checkpoint=${XLNET_DIR}/xlnet_model.ckpt \
--tpu_zone=${TPU_ZONE} \
--use_tpu=True \
--tpu=${TPU_NAME} \
--num_hosts=1 \
--num_core_per_host=8 \
--output_dir=${DATA_DIR} \
--model_dir=${MODEL_DIR} \
--predict_dir=${MODEL_DIR}/eval \
--train_file=${DATA_DIR}/cmrc2018_train.json \
--predict_file=${DATA_DIR}/cmrc2018_dev.json \
--uncased=False \
--max_answer_length=40 \
--max_seq_length=512 \
--do_train=True \
--train_batch_size=16 \
--do_predict=True \
--predict_batch_size=16 \
--learning_rate=3e-5 \
--adam_epsilon=1e-6 \
--iterations=1000 \
--save_steps=2000 \
--train_steps=2400 \
--warmup_steps=240
XLNET_DIR=YOUR_GS_BUCKET_PATH_TO_XLNET
MODEL_DIR=YOUR_OUTPUT_MODEL_PATH
DATA_DIR=YOUR_DATA_DIR_TO_TFRECORDS
RAW_DIR=YOUR_RAW_DATA_DIR
TPU_NAME=v2-xlnet
TPU_ZONE=us-central1-b
python -u run_cmrc_drcd.py \
--spiece_model_file=./spiece.model \
--model_config_path=${XLNET_DIR}/xlnet_config.json \
--init_checkpoint=${XLNET_DIR}/xlnet_model.ckpt \
--tpu_zone=${TPU_ZONE} \
--use_tpu=True \
--tpu=${TPU_NAME} \
--num_hosts=1 \
--num_core_per_host=8 \
--output_dir=${DATA_DIR} \
--model_dir=${MODEL_DIR} \
--predict_dir=${MODEL_DIR}/eval \
--train_file=${DATA_DIR}/DRCD_training.json \
--predict_file=${DATA_DIR}/DRCD_dev.json \
--uncased=False \
--max_answer_length=30 \
--max_seq_length=512 \
--do_train=True \
--train_batch_size=16 \
--do_predict=True \
--predict_batch_size=16 \
--learning_rate=3e-5 \
--adam_epsilon=1e-6 \
--iterations=1000 \
--save_steps=2000 \
--train_steps=3600 \
--warmup_steps=360
Unlike the reading comprehension tasks, we do not need to generate tf_records in advance for classification.
XLNET_DIR=YOUR_GS_BUCKET_PATH_TO_XLNET
MODEL_DIR=YOUR_OUTPUT_MODEL_PATH
DATA_DIR=YOUR_DATA_DIR_TO_TFRECORDS
RAW_DIR=YOUR_RAW_DATA_DIR
TPU_NAME=v2-xlnet
TPU_ZONE=us-central1-b
python -u run_classifier.py \
--spiece_model_file=./spiece.model \
--model_config_path=${XLNET_DIR}/xlnet_config.json \
--init_checkpoint=${XLNET_DIR}/xlnet_model.ckpt \
--task_name=csc \
--do_train=True \
--do_eval=True \
--eval_all_ckpt=False \
--uncased=False \
--data_dir=${RAW_DIR} \
--output_dir=${DATA_DIR} \
--model_dir=${MODEL_DIR} \
--train_batch_size=48 \
--eval_batch_size=48 \
--num_hosts=1 \
--num_core_per_host=8 \
--num_train_epochs=3 \
--max_seq_length=256 \
--learning_rate=3e-5 \
--save_steps=5000 \
--use_tpu=True \
--tpu=${TPU_NAME} \
--tpu_zone=${TPU_ZONE}
Q: Will you release larger data?
A: It depends.
Q: What if I get bad results on some datasets?
A: Please use other pre-trained models or continue pre-training on your own data.
Q: Will you publish the data used in pre-training?
A: Nope, copyright is the biggest concern.
Q: How long did it take to train XLNet-mid?
A: We used a Cloud TPU v3 (128G HBM) to train for 2M steps with a batch size of 32, which took roughly three weeks.
Q: Does XLNet perform better than BERT most of the time?
A: It seems so. At least on the tasks above, XLNet is substantially better than the BERT variants.
If you find the technical report or the resources in this repository useful, please cite the following paper: https://www.aclweb.org/anthology/2020.findings-emnlp.58
@inproceedings{cui-etal-2020-revisiting,
title = "Revisiting Pre-Trained Models for {C}hinese Natural Language Processing",
author = "Cui, Yiming and
Che, Wanxiang and
Liu, Ting and
Qin, Bing and
Wang, Shijin and
Hu, Guoping",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.findings-emnlp.58",
pages = "657--668",
}
Authors: Yiming Cui (Joint Laboratory of HIT and iFLYTEK Research, HFL), Wanxiang Che (Harbin Institute of Technology), Ting Liu (Harbin Institute of Technology), Shijin Wang (iFLYTEK), Guoping Hu (iFLYTEK)
This project is supported by the Google TensorFlow Research Cloud (TFRC) Program.
We also referred to the following repositories:
- XLNet: https://github.com/zihangdai/xlnet
- Malaya: https://github.com/huseinzol05/Malaya/tree/master/xlnet
- Korean XLNet: https://github.com/yeontaek/XLNET-Korean-Model
This is NOT a project by the official XLNet team, nor is it an official product of HIT or iFLYTEK.
The experiments only represent empirical results under certain conditions and should not be regarded as the inherent nature of the respective models. The results may vary with different random seeds, computing devices, etc.
The contents of this repository are for academic research purposes, and we do not provide any conclusive remarks. Users are free to use anything in this repository within the scope of the Apache-2.0 license. However, we are not responsible for any direct or indirect losses caused by using the content of this project.
If there is any problem, please submit a GitHub Issue.