The official implementation of the paper "RASAT: Integrating Relational Structures into Pretrained Seq2Seq Model for Text-to-SQL"(EMNLP 2022)
This is the official implementation of the following paper:
Jiexing Qi and Jingyao Tang and Ziwei He and Xiangpeng Wan and Yu Cheng and Chenghu Zhou and Xinbing Wang and Quanshi Zhang and Zhouhan Lin. RASAT: Integrating Relational Structures into Pretrained Seq2Seq Model for Text-to-SQL. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP).
If you use this code, please cite:
@article{Qi2022RASATIR,
title={RASAT: Integrating Relational Structures into Pretrained Seq2Seq Model for Text-to-SQL},
author={Jiexing Qi and Jingyao Tang and Ziwei He and Xiangpeng Wan and Yu Cheng and Chenghu Zhou and Xinbing Wang and Quanshi Zhang and Zhouhan Lin},
journal={ArXiv},
year={2022},
volume={abs/2205.06983}
}
This repository uses git submodules. Clone it like this:
$ git clone https://github.com/JiexingQi/RASAT.git
$ cd RASAT
$ git submodule update --init --recursive
Before running the code, you should download dataset files.
First, you should create a dictionary like this:
mkdir -p dataset_files/ori_dataset
And then you need to download the dataset file to dataset_files/ and just keep it in zip format. The download links are here:
Then unzip those dataset files into dataset_files/ori_dataset. Both files in zip format and unzip format is needed:
unzip dataset_files/spider.zip -d dataset_files/ori_dataset/
unzip dataset_files/cosql_dataset.zip -d dataset_files/ori_dataset/
unzip dataset_files/sparc.zip -d dataset_files/ori_dataset/
We recommend you just use the generated coreference resolution files. It just needs you run
unzip preprocessed_dataset.zip -d ./dataset_files/
If you want to generate these coreference resolution files by yourself, you could create a new conda environment to install coreferee library since it may have a version conflict with other libraries. The install commands are as follows:
conda create -n coreferee python=3.9.7
conda activate coreferee
bash run_corefer_processing.sh
and you can just assign the dataset name and the corresponding split, such as
python3 get_coref.py --input_path ./cosql_dataset/sql_state_tracking/cosql_dev.json --output_path ./dev_coref.json --dataset_name cosql --mode dev
The best performance is achieved by exploiting PICARD[1], and if you want to reproduce it, we recommend you use Docker.
You can simply use
make eval
to start a new docker container for an interaction terminal that supports PICARD.
Since the docker environment doesn't have stanza, so you should run these commands before training or evaluting:
pip install stanza
python3 seq2seq/stanza_downloader.py
Note:We only use PICARD for seperately evalutaion.
If Docker is not available to you, you could also run it in a python 3.9.7 environment
conda create -n rasat python=3.9.7
conda activate rasat
pip3 install torch==1.8.2 torchvision==0.9.2 torchaudio==0.8.2 --extra-index-url https://download.pytorch.org/whl/lts/1.8/cu111
pip install -r requirements.txt
However, you could not use PICARD in that way.
**Please Note: the version of stanza must keep 1.3.0, other versions will lead to error. **
You can simply run these code like this:
- Single-GPU
CUDA_VISIBLE_DEVICES="0" python3 seq2seq/run_seq2seq.py configs/sparc/train_sparc_rasat_small.json
- Multi-GPU
CUDA_VISIBLE_DEVICES="2,3" python3 -m torch.distributed.launch --nnodes=1 --nproc_per_node=2 seq2seq/run_seq2seq.py configs/sparc/train_sparc_rasat_small.json
and you should set --nproc_per_node=#gpus to make full use of all GPUs. A recommend total_batch_size = #gpus * gradient_accumulation_steps * per_device_train_batch_size is 2048.
You can simply run these codes:
CUDA_VISIBLE_DEVICES="2" python3 seq2seq/eval_run_seq2seq.py configs/cosql/eval_cosql_rasat_576.json
Notice:If you use Docker for evaluation, you may need to change the filemode for these dictionary before starting a new docker container:
chmod -R 777 seq2seq/
chmod -R 777 dataset_files/
The models shown below use database content, and the corresponding column like "edge_type", and "use_coref" are parameters set in config.json. All these model checkpoints are available in Huggingface.
model | edge_type | use_dependency | use_coref | QEM/IEM(Dev) | QEX/IEX(Dev) | QEM/IEM(Test) | QEX/IEX(Test) |
---|---|---|---|---|---|---|---|
Jiexing/cosql_add_coref_t5_3b_order_0519_ckpt-576 | Default | FALSE | TRUE | 56.1/25.9 | 63.2/34.1 | - | - |
+ PICARD | Default | FALSE | TRUE | 58.6/27.0 | 67.0/39.6 | 53.6/24.1 | 64.9/34.3 |
Jiexing/cosql_add_coref_t5_3b_order_0519_ckpt-2624 | Default | FALSE | TRUE | 56.4/25.6 | 63.1/34.8 | - | - |
+ PICARD | Default | FALSE | TRUE | 57.9/26.3 | 66.1/38.6 | 55.7/26.5 | 66.3/37.4 |
model | edge_type | use_dependency | use_coref | QEM/IEM(Dev) | QEX/IEX(Dev) | QEM/IEM(Test) | QEX/IEX(Test) |
---|---|---|---|---|---|---|---|
Jiexing/sparc_add_coref_t5_3b_order_0514_ckpt-4224 | Default | FALSE | TRUE | 65.0/45.5 | 69.9/50.7 | - | - |
+ PICARD | Default | FALSE | TRUE | 67.5/46.9 | 73.2/53.8 | 67.7/44.9 | 74.0/52.6 |
Jiexing/sparc_add_coref_t5_3b_order_0514_ckpt-5696 | Default | FALSE | TRUE | 63.7/47.4 | 68.1/50.2 | - | - |
+ PICARD | Default | FALSE | TRUE | 67.1/49.3 | 72.5/53.6 | 67.3/45.2 | 73.6/52.6 |
model | edge_type | use_dependency | use_coref | EM(Dev) | EX(Dev) | EM(Test) | EX(Test) |
---|---|---|---|---|---|---|---|
Jiexing/spider_relation_t5_3b-2624 | Default | FALSE | FALSE | 72 | 76.6 | - | - |
+ PICARD | Default | FALSE | FALSE | 74.7 | 80.5 | 70.6 | 75.5 |
Jiexing/spider_relation_t5_3b-4160 | Default | FALSE | FALSE | 72.6 | 76.6 | - | - |
+ PICARD | Default | FALSE | FALSE | 75.3 | 78.3 | 70.9 | 74.5 |
We would like to thank Tao Yu, Hongjin Su, and Yusen Zhang for running evaluations on our submitted models. We would also like to thank Lyuwen Wu for her comments on the Readme file of our code repository.