This repo contains a machine reading model built on top of document-qa, a powerful machine reading library and a recent state-of-the-art approach to open-domain question answering.
This repo enables the training and evaluation of models, and provides a standard model for Cape.
The primary purpose of this repo is to train and evaluate models that implement the
cape-machine-reader.cape_machine_reader_model.CapeMachineReaderModelInterface
interface.
This repo is not designed to be used "as is" in a production environment. The model functionality is kept deliberately minimal.
Please use cape-responder
or cape-webservices
to run models on downstream tasks.
cape-document-qa allows users to train and evaluate their own machine reading models, and supports the simultaneous training of both supervised and semi-supervised machine reading tasks, such as SQuAD and TriviaQA.
The original document-qa library can be found here: https://github.com/allenai/document-qa (we link to it as a submodule). The original publication describing how document-qa works can be found here: Simple and Effective Multi-Paragraph Reading Comprehension.
Cape-document-qa models are similar to the "shared-norm" model described in the paper above, but differ in some details; the biggest differences are that we include an ELMo language model and that we co-train models on both SQuAD and TriviaQA, retaining strong performance on both datasets.
In order to keep installs clean, and to enable functionality such as TensorFlow 1.7 support, we have made several patches to document-qa. These patches have been kept as minimal as possible, but do bear them in mind when using the codebase.
Cape-document-qa
can be used simply to load pretrained models and run them in inference mode, or
to train models from scratch or fine-tune existing models.
Training and running cape-document-qa models requires TensorFlow 1.7. You should install TensorFlow following its official documentation, especially if using GPUs with CUDA and cuDNN.
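If you want to confirm your TensorFlow setup before going further, a quick check like the following should be enough (this assumes you installed the tensorflow-gpu 1.7 build; the GPU check simply returns False on CPU-only machines):
import tensorflow as tf
print(tf.__version__)              # expect something in the 1.7.x range
print(tf.test.is_gpu_available())  # True only if CUDA/cuDNN are visible to TensorFlow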
To install as a site-package:
pip install --upgrade --process-dependency-links git+https://github.com/bloomsburyai/cape-document-qa
To use locally, and run commands from the project root (recommended for users planning on training their own models):
git clone https://github.com/bloomsburyai/cape-document-qa.git
cd cape-document-qa
git submodule init
git submodule update
export PYTHONPATH="${PYTHONPATH}:./document-qa/"
pip install --upgrade --process-dependency-links git+https://github.com/bloomsburyai/cape-machine-reader
pip install -r requirements.txt
You will also need a model. A pretrained model will be downloaded automatically when the library is first imported.
This model will be downloaded to cape_document_qa/storage/models
and contains all required data out of the box.
The disk footprint of this model is about 5GB, the majority of which is GloVe word vectors.
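As a rough sanity check that the download completed, you can inspect the storage directory from Python. This assumes the default storage location described above:
import os
import cape_document_qa  # the first import triggers the ~5GB model download

model_dir = os.path.join(os.path.dirname(cape_document_qa.__file__), 'storage', 'models')
print(os.listdir(model_dir))  # should list the bundled model files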
To check that the install was successful, you can run the tests:
pytest cape_document_qa
There is a script that allows users to evaluate datasets in the SQuAD format. This is useful for those who have some data for their domain and just want to see how the pretrained model performs. This can be done easily:
$ python3
>>> from cape_document_qa.evaluation.evaluate_benchmark import perform_benchmark_evaluation
>>> perform_benchmark_evaluation('my_dataset', ['path/to/my/dataset-v1.1.json'])
Preprocessing Squad Dataset: my_dataset
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 21.83it/s]
dev: 100%|████████████████████████████████████████| 1/1 [00:00<00:00, 51.41it/s]
Dumping file mapping
Dumping vocab mapping
Setting Up:
Had pre-trained word embeddings for 149 of 149 words
Building question/paragraph pairs...
Processing 8 chunks with 8 processes
100%|██████████████████████████████████████████| 55/55 [00:00<00:00, 365.91it/s]
Starting Eval
my_dataset: 100%|█████████████████████████████████| 1/1 [00:24<00:00, 24.70s/it]
scoring: 100%|████████████████████████████████| 55/55 [00:00<00:00, 7454.73it/s]
Exporting and Post-processing
Saving question result
Saving paragraph result
Computing scores
or equivalently from the command line:
python -m cape_document_qa.evaluation.evaluate_benchmark -n my_dataset -t 'path/to/my/dataset-v1.1.json'
This will create three output files, by default named:
- {my_dataset}_official_output.json which can be used with the official SQuAD evaluation scripts
- {my_dataset}_aggregated_output.csv which includes the F1 and EM scores as the number of paragraphs increases
- {my_dataset}_paragraph_output.csv which includes detailed information about the answers
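If you prefer to inspect the scores programmatically, a small sketch along these lines works; the exact column names in the CSVs are not documented here, so check df.columns if they differ:
import pandas as pd

df = pd.read_csv('my_dataset_aggregated_output.csv')
print(df.columns.tolist())  # check the actual column names first
print(df.head())            # F1/EM scores as the number of paragraphs increases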
Training your own models is encouraged. You can use the cape-document-qa
training scripts to train and tweak
models. If you want to define your own architecture, or even use your own codebase to train a model, this should
be achievable too; you just need to make your model implement the
cape-machine-reader.cape_machine_reader_model.CapeMachineReaderModelInterface
interface.
(see cape_document_qa.cape_docqa_machine_reader
for an example).
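As a very rough sketch of what plugging in your own model looks like (the method names and bodies below are placeholders, not the real interface; consult cape-machine-reader and cape_document_qa.cape_docqa_machine_reader for the actual signatures):
from cape_machine_reader.cape_machine_reader_model import CapeMachineReaderModelInterface

class MyMachineReaderModel(CapeMachineReaderModelInterface):
    """Placeholder skeleton: implement every abstract method the interface declares."""

    def __init__(self, model_path):
        # load your trained weights, vocabulary, etc. here
        self.model_path = model_path

    # implement the interface's tokenization and answer-scoring methods here,
    # mirroring the reference implementation in cape_docqa_machine_reader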
If you are training models, you may find it easier to do a local install. In this case, you should ensure that the docqa module within document-qa is on your PYTHONPATH.
We suggest using our model configs as a good starting point for fine-tuning your own models.
Training a model requires you to:
- download the training data and preprocess it
- run the training script
- evaluate the model
- make the model "production ready".
These steps are described below. Each can be achieved by running one or two scripts.
Before training, there is significant preprocessing that needs to be done. This process can take several hours (if preprocessing all of SQuAD and TriviaQA). By default most of the pipeline is multi-process (8 processes).
Datasets for training, and other resources (including ELMo parameters), are downloaded and handled automatically.
Downloading the training data will be triggered when running the cape_document_qa.cape_preprocess
script.
The following are downloaded:
- ELMo parameters,
- the SQuAD dataset,
- the TriviaQA Web and Wiki datasets,
- GloVe vectors
By default, we expect source data to be stored in ~/data
and preprocessed data to be
stored in {project_root}/data
. These can be changed by altering cape_document_qa/cape_config.py
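For example, an edit to cape_document_qa/cape_config.py might look roughly like this (only SQUAD_SOURCE_DIR is referenced elsewhere in this README; the other names are illustrative and may differ in the actual file):
from os.path import expanduser, join

DATA_DIR = expanduser('~/my_data')          # where raw source datasets are downloaded to
SQUAD_SOURCE_DIR = join(DATA_DIR, 'squad')  # where the squad-*-v1.1.json files live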
Preprocessing can be run by:
python -m cape_document_qa.cape_preprocess --dataset_dict path/to/datasets_dict.json
You'll need to define a dataset_dict.json
. This simply tells the preprocessing what data should
go into what data fold. The default dataset_dict
is shown below:
{
"triviaqa_datasets": ["wiki", "web"],
"squad_datasets": {
"squad": {
"train": ["squad-train-v1.1.json"],
"dev": ["squad-dev-v1.1.json"],
"test": ["squad-dev-v1.1.json"],
}
}
}
(you can run without specifying a dataset_dict
, which will preprocess TriviaQA Wiki, TriviaQA Web and SQuAD)
Preprocessing will perform the following steps (in order):
- Tokenize SQuAD documents
- Tokenize TriviaQA documents
- Tokenize questions, build supporting objects, and pickle them
- Create ELMo token embeddings for the whole dataset's vocab
Adding your own datasets should be straightforward. Save them using the SQuAD v1.1 format, place them
in the same location as the SQuAD json files (specified by cape_document_qa.cape_config.SQUAD_SOURCE_DIR)
and then preprocess
them as you did for SQuAD, by running cape_document_qa.cape_preprocess
with an updated dataset_dict.json
, e.g.:
{
"triviaqa_datasets": ["wiki", "web"],
"squad_datasets":{
"squad": {
"train": ["squad-train-v1.1.json"],
"dev": ["squad-dev-v1.1.json"],
"test": ["squad-dev-v1.1.json"],
},
"my_dataset" : {
"train": ["my_dataset-train-v1.1.json"]
"dev": ["my_dataset-dev-v1.1.json"],
"test": ["my_dataset-test-v1.1.json"],
}
}
}
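If your data is not already in the SQuAD v1.1 layout, a minimal conversion sketch looks like this (the record below is a placeholder; answer_start is the character offset of the answer within the context):
import json

squad_style = {
    'version': '1.1',
    'data': [{
        'title': 'my_document',
        'paragraphs': [{
            'context': "Cape Town is a port city on South Africa's southwest coast.",
            'qas': [{
                'id': 'my_dataset_q1',
                'question': 'Where is Cape Town?',
                'answers': [{'text': "South Africa's southwest coast",
                             'answer_start': 28}],
            }],
        }],
    }],
}

with open('my_dataset-train-v1.1.json', 'w') as f:
    json.dump(squad_style, f)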
Once all the data has been preprocessed, you can train a model. This requires a GPU; we recommend at least 12GB of GPU memory. Training can be slow, especially when co-training on TriviaQA and SQuAD (over 50 hours to converge on some slower hardware), so be aware. SQuAD-only models will train much faster (10-15 hours).
To train a model, use cape_document_qa.training.cape_ablate.
To specify which datasets to train your model on, you can define a dataset_sampling_dict. E.g.
to train on TriviaQA Web, TriviaQA Wiki, SQuAD and a 2x oversampling of my_dataset
, the
dataset_sampling_dict.json
would look like this:
{
"wiki": 1,
"web": 1,
"squad": 1,
"my_dataset": 2,
}
Training a model can then be done using:
python -m cape_document_qa.training.cape_ablate name_of_my_model --dataset_sampling path/to/my_dataset_sampling.json --cudnn
After some preparatory preprocessing and loading (sometimes up to 1 hour if training on a lot of data), the model will start to train. It will create a model directory, and you can track training progress by pointing TensorBoard at the logs subdirectory of the model directory:
tensorboard --logdir /path/to/my/model/logs
Sometimes runs break down, or you may want to try to fine tune a pretrained model with new data.
You can resume training a model using cape_document_qa.training.cape_resume_training.py:
python -m cape_document_qa.training.cape_resume_training path/to/my/model --dataset_sampling path/to/my_dataset_sampling.json
When a model has finished training, there are two evaluation scripts. One uses document-qa's own evaluation pipeline and can be called like:
python -m cape_document_qa.evaluation.cape_docqa_eval path/to/my/model \
--paragraph_output path/to/paragraph_output.csv \
--aggregated_output path/to/aggregated_output.csv \
--official_output path/to/official_output.json \
--datasets squad wiki web my_dataset
This script will produce three files: one with paragraph-level answers, one with EM and F1 scores for each dataset, and an "official" json format like that used for SQuAD model evaluation.
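Since the official output is intended for the SQuAD evaluation scripts, it should be a json mapping from question id to predicted answer string, so a quick peek is straightforward (the path below is whatever you passed to --official_output):
import json

with open('path/to/official_output.json') as f:
    predictions = json.load(f)
print(len(predictions), 'questions answered')
print(list(predictions.items())[:3])  # (question id, predicted answer) pairs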
The other evaluation uses Cape's answer generator, which enables faster generation of the top k answers, as well as several heuristics that are useful for the user experience.
This can be called like:
python -m cape_document_qa.evaluation.cape_multidoc_eval path/to/my/model -k 5 --datasets squad wiki web my_dataset
which will produce a file for each dataset with the top k answers for each question.
Once you have trained a model and you are happy with the results, you can gather the resources and slim down the model files using
python -m cape_document_qa.cape_productionize_model --target_model path/to/my/trained_model --output_dir path/to/my/output_model
This will also convert the RNNs to be CPU-compatible, so they can be run on systems without NVIDIA GPUs. This
production-ready model can now be loaded by cape_document_qa.cape_docqa_machine_reader
and used
by the rest of the stack.
The entire training procedure could be achieved using something like the following:
# define the datasets to prepro
echo '{
"triviaqa_datasets": ["wiki", "web"],
"squad_datasets":{
"squad": {
"train": ["squad-train-v1.1.json"],
"dev": ["squad-dev-v1.1.json"],
"test": ["squad-dev-v1.1.json"],
},
"my_dataset" : {
"train": ["my_dataset-train-v1.1.json"]
"dev": ["my_dataset-dev-v1.1.json"],
"test": ["my_dataset-test-v1.1.json"],
}
}
}' > datasets_to_prepro.json
# define the equivalences of each dataset to train on:
echo '{
"wiki": 1,
"web": 1,
"squad": 1,
"my_dataset": 2,
}' > dataset_sampling.json
# Preprocess for a few hours
python -m cape_document_qa.cape_preprocess --dataset_dict datasets_to_prepro.json
# train for many hours
python -m cape_document_qa.training.cape_ablate my_model --dataset_sampling dataset_sampling.json --cudnn
# evaluate how well the model performs on my_dataset using document_qa
python -m cape_document_qa.evaluation.cape_docqa_eval my_model \
--paragraph_output my_model_paragraph_output.csv \
--aggregated_output my_model_aggregated_output.csv \
--official_output my_model_official_output.json \
--datasets my_dataset
# evaluate how well Cape's multiple answering method performs on my_dataset:
python -m cape_document_qa.evaluation.cape_multidoc_eval my_model -k 5 --datasets my_dataset
# if all has gone well, productionize my model
python -m cape_document_qa.cape_productionize_model --target_model my_model --output_dir my_production_model
We have an experimental Horovod implementation that will allow you to scale up your training to several GPUs or even several nodes.
This is not thoroughly tested, but can be used to train models using cape_document_qa.training.cape_ablate_horovod.
This script is analogous to cape_document_qa.training.cape_ablate.
You should run it using OpenMPI (which we assume you already have installed).
E.g. to run on 4 GPUs on the local machine:
mpirun -np 4 \
-H localhost:4 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
python -m cape_document_qa.training.cape_ablate_horovod horovod --n_processes 4