A template project for those wanting to create an index locorum for their publications.
See INSTALL.md.
This template comes with batteries included, but you will need to adapt the configuration slightly. The project configuration file is config/project.ini.
Make sure you change the path for the following settings:
preproc.treetagger_home: the path to TreeTagger
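Before running anything, it can help to confirm that the path you set actually exists. The following is a minimal sketch, not part of the template: the helper names `ini_value` and `check_treetagger_home` are hypothetical, and it assumes the setting is stored as a plain `treetagger_home = /some/path` line in config/project.ini.

```shell
# Print the value of a key from an INI-style file (first match only).
# Assumption: plain "key = value" lines; section headers are ignored.
ini_value() {
    local key="$1"
    local file="${2:-config/project.ini}"
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "${file}" | head -n 1
}

# Succeed only if treetagger_home points to an existing directory.
check_treetagger_home() {
    local path
    path="$(ini_value treetagger_home "${1:-config/project.ini}")"
    if [ -d "${path}" ]; then
        echo "treetagger_home OK: ${path}"
    else
        echo "treetagger_home is not a directory: ${path}" >&2
        return 1
    fi
}
```

After sourcing these functions, call `check_treetagger_home config/project.ini` from the project root.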
This template project comes with a short example document, i.e. Bryn Mawr Classical Review 2013-01-10.
The documents to be processed need to be placed in the sub-folder orig within your working directory.
In this example project general.working_dir = ./data, so the input files are placed in ./data/orig/. The script will then create further subfolders to store temporary or intermediate files.
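To process your own documents, copy them into the orig sub-folder of the working directory. A minimal sketch follows; the function name `stage_inputs` and the source folder are hypothetical, and it assumes your inputs are plain-text files with a .txt extension.

```shell
# Copy input documents into <working_dir>/orig, creating the folder if needed.
# $1: folder holding your documents (hypothetical, adjust to your setup)
# $2: working directory, defaults to ./data as in the example configuration
stage_inputs() {
    local source_dir="$1"
    local working_dir="${2:-./data}"
    mkdir -p "${working_dir}/orig"
    cp "${source_dir}"/*.txt "${working_dir}/orig/"
}
```

For example, `stage_inputs ./my_documents ./data` would place the files in ./data/orig/.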
When you install the CitationExtractor (version >= 1.7.0), the bash command citedloci-pipeline will be automatically installed on your system, which allows you to run the pipeline.
For a detailed explanation of each pipeline step, please refer to the Jupyter notebook step-by-step.ipynb.
citedloci-pipeline do preproc --config=config/project.ini
At this point you should have a tokenized and PoS-tagged file at data/iob/bmcr_2013-01-10.txt (if you've kept the default project settings).
Try:
cat data/iob/bmcr_2013-01-10.txt
citedloci-pipeline do ner --config=config/project.ini
At this point you should have a JSON file with entities annotated at data/json/bmcr_2013-01-10.json.
Try:
# requires jq, see https://stedolan.github.io/jq/download/
cat data/json/bmcr_2013-01-10.json|jq ".entities"
citedloci-pipeline do relex --config=config/project.ini
At this point you should have a JSON file with relations annotated at data/json/bmcr_2013-01-10.json (this overwrites the file from the previous step).
Try:
# requires jq, see https://stedolan.github.io/jq/download/
cat data/json/bmcr_2013-01-10.json|jq ".relations"
citedloci-pipeline do ned --config=config/project.ini
At this point you should have a JSON file with entities disambiguated at data/json/bmcr_2013-01-10.json (this overwrites the file from the previous step).
Try:
# requires jq, see https://stedolan.github.io/jq/download/
cat data/json/bmcr_2013-01-10.json|jq ".entities"
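Once each step works on its own, the four commands above can be chained. The sketch below is only a convenience wrapper around them: the function name `run_all_steps` and the `PIPELINE` variable are hypothetical, and it assumes the steps must run in the order shown in this README (preproc, ner, relex, ned).

```shell
# Run the four pipeline steps in order, stopping at the first failure.
# $1: path to the project configuration, defaults to config/project.ini
# PIPELINE: command to invoke, defaults to citedloci-pipeline
run_all_steps() {
    local config="${1:-config/project.ini}"
    local pipeline="${PIPELINE:-citedloci-pipeline}"
    local step
    for step in preproc ner relex ned; do
        echo "Running step: ${step}"
        "${pipeline}" do "${step}" --config="${config}" || return 1
    done
}
```

Call it as `run_all_steps config/project.ini` from the project root. Because later steps overwrite the JSON file, inspect intermediate output by running the steps one at a time instead.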