Scripts for preprocessing article data for use with Named-Entity Recognition (NER) tools. All modules can be run from the `src` directory using `python3 -m preprocess.[module_name] [args]`.
### `preprocess.chem_ner`

For performing chemical named-entity recognition on paragraph data in parform format.

- Runs tmChem using a process pool on a single machine
- Requires Perl to be available in `$PATH` (see `cluster/setup_tmchem.sh`)
- Run using `python3 -m preprocess.chem_ner [path_to_paragraph_documents] [output_directory] --tmchem [path/to/tmChem.pl] --logdir [path/to/save/logfile]`
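The single-machine process-pool pattern used by the NER wrappers (`chem_ner`, `disease_ner`, `gene_ner`) can be sketched roughly as below. This is an illustrative sketch, not bioshovel's actual code: `run_tool` and the file names are made up, and `echo` stands in for the external Perl/Java tool so the snippet is runnable anywhere.

```python
# Sketch of the process-pool pattern: each input document goes to one
# worker, which shells out to an external tool and captures its output.
import subprocess
from multiprocessing import Pool

def run_tool(path):
    # The real module would invoke e.g. `perl tmChem.pl ...`;
    # `echo` is a runnable stand-in for illustration only.
    result = subprocess.run(['echo', path], capture_output=True, text=True)
    return path, result.stdout.strip()

if __name__ == '__main__':
    input_files = ['doc1.txt', 'doc2.txt', 'doc3.txt']
    with Pool(processes=2) as pool:
        for path, output in pool.imap_unordered(run_tool, input_files):
            print(f'{path}: {output}')
```

`imap_unordered` is used here so results stream back as workers finish, which suits long-running per-document tool invocations.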
### `preprocess.chem_ner_cluster`

For running `preprocess.chem_ner` on a PBS cluster.

- Takes an input file list and subdivides it into chunks, creates symlinks to the original files for each chunk, and creates (and optionally submits) a PBS job file to run each chunk
- Works with huge input directory trees and uses minimal RAM, unless using the `--resume` argument
- Run `python3 -m preprocess.chem_ner_cluster -h` for help/options
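The chunking step described above can be sketched as follows. This is an illustrative sketch under assumed behavior (fixed-size chunks), not bioshovel's actual implementation; the symlink and job-file creation steps are omitted.

```python
# Illustrative: subdivide a long file list into fixed-size chunks,
# one chunk per PBS job.
def chunk(items, size):
    """Yield successive lists of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

files = [f'article_{n}.txt' for n in range(10)]
chunks = list(chunk(files, 4))
# Each chunk would get its own directory of symlinks and its own job file.
print(len(chunks))  # 3 chunks: 4 + 4 + 2 files
```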
### `preprocess.create_medline_subset`

For creating a subset of MEDLINE based on a list of PMIDs read in from a plain text file (one PMID per line).

- Run using `python3 -m preprocess.create_medline_subset -h` to see help/options
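A hypothetical example of the expected input file (the file name and PMIDs below are made up for illustration):

```shell
# A plain-text PMID list: one PMID per line, nothing else.
printf '23456789\n23456790\n23456791\n' > pmids.txt
wc -l < pmids.txt   # 3 lines, one PMID each
```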
### `preprocess.create_pubtator_subset`

For creating a subset of the PubTator `bioconcepts2pubtator_offsets` download file based on PMIDs in an input file (one PMID per line).

- Run using `python3 -m preprocess.create_pubtator_subset -h` to see help/options
- Creates a directory with PubTator annotations (abstract+offset) saved with one abstract per file
- Creates another directory with only the abstracts saved in parsed/paragraph format for use with other `bioshovel.preprocess` modules
### `preprocess.disease_ner`

For performing disease named-entity recognition on paragraph data in parform format.

- Runs DNorm using a process pool on a single machine
- Requires Java to be available in `$PATH` (tested with OpenJDK Java 7 and Oracle Java 8)
- Run using `python3 -m preprocess.disease_ner [path_to_paragraph_documents] [output_directory] --dnorm [path/to/dnorm/ApplyDNorm.sh/directory] --logdir [path/to/save/logfile]`
### `preprocess.gene_ner`

For performing gene named-entity recognition on parsed paragraph data in parform format.

- Runs GNormPlus using a process pool on a single machine
- Requires Perl to be available in `$PATH`
- Run using `python3 -m preprocess.gene_ner [path_to_paragraph_documents] [output_directory] --gnormplus [path/to/GNormPlus.pl] --logdir [path/to/save/logfile]`
### `preprocess.parse_elife_xml`

Parses eLife articles in XML format from an input directory.

- Extracts a DOI, executive summary paragraphs, abstract paragraph, and body paragraphs while removing all figure and journal citations
- Run using `python3 -m preprocess.parse_elife_xml [xml_directory]`
### `preprocess.parse_medline_xml`

Parses article abstracts out of MEDLINE XML (and optionally `.xml.gz`) files.

- Run using `python3 -m preprocess.parse_medline_xml -h` to see various options
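One common way to handle both `.xml` and `.xml.gz` inputs transparently is sketched below. `open_xml` is an illustrative helper name, not bioshovel's API; the actual module may do this differently.

```python
# Hedged sketch: open a MEDLINE file as text, decompressing .gz on the fly.
import gzip

def open_xml(path):
    """Return a text-mode file handle for a .xml or .xml.gz file."""
    if path.endswith('.gz'):
        return gzip.open(path, 'rt', encoding='utf-8')
    return open(path, encoding='utf-8')
```

Callers can then parse the returned handle identically regardless of compression.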
### `preprocess.pmc_prettyprint`

Creates a pretty-printed version of the PubMed Central (or other) XML-based corpus alongside the original corpus directory.

- Run using `python3 -m preprocess.pmc_prettyprint [pmc_xml_directory]`
### `preprocess.prep_corenlp`

Prepares a corpus in parsed/paragraph format for processing with Stanford CoreNLP.

- For use on a PBS cluster
- Run `python3 -m preprocess.prep_corenlp -h` for help and options
- Can be run using `python3 -m preprocess.prep_corenlp [path/to/paragraphs] [output/directory] --corenlp [path/to/coreNLP/installation] --submit`
### `preprocess.reformat`

Functions for reformatting article data to/from various file formats.
### `preprocess.util`

General utility/helper functions for file handling, logging, etc.
### `preprocess.cluster`

Functions and scripts for running tools on a PBS cluster.

#### `preprocess.cluster.util`

Utility functions for running on a PBS cluster (such as a wrapper for the job submission command `qsub`).
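A `qsub` wrapper along these lines is one plausible shape for such a utility. This is a sketch under assumptions: `submit_job` is a made-up name, and the command is parameterized so the example can run without a real PBS installation (`echo` stands in for `qsub` in the usage note below).

```python
# Illustrative qsub wrapper: submit a PBS job file and return the job ID
# that qsub prints to stdout.
import subprocess

def submit_job(job_file, qsub_cmd='qsub'):
    """Submit `job_file` via `qsub_cmd`; return the printed job ID."""
    result = subprocess.run([qsub_cmd, job_file],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

For example, `submit_job('run_chunk_0.pbs', qsub_cmd='echo')` exercises the wrapper without a scheduler present.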
### `setup`

Scripts for setting up local NER tools.

#### `setup/setup_tmchem.sh`

Bash script for setting up tmChem locally.