NLP in Human Rights Research - Extracting Knowledge Graphs About Police and Army Units and Their Commanders
This repository hosts the code of an NLP system developed during a research collaboration between Security Force Monitor, a project of the Human Rights Institute at Columbia Law School; Dr Daniel Bauer of the Computer Science Department at Columbia University; and Yueen Ma, a postgraduate student also at Columbia University.
Our resulting working paper "NLP in Human Rights Research - Extracting Knowledge Graphs About Police and Army Units and Their Commanders", published January 2021, discusses the system's purpose, development, outcomes and performance.
The training data used to build the model is hosted in our nlp_starter_dataset repository.
We designed a pipeline that extracts a particular kind of knowledge graph: it recognizes a person's name and links it to that person's rank, role, title and organization. The pipeline is not expected to be perfect, that is, to recognize every relevant person and exclude every irrelevant one. Rather, it is a first step toward reducing the manual work of extracting this knowledge from a large number of documents.
The pipeline consists of two major components: Named Entity Recognition (NER) and Relation Extraction (RE). The NER component uses a BiLSTM-CNNs-CRF model to recognize names, ranks, roles, titles and organizations in raw text files. The RE component then relates each name to the corresponding rank, role, title or organization.
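To make the target structure concrete, here is a minimal sketch of how the extracted relations could be collected into a person-centred graph. It is an illustration only, not the repository's internal representation: the names `PersonNode`, `build_graph` and `ATTRIBUTE_TYPES`, the triple format, and the sample people and units are all made up for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple

# Hypothetical label set, taken from the entity types named in the text above.
ATTRIBUTE_TYPES = ("rank", "role", "title", "organization")


@dataclass
class PersonNode:
    """One person plus the attributes that relation extraction linked to them."""
    name: str
    attributes: Dict[str, List[str]] = field(
        default_factory=lambda: {t: [] for t in ATTRIBUTE_TYPES}
    )


def build_graph(relations: Iterable[Tuple[str, str, str]]) -> Dict[str, PersonNode]:
    """Group (name, attribute_type, value) triples by person name."""
    graph: Dict[str, PersonNode] = {}
    for name, attr_type, value in relations:
        if attr_type not in ATTRIBUTE_TYPES:
            raise ValueError(f"Unknown attribute type: {attr_type}")
        node = graph.setdefault(name, PersonNode(name))
        # Skip duplicates so repeated mentions do not inflate the node.
        if value not in node.attributes[attr_type]:
            node.attributes[attr_type].append(value)
    return graph


# Made-up triples standing in for the output of the NER and RE components.
relations = [
    ("John Doe", "rank", "Colonel"),
    ("John Doe", "organization", "3rd Infantry Brigade"),
    ("John Doe", "role", "Commander"),
    ("Jane Roe", "title", "Inspector General"),
    ("John Doe", "rank", "Colonel"),  # repeated mention, ignored
]
graph = build_graph(relations)
```

Each person becomes one node, and every attribute type maps to a list because a document may mention several ranks or organizations for the same person.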
Tensorflow 2.2.0
Tensorflow-addons
SpaCy
NumPy
DyNet
Pathlib
Package: https://pypi.org/project/extract-sfm/
$ pip install extract_sfm
Create a Python file containing:
import extract_sfm
extract_sfm.extract("/PATH/TO/DIRECTORY/OF/INPUT/FILES")
Then run the Python file. This may take a while to finish.
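The command-line notes in this document ask for an absolute path. Assuming the same is advisable for the package, the sketch below resolves and checks the input directory before calling `extract_sfm.extract`. The helper names `resolve_input_dir` and `run_extraction` are hypothetical and not part of the package; only `extract_sfm.extract` comes from the usage shown above.

```python
from pathlib import Path


def resolve_input_dir(path):
    """Return the absolute form of `path`, or raise if it is not a directory."""
    directory = Path(path).expanduser().resolve()
    if not directory.is_dir():
        raise NotADirectoryError(f"Not a directory: {directory}")
    return str(directory)


def run_extraction(path):
    """Hypothetical wrapper: validate the directory, then run the pipeline."""
    # Imported here so resolve_input_dir can be used without the package installed.
    import extract_sfm

    extract_sfm.extract(resolve_input_dir(path))
```

Calling `run_extraction("~/documents/input")` would then pass the expanded absolute path to the package, and a mistyped directory fails early with a clear error instead of partway through a long run.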
Download this GitHub repository. Then, under the project root directory, run the Python script:
$ python pipeline.py /PATH/TO/DIRECTORY/OF/INPUT/FILES
Note 1: Use an absolute path.
Note 2: Using time_pipeline.py instead of pipeline.py will produce an additional "time.txt" file, which includes how much time each component of the pipeline takes to run.
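For readers curious how per-component timing of this kind can be collected, here is a generic sketch. It does not reproduce time_pipeline.py: the `timed` helper, the `format_report` function and the report layout are assumptions, the real format of "time.txt" may differ, and the `time.sleep` calls are placeholders for the actual NER and RE steps.

```python
import time
from contextlib import contextmanager

# Maps a component label to the seconds it took to run.
timings = {}


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


def format_report(results):
    """One line per component; this layout is an assumption, not the real file format."""
    return "\n".join(f"{label}: {seconds:.2f} s" for label, seconds in results.items())


with timed("NER"):
    time.sleep(0.01)  # placeholder for the Named Entity Recognition step

with timed("RE"):
    time.sleep(0.01)  # placeholder for the Relation Extraction step

report = format_report(timings)
```

Writing `report` to a file with `open("time.txt", "w")` would give a result similar in spirit to the file described in Note 2.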
- Copy NER_v2, RE, pipeline.py into the "SERVER/KGE" directory
- Install npm dependencies under the "SERVER" directory: express, path, multer
$ npm install <package name>
- Run the server by typing in:
$ node server.js
The documentation for NER and RE is stored in: doc.txt