This repository contains the code for reproducing the preliminary results reported in the paper "Named Entity Recognition as Graph Classification" (currently under review for the ESWC 2021 poster track).
The code is organized as notebooks, to be used as follows:
final_generate_gazetteers.ipynb
: to generate gazetteers from Wikidata (by specifying a list of QIDs corresponding to the entity types one wishes to extract)

edge_list_generation.ipynb
: to generate the graph structure used to build the graph embeddings; when applied to the CoNLL 2003 train dataset, one should get a result similar to this Python dict data structure

graph_embeddings_generation.ipynb
: to generate node embeddings using one of the algorithms (e.g. node2vec, SDNE) provided by the GEM library

node2vec_classification.ipynb
: to train a model on the node2vec embeddings

transE_classification.ipynb
: to train a model on the TransE embeddings

autoencoder_embeddings.ipynb
: to generate auto-encoder embeddings from the binary graph representations

autoencoder_classification.ipynb
: to train a model on the auto-encoder embeddings

GCN_classification.ipynb
: to train a Graph Convolutional Network (based on this architecture)
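To give a feel for the kind of adjacency dict mentioned for edge_list_generation.ipynb, here is a minimal sketch that links each token to its neighbours within a fixed window. This is an illustration only: the function name, the window-based construction, and the toy sentence are assumptions, not the notebook's actual graph-building logic.

```python
from collections import defaultdict

def build_edge_dict(sentences, window=1):
    """Build an adjacency dict mapping each token to the tokens that
    co-occur with it within `window` positions.
    Illustrative sketch only: the notebook's actual graph construction
    (node definition, edge criteria) may differ."""
    edges = defaultdict(set)
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    edges[tok].add(tokens[j])
    # Sort neighbour lists for reproducible output
    return {tok: sorted(neigh) for tok, neigh in edges.items()}

sentences = [["EU", "rejects", "German", "call"]]
print(build_edge_dict(sentences))
```

A dict of this shape (token → neighbour list) can then be consumed by graph-embedding libraries such as GEM after converting it to an edge list.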
The code will be streamlined into stand-alone configurable scripts and fully documented soon.
Dependencies:

- Python 3.8
- PyTorch 1.7
- GEM
- PyTorch Geometric
- SPARQLWrapper
- tqdm
- Numpy
- Pandas
The table below shows the best performance of each model on the CoNLL-2003 validation (dev) set:
Method | Accuracy | Micro-F1 | Macro-F1 |
---|---|---|---|
Auto-encoder | 91.8 | 91.5 | 71.7 |
Node2Vec | 93.8 | 94.1 | 82.1 |
Trans-E | 94.1 | 93.6 | 78.8 |
GCN | 96.5 | 96.5 | 88.8 |
As for test set performance:
Method | Micro-F1 | Macro-F1 |
---|---|---|
Auto-encoder | 91.5 | 70.4 |
Node2Vec | 91.1 | 72.6 |
Trans-E | 91.9 | 74.5 |
GCN | 94.1 | 81.0 |
LUKE | 94.3 | – |