
ImpNatUKE: Improving Natural Product Knowledge Extraction from Academic Literature on NatUKE Benchmark

Welcome to ImpNatUKE! Here we explain how to use the original NatUKE benchmark, describe the additions we made, and show how to run them.

Usability

An explanation of the original NatUKE source code for understanding, running, and evaluating the experiments.

Source code breakdown

(This source code breakdown also reflects the regular execution flow of natuke.)

Here we explain all the source code in the repository and the order in which to execute it:

  1. clean_pdfs.ipynb: loads the PDFs listed in the database and prepares two dataframes for further use;
  2. phrases_flow.py: loads the texts dataframe and splits the texts into 512-token phrases;
  3. topic_generation.ipynb: loads the phrases dataframe and creates topic clusters using BERTopic [4];
  4. topic_distribution.ipynb: loads the BERTopic model and the phrases dataframe, distributes the topics while filtering by an upper limit on the topic proportion, and outputs the resulting dataframe;
  5. hin_generation.ipynb: loads the filtered topics dataset and the paper information to generate the usable knowledge graph;
  6. knn_dynamic_benchmark.py: runs the experiments using the generated knowledge graph, with the parameters set in the main portion of the code. For the BiKE challenge (First International Biochemical Knowledge Extraction Challenge, http://aksw.org/bike/), knn_dynamic_benchmark_splits.py runs the experiments using the splits tailored for that challenge;
  7. dynamic_benchmark_evaluation.py: generates hits@k and MRR metrics for the experiments, allowing different parameters to be set for the algorithms used as well as for the metrics (see the metrics sketch after this list);
  8. execution_time_processer.py: processes the .txt files dynamically generated by the knn_dynamic_benchmark.py experiments into a dataframe of execution times;
  9. metric_graphs.py: generates customized plots from the metric results and execution times;
  • natuke_utils.py: contains the source for the embedding methods, split algorithms, similar-entity prediction, and metrics;
  • exploration.ipynb: used to explore the data, e.g., the quantities of each property.
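
For reference, here is a minimal sketch of how hits@k and MRR can be computed over ranked predictions. The `ranked`/`truth` dictionary format is an assumption for illustration; the actual implementation lives in natuke_utils.py and may differ:

```python
# Minimal sketch of hits@k and MRR; the dictionary format is an
# assumption, not the exact interface of natuke_utils.py.
def hits_at_k(ranked, truth, k):
    # Fraction of queries whose correct entity appears in the top k.
    hits = sum(1 for q, cands in ranked.items() if truth[q] in cands[:k])
    return hits / len(ranked)

def mrr(ranked, truth):
    # Mean of 1/rank of the correct entity (0 if it is absent).
    total = 0.0
    for q, cands in ranked.items():
        if truth[q] in cands:
            total += 1.0 / (cands.index(truth[q]) + 1)  # ranks are 1-based
    return total / len(ranked)

# Example: one query whose correct answer is ranked second.
ranked = {"doi:10.1000/xyz": ["caffeine", "quercetin", "lupeol"]}
truth = {"doi:10.1000/xyz": "quercetin"}
print(hits_at_k(ranked, truth, 1))  # 0.0
print(mrr(ranked, truth))           # 0.5
```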

ImpNatUKE source code breakdown

Here we explain the source code changes from NatUKE and where they apply:

  1. clean_grobid.py: cleans the processed .xml files output by GROBID (https://github.com/kermitt2/grobid) into .txt files that are then used to generate the embeddings. It substitutes clean_pdfs.ipynb and phrases_flow.py for the GROBID experiments;
  2. clean_nougat.py: cleans the processed .md files output by Nougat (https://github.com/facebookresearch/nougat) into .txt files that are then used to generate the embeddings. It substitutes clean_pdfs.ipynb and phrases_flow.py for the Nougat experiments;
  3. old_processed_pdfs.py: saves the PyMuPDF extractions into .txt files for use in LLM embedding generation;
  4. hin_save_splits.py: loads the embeddings file and connects it with the topics and other data to generate the networkx graph, then saves the train/test split files in the BiKE challenge format. It substitutes hin_generation.ipynb for the LLM experiments;
  5. hin_bert_splits.py: loads the processed .txt files from either GROBID or Nougat and uses the original DistilBERT model to generate the embeddings, before connecting them with the topics and other data to generate the networkx graph and saving the train/test split files in the BiKE challenge format. It substitutes hin_generation.ipynb for the BERT experiments with the new PDF processors;
  6. tsne_plot.ipynb: generates 2D t-SNE plots of the embedding models for a visual comparison of the reduced embeddings;
  7. generate_embs_llms.ipynb: code to generate the LLM embeddings for the .txt files (see the sketch below).
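
As an illustration of that last step, a minimal sketch of LLM embedding generation with the ollama Python client. The model tag, the texts/ folder layout, and the pickled output are assumptions; generate_embs_llms.ipynb may differ:

```python
# Sketch: generate LLM embeddings for cleaned .txt files via ollama.
# Assumptions: a local ollama server with the model already pulled,
# input files under ./texts, and pickled output.
import glob
import pickle

import ollama

embeddings = {}
for path in glob.glob("texts/*.txt"):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # One embedding vector per document.
    response = ollama.embeddings(model="llama3.1", prompt=text)
    embeddings[path] = response["embedding"]

with open("embeddings.pkl", "wb") as f:
    pickle.dump(embeddings, f)
```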

Installation and running

All experiments were tested in a conda virtual environment with Python 3.8. With conda installed, the virtual environments should be created with:

conda create --name [name] python=3.8

Install the requirements:

cd natuke
conda activate [name]
pip install -r requirements.txt
pip install ollama

GraphEmbeddings

The GraphEmbeddings submodule is based on https://github.com/shenweichen/GraphEmbedding, but the algorithms used here work with TensorFlow 2.x.

To install this version of GraphEmbeddings run:

cd GraphEmbeddings
python setup.py install

To run the benchmark, execute knn_dynamic_benchmark.py after setting the path to the data repository and the KG name in the code. Other parameters can also be changed within the code. You can access a full KG and splits at: https://drive.google.com/drive/folders/1NXLQQsIXe0hz32KSOeSG1PCAzFLHoSGh?usp=sharing
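
As a rough sketch of the kind of settings to edit, with hypothetical variable names (the script's actual identifiers live in its main portion and may differ):

```python
# Hypothetical configuration block; the real variable names are in the
# main portion of knn_dynamic_benchmark.py and may differ.
path = "path-to-data-repository"  # folder holding the KG and splits
kg_name = "hin_generated"         # hypothetical KG file name
algorithms = ["deep_walk", "node2vec", "metapath2vec", "regularization"]
splits = [0.8]                    # maximum train/test split percentage
```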

Metapath2Vec

The metapath2vec submodule is based on https://stellargraph.readthedocs.io/en/stable/demos/link-prediction/metapath2vec-link-prediction.html
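
Following the linked StellarGraph demo, a minimal sketch of metapath2vec on a toy heterogeneous graph. The node types, metapaths, and gensim 4 API usage here are illustrative assumptions, not the benchmark's actual configuration:

```python
# Sketch of metapath2vec following the StellarGraph demo; the graph,
# node types, and metapaths are toy assumptions for illustration.
import networkx as nx
from gensim.models import Word2Vec
from stellargraph import StellarGraph
from stellargraph.data import UniformRandomMetaPathWalk

# Toy heterogeneous graph; node types sit in the "label" attribute,
# which StellarGraph.from_networkx reads by default.
g = nx.Graph()
g.add_node("paper1", label="paper")
g.add_node("paper2", label="paper")
g.add_node("quercetin", label="compound")
g.add_edges_from([("paper1", "quercetin"), ("paper2", "quercetin")])
graph = StellarGraph.from_networkx(g)

# Metapaths must start and end on the same node type.
metapaths = [["paper", "compound", "paper"]]

walker = UniformRandomMetaPathWalk(graph)
walks = walker.run(nodes=list(graph.nodes()), length=8, n=5,
                   metapaths=metapaths)

# Train skip-gram on the walks to obtain node embeddings
# (vector_size is the gensim 4 name; gensim 3 calls it size).
model = Word2Vec([[str(n) for n in w] for w in walks],
                 vector_size=64, window=3, min_count=0, sg=1)
print(model.wv["quercetin"][:5])
```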

Environments compatibility

For a better user experience, we recommend setting up two virtual environments for running the code:

  • requirements.txt for all the code except topic_distribution.ipynb, topic_generation.ipynb, and hin_generation.ipynb;
  • requirements_topic.txt for topic_distribution.ipynb, topic_generation.ipynb, and hin_generation.ipynb (BERTopic requires a different numpy version for numba).

Preparing new files for performance evaluation

The files must follow a naming convention, for example knn_results_deep_walk_0.8_doi_bioActivity_0_2nd.csv (assembled programmatically in the sketch after this list):

  • knn_results is the name of the experiments batch used in natuke;
  • deep_walk is the name of the algorithm;
  • 0.8 represents the maximum train/test split percentage in the evaluation stages;
  • doi_bioActivity is the edge_type restored for evaluation;
  • 0 identifies the random sampling of the splits;
  • and 2nd is the evaluation stage.
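
To make the convention concrete, a short sketch that assembles the example file name from its parts (illustrative only; dynamic_benchmark_evaluation.py handles these names internally):

```python
# Illustrative: how the evaluation parameters map onto a results file
# name like knn_results_deep_walk_0.8_doi_bioActivity_0_2nd.csv.
file_name = "knn_results"       # experiments batch
algorithm = "deep_walk"         # embedding algorithm
split = 0.8                     # maximum train/test split percentage
edge_group = "doi_bioActivity"  # edge type restored for evaluation
sample = 0                      # random sampling of the splits
stage = "2nd"                   # evaluation stage

result_file = f"{file_name}_{algorithm}_{split}_{edge_group}_{sample}_{stage}.csv"
print(result_file)  # knn_results_deep_walk_0.8_doi_bioActivity_0_2nd.csv
```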

All these parameters can be changed as needed in this portion of dynamic_benchmark_evaluation.py:

path = 'path-to-data-repository'  # folder holding the result .csv files
file_name = "knn_results"  # experiments batch
splits = [0.8]  # maximum train/test split percentages
#edge_groups = ['doi_name', 'doi_bioActivity', 'doi_collectionSpecie', 'doi_collectionSite', 'doi_collectionType']
edge_group = 'doi_collectionType'  # edge type restored for evaluation
#algorithms = ['bert', 'deep_walk', 'node2vec', 'metapath2vec', 'regularization']
algorithms = ['deep_walk', 'node2vec', 'metapath2vec', 'regularization']  # algorithms to evaluate
k_at = [1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50]  # k values for hits@k
dynamic_stages = ['1st', '2nd', '3rd', '4th']  # evaluation stages

Benchmark

More information about the NatUKE benchmark is available at https://github.com/aksw/natuke#benchmark.

Data

More information about the dataset used for evaluation is available at https://github.com/aksw/natuke#data.

Models

More information about the models evaluated is available at https://github.com/aksw/natuke#models.

Results

Original NatUKE

Results from the original NatUKE benchmark are available at https://github.com/aksw/natuke#results.

ImpNatUKE

Results with the PDF file processors PyMuPDF, GROBID, and Nougat for the language model DistilBERT (rows are property/k pairs; columns are the four evaluation stages):

DistilBERT, PyMuPDF:

| | k | 1st | 2nd | 3rd | 4th |
|---|---|---|---|---|---|
| C | 50 | 0.09 ± 0.01 | 0.02 ± 0.01 | 0.03 ± 0.02 | 0.04 ± 0.05 |
| B | 5 | 0.55 ± 0.06 | 0.57 ± 0.07 | 0.60 ± 0.08 | 0.64 ± 0.07 |
| B | 1 | 0.17 ± 0.05 | 0.19 ± 0.05 | 0.24 ± 0.06 | 0.25 ± 0.06 |
| S | 50 | 0.36 ± 0.04 | 0.24 ± 0.03 | 0.29 ± 0.07 | 0.30 ± 0.06 |
| S | 20 | 0.10 ± 0.02 | 0.15 ± 0.03 | 0.19 ± 0.05 | 0.22 ± 0.07 |
| L | 20 | 0.53 ± 0.03 | 0.52 ± 0.06 | 0.55 ± 0.04 | 0.55 ± 0.06 |
| L | 5 | 0.26 ± 0.04 | 0.29 ± 0.05 | 0.30 ± 0.07 | 0.27 ± 0.07 |
| T | 1 | 0.71 ± 0.04 | 0.66 ± 0.10 | 0.75 ± 0.10 | 0.75 ± 0.11 |

DistilBERT, GROBID:

| | k | 1st | 2nd | 3rd | 4th |
|---|---|---|---|---|---|
| C | 50 | 0.09 ± 0.01 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 |
| B | 5 | 0.58 ± 0.05 | 0.64 ± 0.03 | 0.69 ± 0.06 | 0.73 ± 0.08 |
| B | 1 | 0.19 ± 0.02 | 0.23 ± 0.03 | 0.28 ± 0.06 | 0.35 ± 0.10 |
| S | 50 | 0.34 ± 0.03 | 0.24 ± 0.03 | 0.29 ± 0.06 | 0.34 ± 0.10 |
| S | 20 | 0.10 ± 0.03 | 0.17 ± 0.03 | 0.22 ± 0.04 | 0.28 ± 0.07 |
| L | 20 | 0.56 ± 0.04 | 0.62 ± 0.03 | 0.62 ± 0.05 | 0.62 ± 0.08 |
| L | 5 | 0.28 ± 0.04 | 0.35 ± 0.05 | 0.36 ± 0.04 | 0.35 ± 0.08 |
| T | 1 | 0.77 ± 0.02 | 0.75 ± 0.04 | 0.76 ± 0.05 | 0.77 ± 0.06 |

DistilBERT, Nougat:

| | k | 1st | 2nd | 3rd | 4th |
|---|---|---|---|---|---|
| C | 50 | 0.09 ± 0.01 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 |
| B | 5 | 0.59 ± 0.06 | 0.66 ± 0.05 | 0.69 ± 0.05 | 0.71 ± 0.11 |
| B | 1 | 0.19 ± 0.02 | 0.25 ± 0.04 | 0.30 ± 0.06 | 0.33 ± 0.10 |
| S | 50 | 0.34 ± 0.03 | 0.23 ± 0.03 | 0.29 ± 0.05 | 0.30 ± 0.10 |
| S | 20 | 0.11 ± 0.03 | 0.18 ± 0.03 | 0.21 ± 0.03 | 0.25 ± 0.09 |
| L | 20 | 0.56 ± 0.04 | 0.62 ± 0.03 | 0.63 ± 0.05 | 0.65 ± 0.08 |
| L | 5 | 0.27 ± 0.05 | 0.31 ± 0.04 | 0.35 ± 0.08 | 0.38 ± 0.09 |
| T | 1 | 0.78 ± 0.01 | 0.78 ± 0.04 | 0.78 ± 0.05 | 0.80 ± 0.09 |

Results with the PDF file processors PyMuPDF, GROBID, and Nougat for the language model llama-3.1:

llama-3.1, PyMuPDF:

| | k | 1st | 2nd | 3rd | 4th |
|---|---|---|---|---|---|
| C | 50 | 0.09 ± 0.01 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 |
| B | 5 | 0.51 ± 0.07 | 0.51 ± 0.04 | 0.51 ± 0.06 | 0.54 ± 0.08 |
| B | 1 | 0.13 ± 0.03 | 0.11 ± 0.03 | 0.11 ± 0.03 | 0.14 ± 0.04 |
| S | 50 | 0.34 ± 0.04 | 0.23 ± 0.03 | 0.28 ± 0.07 | 0.26 ± 0.06 |
| S | 20 | 0.10 ± 0.03 | 0.11 ± 0.03 | 0.11 ± 0.03 | 0.13 ± 0.05 |
| L | 20 | 0.55 ± 0.05 | 0.58 ± 0.04 | 0.59 ± 0.06 | 0.55 ± 0.09 |
| L | 5 | 0.23 ± 0.04 | 0.22 ± 0.03 | 0.23 ± 0.06 | 0.22 ± 0.04 |
| T | 1 | 0.64 ± 0.11 | 0.58 ± 0.10 | 0.55 ± 0.12 | 0.55 ± 0.10 |

llama-3.1, GROBID:

| | k | 1st | 2nd | 3rd | 4th |
|---|---|---|---|---|---|
| C | 50 | 0.09 ± 0.01 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 |
| B | 5 | 0.52 ± 0.06 | 0.46 ± 0.03 | 0.45 ± 0.03 | 0.46 ± 0.07 |
| B | 1 | 0.12 ± 0.02 | 0.09 ± 0.02 | 0.08 ± 0.03 | 0.08 ± 0.04 |
| S | 50 | 0.34 ± 0.03 | 0.22 ± 0.03 | 0.25 ± 0.04 | 0.26 ± 0.11 |
| S | 20 | 0.09 ± 0.02 | 0.11 ± 0.04 | 0.12 ± 0.04 | 0.13 ± 0.09 |
| L | 20 | 0.56 ± 0.04 | 0.58 ± 0.04 | 0.54 ± 0.07 | 0.53 ± 0.08 |
| L | 5 | 0.19 ± 0.05 | 0.18 ± 0.05 | 0.17 ± 0.06 | 0.13 ± 0.04 |
| T | 1 | 0.57 ± 0.07 | 0.62 ± 0.07 | 0.62 ± 0.06 | 0.58 ± 0.11 |

Results with the PDF file processors PyMuPDF, GROBID, and Nougat for the language model Gemma 2:

Gemma 2, PyMuPDF:

| | k | 1st | 2nd | 3rd | 4th |
|---|---|---|---|---|---|
| C | 50 | 0.09 ± 0.01 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 |
| B | 5 | 0.52 ± 0.05 | 0.51 ± 0.09 | 0.53 ± 0.06 | 0.58 ± 0.10 |
| B | 1 | 0.13 ± 0.02 | 0.14 ± 0.03 | 0.12 ± 0.04 | 0.16 ± 0.05 |
| S | 50 | 0.34 ± 0.04 | 0.22 ± 0.03 | 0.26 ± 0.05 | 0.24 ± 0.08 |
| S | 20 | 0.10 ± 0.03 | 0.11 ± 0.02 | 0.12 ± 0.03 | 0.11 ± 0.06 |
| L | 20 | 0.56 ± 0.04 | 0.55 ± 0.04 | 0.57 ± 0.06 | 0.55 ± 0.08 |
| L | 5 | 0.22 ± 0.04 | 0.22 ± 0.03 | 0.25 ± 0.04 | 0.23 ± 0.08 |
| T | 1 | 0.74 ± 0.06 | 0.71 ± 0.10 | 0.68 ± 0.16 | 0.69 ± 0.15 |

License

The code and experiments are available as open source under the terms of the Apache 2.0 License.

The dataset used for training and benchmarking is available under the Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license, which allows the use of the data only in its current form.

Wiki

The original NatUKE benchmark has an extended version of the paper and other information at the wiki page: https://github.com/AKSW/natuke/wiki/NatUKE-Wiki.
