This repository investigates and analyzes entity normalization between RTX-KG2 (versions 2.8.4c and 2.9.0c) and DrugMechDB (Andrew Su Lab).
- Investigate Entity (Node) Variances:
  - Explore and analyze differences between KG2 and DrugMechDB.
  - Conduct entity normalization and alignment between the two graphs.
- Statistical Analysis:
  - Assess entity normalization and alignment performance.
Each script states where its data is located and gives the instructions needed to run its analysis. To conduct the analysis, use the scripts (.ipynb) in the following order:
- node_edge_combine.sh: shell script that combines the KG2 node and edge files with their headers. The node and edge data are stored separately from their header files. The resulting files are in TSV format, with the node and edge headers at the top.
- entity_normalization: contains scripts for normalizing DrugMechDB node IDs to the latest Biolink standard. The resulting file is named NodeID_update_graphs.yaml.
- entity_alignment: contains a script that aligns nodes between KG2 and DrugMechDB. The first step matches on node IDs only; the second step falls back to node names when node IDs do not match.
- KG2_vs_DrugMechDB_Analysis_KG2.x.x.ipynb: an interactive Jupyter Notebook that performs statistical analysis of the entity alignment.
- KG2.8.4c and KG2.9.0c: directories containing YAML files of the analysis results and figures (PNG). Each directory contains figure directories, which hold the non-interactive visual output of the analysis, saved as PNG files.
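The two-step alignment described above (match on node IDs first, then fall back to node names) can be sketched roughly as follows. This is a minimal illustration, not the code in entity_alignment: the function names `normalize_name`, `align_nodes`, and `match_rate`, the `id`/`name` dictionary keys, the case-insensitive name comparison, and the "first KG2 node wins" tie-break are all assumptions made for the example, and the IDs in the usage section are made up.

```python
from typing import Dict, List, Optional, Tuple

# (DrugMechDB node ID, matched KG2 node ID or None, how the match was made)
Alignment = Tuple[str, Optional[str], str]


def normalize_name(name: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simple name normalization)."""
    return " ".join(name.lower().split())


def align_nodes(
    dmdb_nodes: List[Dict[str, str]],
    kg2_nodes: List[Dict[str, str]],
) -> List[Alignment]:
    """Align DrugMechDB nodes to KG2 nodes: by ID first, then by name."""
    kg2_ids = {node["id"] for node in kg2_nodes}

    # Map each normalized name to the first KG2 ID that carries it
    # (hypothetical tie-break when several KG2 nodes share a name).
    kg2_by_name: Dict[str, str] = {}
    for node in kg2_nodes:
        kg2_by_name.setdefault(normalize_name(node["name"]), node["id"])

    results: List[Alignment] = []
    for node in dmdb_nodes:
        # Step 1: exact node ID match.
        if node["id"] in kg2_ids:
            results.append((node["id"], node["id"], "id"))
            continue
        # Step 2: fall back to the node name when the ID does not match.
        kg2_id = kg2_by_name.get(normalize_name(node["name"]))
        if kg2_id is not None:
            results.append((node["id"], kg2_id, "name"))
        else:
            results.append((node["id"], None, "unmatched"))
    return results


def match_rate(results: List[Alignment]) -> float:
    """Fraction of DrugMechDB nodes that were aligned by either step."""
    if not results:
        return 0.0
    matched = sum(1 for _, kg2_id, _ in results if kg2_id is not None)
    return matched / len(results)


if __name__ == "__main__":
    # Made-up toy nodes, for illustration only.
    dmdb = [
        {"id": "EX:1", "name": "Compound A"},
        {"id": "EX:2", "name": "Compound  B"},
        {"id": "EX:3", "name": "Compound C"},
    ]
    kg2 = [
        {"id": "EX:1", "name": "compound a"},
        {"id": "EX:9", "name": "compound b"},
    ]
    alignment = align_nodes(dmdb, kg2)
    for row in alignment:
        print(row)
    print("match rate:", match_rate(alignment))
```

Recording how each node was matched ("id", "name", or "unmatched") keeps the two steps separable in later statistics, so ID-based and name-based alignment can be reported independently.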
To conduct or reproduce the entity analysis, you will need the DrugMechDB MOA file and KG2 versions 2.8.4c and 2.9.0c. Contact the RTX-KG2 team to request access to the knowledge graphs.
To begin your analysis or reproduce the results, follow these steps:
- Clone the repository to your local machine:
  git clone https://github.com/your-username/KG2_DrugMechDB_EA_Analysis.git
- Navigate to the project directory.
- Import, download, and decompress the data files. The location of the data for each analysis is provided in each Jupyter Notebook.
- Open and run the Jupyter Notebooks:
  jupyter notebook <name_of_the_script>.ipynb
  This will initiate the analysis and generate the results.
We recommend the following specifications for running the pipeline and the scripts:
• Operating system(s): Linux (Ubuntu)
• Programming languages: Bash (shell scripts) and Python 3.8.12 or higher (Jupyter Notebooks)
• Other requirements: Anaconda version 23.7.4
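Before running the notebooks, it may help to confirm that the active interpreter meets the Python requirement listed above. The check below is a small sketch and not part of the repository; the names `MIN_PYTHON` and `python_is_supported` are hypothetical.

```python
import sys
from typing import Optional, Sequence, Tuple

# Minimum version taken from the requirements list above.
MIN_PYTHON: Tuple[int, int, int] = (3, 8, 12)


def python_is_supported(
    version: Optional[Sequence[int]] = None,
    minimum: Tuple[int, int, int] = MIN_PYTHON,
) -> bool:
    """Return True if `version` (default: the running interpreter) is >= `minimum`."""
    if version is None:
        current = tuple(sys.version_info[:3])
    else:
        current = tuple(version[:3])
    return current >= minimum


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    if python_is_supported():
        print("Python " + found + " meets the recommended minimum.")
    else:
        print("Python " + found + " is older than the recommended 3.8.12.")
```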