This repository contains the open-source code for the article “Improving Issue-PR Link Prediction via Knowledge-aware Heterogeneous Graph”.
In this work, we design an approach, named AIPL, that predicts Issue-PR links on GitHub. It uses a heterogeneous graph to model multi-type GitHub data and employs metapath-based techniques to capture the information passed among the different data types. Given an issue and a PR, AIPL suggests whether a link between them is likely.
AIPL is implemented in PyTorch and was run on a server equipped with an NVIDIA GTX 1060 GPU, using the following package versions:
- Python 3.7.3
- PyTorch 1.13.1
- NumPy 1.21.5
- Pandas 1.3.4
- scikit-learn 1.0.2
- scipy 1.7.3
- DGL 0.6.1
- NetworkX 2.6.3
We release our annotated dataset in this directory.
The annotated dataset is based on the repositories facebook_react and vuejs/vue:
- Index: information about the nodes and edges of the heterogeneous graph
- Features: embeddings of the nodes
- Training set & Test set: the annotated dataset
- adjM.npz: the adjacency matrix of the heterogeneous graph
Note that the metapath files are too large to upload to this repository. However, all of them can be generated by running construct_metapath.py.
- baseline: the code of our baselines, including iLinker, A-M, random walk, metapath2vec, R-GCN, GTN, Simple-HGN, HGT, HAN, SeHGNN, and MECCH.
- AIPL: the code of AIPL. Please read the following introduction for a better understanding.
The code for our method covers three tasks: building the heterogeneous graph, constructing metapaths, and training the graph-based model.
The first step is to run build_graph.py. The second step is to run construct_metapath.py. The third step is to run AIPL_main.py.
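The three steps can also be chained from a small driver script. The following is a minimal sketch, assuming the three scripts sit in one directory and take no required arguments; the helper names `pipeline_commands` and `run_pipeline` are hypothetical and not part of this repository.

```python
import subprocess
import sys
from pathlib import Path

# Script names come from this README. Running them from a single working
# directory without arguments is an assumption.
PIPELINE = ["build_graph.py", "construct_metapath.py", "AIPL_main.py"]


def pipeline_commands(python=sys.executable, scripts=PIPELINE):
    """Return one command per step, in the order the steps must run."""
    return [[python, script] for script in scripts]


def run_pipeline(workdir="."):
    """Run each step in order and stop at the first missing script or failure."""
    for command in pipeline_commands():
        script = Path(workdir) / command[1]
        if not script.exists():
            raise FileNotFoundError(f"{script} not found in {workdir}")
        # check=True raises CalledProcessError if a step exits with an error,
        # so later steps never run on incomplete outputs.
        subprocess.run(command, cwd=workdir, check=True)
```

Calling `run_pipeline()` from the directory that holds the scripts runs the whole pipeline. Stopping on the first failure matters because each step reads files written by the previous one.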
The detailed explanations are as follows:
build_graph.py
This script constructs a heterogeneous graph and generates node features for users, repositories (repos), issues, and pull requests (PRs).
It loads the relationship data (user-repo, user-issue, user-PR, repo-repo, repo-issue, repo-PR, issue-issue, issue-PR, and PR-PR) from the corresponding directories and creates an adjacency matrix (adjM).
Additionally, it extracts feature vectors, such as title vectors, from CSV files to create features for repos, issues, and PRs.
The script then saves the adjacency matrix and node features as NumPy arrays for later steps.
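To illustrate how several node types can share one adjacency matrix, here is a minimal sketch, not the code in build_graph.py. It assumes each node type occupies a contiguous block of global indices and that every relation is treated as undirected. The toy sizes, the edge lists, and the helper names `type_offsets` and `build_adjacency` are hypothetical.

```python
import numpy as np

# Hypothetical toy sizes; the real counts would come from the index files.
NODE_COUNTS = {"user": 2, "repo": 1, "issue": 2, "pr": 2}


def type_offsets(node_counts):
    """Map each node type to the first global index of its block."""
    offsets, start = {}, 0
    for node_type, count in node_counts.items():
        offsets[node_type] = start
        start += count
    return offsets, start


def build_adjacency(node_counts, typed_edges):
    """Build a symmetric 0/1 adjacency matrix over all node types.

    typed_edges maps (source_type, target_type) to a list of
    (source_local_id, target_local_id) pairs.
    """
    offsets, total = type_offsets(node_counts)
    adj = np.zeros((total, total), dtype=np.int8)
    for (src_type, dst_type), pairs in typed_edges.items():
        for src, dst in pairs:
            i = offsets[src_type] + src
            j = offsets[dst_type] + dst
            adj[i, j] = 1
            adj[j, i] = 1  # assumption: relations are undirected
    return adj


# Hypothetical toy edges, given as local ids within each node type.
edges = {
    ("user", "issue"): [(0, 0), (1, 1)],
    ("user", "pr"): [(0, 0)],
    ("repo", "issue"): [(0, 0), (0, 1)],
    ("repo", "pr"): [(0, 0), (0, 1)],
    ("issue", "pr"): [(0, 0)],
}
adjM = build_adjacency(NODE_COUNTS, edges)
```

With this layout, users take global indices 0-1, the repo takes 2, issues take 3-4, and PRs take 5-6, so the issue-PR link between issue 0 and PR 0 appears at `adjM[3, 5]`.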
construct_metapath.py
The code first loads data from various edge and index files, including user-repo, user-issue, user-pr, repo-repo, repo-issue, repo-pr, issue-issue, issue-pr, and pr-pr.
It then loads the adjacency matrices and organizes them into lists by node type: users, repositories, issues, and PRs.
Next, the code generates expected metapaths based on predefined patterns. These metapaths are then mapped to corresponding indices and stored in pickle files, numpy arrays, and adjacency lists for further analysis and processing.
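The idea of generating metapath instances from a type pattern can be sketched as follows. This is a simplified illustration and not the code in construct_metapath.py. The patterns issue-user-PR and issue-repo-PR, the toy edges, and the helper names are assumptions, since the predefined patterns are not listed here.

```python
from collections import defaultdict


def build_neighbors(typed_edges):
    """Index undirected typed edges as neighbors[(type_a, type_b)][id_a] -> [id_b, ...]."""
    neighbors = defaultdict(lambda: defaultdict(list))
    for (type_a, type_b), pairs in typed_edges.items():
        for a, b in pairs:
            neighbors[(type_a, type_b)][a].append(b)
            if (type_a, a) != (type_b, b):  # do not duplicate self-loops
                neighbors[(type_b, type_a)][b].append(a)
    return neighbors


def metapath_instances(neighbors, metapath):
    """Enumerate every node-id sequence that follows the given type pattern."""
    first_hop = neighbors[(metapath[0], metapath[1])]
    paths = [[start] for start in sorted(first_hop)]
    for src_type, dst_type in zip(metapath, metapath[1:]):
        hop = neighbors[(src_type, dst_type)]
        # Extend each partial path by every neighbor of its last node.
        paths = [path + [nxt] for path in paths for nxt in hop.get(path[-1], [])]
    return [tuple(path) for path in paths]


# Hypothetical toy edges, given as local ids within each node type.
edges = {
    ("user", "issue"): [(0, 0), (1, 1)],
    ("user", "pr"): [(0, 0), (1, 0)],
    ("repo", "issue"): [(0, 0), (0, 1)],
    ("repo", "pr"): [(0, 0)],
}
neighbors = build_neighbors(edges)
issue_user_pr = metapath_instances(neighbors, ("issue", "user", "pr"))
issue_repo_pr = metapath_instances(neighbors, ("issue", "repo", "pr"))
print(issue_user_pr)  # [(0, 0, 0), (1, 1, 0)]
print(issue_repo_pr)  # [(0, 0, 0), (1, 0, 0)]
```

Each tuple lists local ids in the order of the pattern, so `(1, 1, 0)` under issue-user-PR reads as issue 1, user 1, PR 0. The number of instances grows quickly with graph size and path length, which is consistent with the note above that the metapath files are large.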
AIPL_main.py
This script handles model training and inference. Users can train and evaluate AIPL by running AIPL_main.py.
The script handles data loading, model setup, training with early stopping, and evaluation using metrics like accuracy, precision, recall, and F1-score.
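For readers unfamiliar with these pieces, the following is a minimal, dependency-free sketch of binary link-prediction metrics and a patience-based early-stopping rule. It is not the code in AIPL_main.py. The patience value, the class and function names, and the sample labels and losses are illustrative assumptions.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Label 1 means "this issue and PR are linked"; the values are made up.
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])

stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate([0.9, 0.7, 0.71, 0.72, 0.6]):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

In this toy run all four metrics equal 0.75, and training stops at epoch index 3 because the loss fails to improve on 0.7 for two consecutive epochs.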
Specifically, data loading and batching are implemented in data.py, preprocess.py, and tools.py in the 'utils' folder.
The AIPL model consists of intra-metapath aggregation, inter-metapath aggregation, and an attention mechanism.
These components are implemented in base_magnn.py and magnn_lp.py under the 'magnn_model' directory and are called directly by AIPL_main.py.
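The two aggregation levels can be sketched in NumPy as follows. This is a heavily simplified illustration for a single target node and is not the code in base_magnn.py or magnn_lp.py. It assumes a mean instance encoder, single-head dot-product attention, and random vectors standing in for learned attention parameters. All function names are hypothetical.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()


def intra_metapath(instances, attention_vector):
    """Aggregate the instances of ONE metapath for one target node.

    instances has shape (num_instances, path_length, dim). Each instance is
    encoded by averaging its node features (assumed mean encoder), then the
    encodings are combined with attention weights.
    """
    encoded = instances.mean(axis=1)               # (num_instances, dim)
    weights = softmax(encoded @ attention_vector)  # (num_instances,)
    return weights @ encoded                       # (dim,)


def inter_metapath(per_metapath, attention_vector):
    """Fuse one summary vector per metapath into a single node embedding."""
    stacked = np.stack(per_metapath)               # (num_metapaths, dim)
    weights = softmax(np.tanh(stacked) @ attention_vector)
    return weights @ stacked, weights


rng = np.random.default_rng(0)
dim = 4
# Hypothetical instance features for two metapaths of length 3.
instances_a = rng.normal(size=(3, 3, dim))
instances_b = rng.normal(size=(5, 3, dim))
intra_attention = rng.normal(size=dim)  # stands in for a learned parameter
inter_attention = rng.normal(size=dim)  # stands in for a learned parameter

summaries = [
    intra_metapath(instances_a, intra_attention),
    intra_metapath(instances_b, intra_attention),
]
embedding, metapath_weights = inter_metapath(summaries, inter_attention)
```

The intra-metapath step weighs individual path instances, while the inter-metapath step weighs whole metapaths, so the model can learn both which concrete paths and which relation patterns matter for a given node. In a trained model the attention vectors would be learned together with the other parameters.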
You can also set the hyperparameters in AIPL_main.py, including learning_rate, epoch_number, drop_out, the number of attention heads, and the instance encoder.
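One common way to expose such settings is through command-line flags. The sketch below uses `argparse` and is only an illustration: whether AIPL_main.py uses flags or plain variables is not stated here, and the flag names, default values, and encoder choices are all hypothetical placeholders.

```python
import argparse


def build_parser():
    """Build a parser for hypothetical training options (placeholder defaults)."""
    parser = argparse.ArgumentParser(description="Hypothetical AIPL training options")
    parser.add_argument("--learning-rate", type=float, default=0.005)
    parser.add_argument("--epochs", type=int, default=100)
    parser.add_argument("--dropout", type=float, default=0.5)
    parser.add_argument("--num-heads", type=int, default=8)
    # The encoder choices below are assumptions, not a list from the repository.
    parser.add_argument("--instance-encoder", choices=["mean", "linear"], default="mean")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--learning-rate", "0.001", "--num-heads", "4"])
print(args.learning_rate, args.epochs, args.dropout, args.num_heads, args.instance_encoder)
```

Flags that are not passed keep their defaults, so a run can override only the settings being tuned.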
All copyright of the tool is owned by the author of the paper.