Merge pull request #1 from quantori/dev
Raw code PR
Showing 46 changed files with 131,738 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
data/* | ||
.DS_Store | ||
**/.DS_Store | ||
.idea/ | ||
__pycache__/ | ||
wandb/* | ||
colab_training/* | ||
/.ipynb_checkpoints/ | ||
/venv | ||
/training_log/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
cff-version: 1.2.0 | ||
message: "If you use this software, please cite it as below." | ||
authors: | ||
  - family-names: "Sapegin" | ||
    given-names: "Denis Andzheevich" | ||
    orcid: "https://orcid.org/0000-0002-1446-6288" | ||
title: "Structure Seer" | ||
version: 1.0 | ||
date-released: 2022-07-18 | ||
url: "https://github.com/quantori/structure-seer" | ||
preferred-citation: | ||
  type: article | ||
  authors: | ||
    - family-names: "Sapegin" | ||
      given-names: "Denis A" | ||
      orcid: "https://orcid.org/0000-0002-1446-6288" | ||
    - family-names: "Bear" | ||
      given-names: "Joseph C" | ||
  doi: "10.1039/D3DD00178D" | ||
  journal: "Digital Discovery" | ||
  title: "Structure Seer – a machine learning model for chemical structure elucidation from node labelling of a molecular graph" | ||
  year: 2024 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
# Contributor Covenant Code of Conduct | ||
|
||
## Our Pledge | ||
|
||
In the interest of fostering an open and welcoming environment, we as | ||
contributors and maintainers pledge to make participation in our project and | ||
our community a harassment-free experience for everyone, regardless of age, body | ||
size, disability, ethnicity, sex characteristics, gender identity and expression, | ||
level of experience, education, socio-economic status, nationality, personal | ||
appearance, race, religion, or sexual identity and orientation. | ||
|
||
## Our Standards | ||
|
||
Examples of behavior that contributes to creating a positive environment | ||
include: | ||
|
||
* Using welcoming and inclusive language | ||
* Being respectful of differing viewpoints and experiences | ||
* Gracefully accepting constructive criticism | ||
* Focusing on what is best for the community | ||
* Showing empathy towards other community members | ||
|
||
Examples of unacceptable behavior by participants include: | ||
|
||
* The use of sexualized language or imagery and unwelcome sexual attention or | ||
advances | ||
* Trolling, insulting/derogatory comments, and personal or political attacks | ||
* Public or private harassment | ||
* Publishing others' private information, such as a physical or electronic | ||
address, without explicit permission | ||
* Other conduct which could reasonably be considered inappropriate in a | ||
professional setting | ||
|
||
## Our Responsibilities | ||
|
||
Project maintainers are responsible for clarifying the standards of acceptable | ||
behavior and are expected to take appropriate and fair corrective action in | ||
response to any instances of unacceptable behavior. | ||
|
||
Project maintainers have the right and responsibility to remove, edit, or | ||
reject comments, commits, code, wiki edits, issues, and other contributions | ||
that are not aligned to this Code of Conduct, or to ban temporarily or | ||
permanently any contributor for other behaviors that they deem inappropriate, | ||
threatening, offensive, or harmful. | ||
|
||
## Scope | ||
|
||
This Code of Conduct applies within all project spaces, and it also applies when | ||
an individual is representing the project or its community in public spaces. | ||
Examples of representing a project or community include using an official | ||
project e-mail address, posting via an official social media account, or acting | ||
as an appointed representative at an online or offline event. Representation of | ||
a project may be further defined and clarified by project maintainers. | ||
|
||
## Enforcement | ||
|
||
Instances of abusive, harassing, or otherwise unacceptable behavior may be | ||
reported by contacting the project team. All | ||
complaints will be reviewed and investigated and will result in a response that | ||
is deemed necessary and appropriate to the circumstances. The project team is | ||
obligated to maintain confidentiality with regard to the reporter of an incident. | ||
Further details of specific enforcement policies may be posted separately. | ||
|
||
Project maintainers who do not follow or enforce the Code of Conduct in good | ||
faith may face temporary or permanent repercussions as determined by other | ||
members of the project's leadership. | ||
|
||
## Attribution | ||
|
||
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, | ||
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html | ||
|
||
[homepage]: https://www.contributor-covenant.org | ||
|
||
For answers to common questions about this code of conduct, see | ||
https://www.contributor-covenant.org/faq |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
## Installation | ||
``` | ||
pip install -r requirements.txt | ||
``` | ||
The SCF calculations were performed using [ORCA](https://orcaforum.kofo.mpg.de/app.php/portal). | ||
To use quantum-chemistry calculations, install ORCA and specify its global path | ||
in a corresponding environment variable in ```./data_preparation/parallel_dft_calculation.py```. | ||
|
||
## Repository structure | ||
|
||
- Source code for models is located in ```./models```. | ||
- Model weights trained on QM9 and PubChem Datasets are stored in ```./weights```. | ||
- Utility classes and functions are provided in ```./utils```. | ||
- Scripts for data preparation and SCF calculations can be found in ```./data_preparation```. | ||
- Parallel jobs for ORCA calculations can be run from ```./data_preparation/parallel_dft_calculation.py``` |
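
The installation note above asks for the ORCA path to be supplied through an environment variable. A minimal sketch of how such a lookup could work is given below. The variable name `ORCA_PATH`, the helper `resolve_orca_path`, and the executable name `orca` are assumptions made for illustration; the name actually used in `./data_preparation/parallel_dft_calculation.py` is not shown in this diff.

```python
import os
import shutil
from typing import Optional


def resolve_orca_path(env_var: str = "ORCA_PATH") -> Optional[str]:
    """Return a path to the ORCA executable, or None if none is found.

    `ORCA_PATH` is a hypothetical variable name used for illustration only.
    """
    # Prefer an explicit path taken from the environment variable.
    candidate = os.environ.get(env_var)
    if candidate:
        return candidate
    # Otherwise fall back to searching the directories listed in PATH.
    return shutil.which("orca")
```

Returning `None` instead of raising lets the caller decide whether a missing ORCA installation is fatal, which may be convenient when only the model code is used.
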
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
Quantori Implementation for Structure Seer model | ||
Copyright (C) 2023 Quantori, Denis Sapegin | ||
This product was designed for academic purposes by Denis Sapegin; | ||
The quantum mechanics calculations necessary for dataset preparation depend on the | ||
ORCA (https://orcaforum.kofo.mpg.de/app.php/portal) software, developed by Max-Planck-Institute für Kohlenforschung, | ||
under End User License Agreement (EULA) for the ORCA software (https://orcaforum.kofo.mpg.de/app.php/portal) | ||
|
||
|
||
In addition, this product contains dependencies on files licensed under: | ||
|
||
Apache License, Version 2.0 http://www.apache.org/licenses/LICENSE-2.0 | ||
The MIT License https://opensource.org/licenses/MIT | ||
The 3-Clause BSD License https://opensource.org/licenses/BSD-3-Clause | ||
GNU Affero General Public License v3.0 https://www.gnu.org/licenses/agpl-3.0.en.html |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,68 @@ | ||
# structure-seer | ||
[![DOI:10.1039/D3DD00178D](http://img.shields.io/badge/DOI-10.1039/D3DD00178D-ebe534.svg)](https://doi.org/10.1039/D3DD00178D) | ||
|
||
![PyTorch](https://img.shields.io/badge/PyTorch-%23EE4C2C.svg?style=for-the-badge&logo=PyTorch&logoColor=white) | ||
# Structure Seer | ||
|
||
This repository contains the implementation, training, and evaluation code for the Structure Seer model, which is designed to | ||
reconstruct the adjacency of a molecular graph from the labelling of its nodes. | ||
The model architecture is described in detail in: | ||
[Structure Seer - a machine learning model for chemical structure elucidation | ||
from node labelling of a molecular graph, Digital Discovery, 2023](https://doi.org/10.1039/D3DD00178D) | ||
|
||
## Datasets | ||
|
||
The repository does not contain the initial datasets used for training. | ||
- Small example datasets for detailed model evaluation are provided in ```./example_dataset``` | ||
- Model weights trained on QM9 and PubChem Datasets are stored in ```./weights``` | ||
|
||
## Abstract | ||
|
||
The repository contains the implementation for a novel graph convolution based machine-learning model which | ||
is designed to provide a quantitative probabilistic prediction on the connectivity of the atoms based on the | ||
information on the elemental composition of the molecule along with a list of atom-attributed isotropic shielding | ||
constants. The suggested approach holds significant potential for scalability, as it can harness vast amounts | ||
of information on known chemical structures for the model's learning process. The model architecture allows for | ||
direct structure reconstruction through prediction of molecular graph adjacency based solely on the | ||
labelling of its nodes, which potentially allows dealing with molecules of any size and composition | ||
(given an appropriate training dataset is available) without significant increase in computational resources required. | ||
|
||
## Key approaches | ||
|
||
### Unification of adjacency matrix representation | ||
|
||
The primary challenge in generating the adjacency matrix is that it is not an invariant of a given graph. | ||
For a graph with G nodes, there are up to G! adjacency matrices that describe the same connectivity, one per ordering of the nodes. | ||
To tackle this issue, the adjacency matrix representation needs to be unified. Typically, in the machine-readable | ||
representation of a molecule, its atoms are stored in depth-first tree traversal order. | ||
While this order contains information about the stored structure, it cannot be easily reconstructed when only | ||
the elemental composition of the molecule and the isotropic shielding constant for each atom are known. | ||
Since the shielding constant provides a unique characterization of an atom's chemical environment, it can be | ||
employed to standardize the representation of the adjacency matrix in conjunction with element information. | ||
|
||
### Generic adjacency matrix | ||
|
||
The architecture of the Structure Seer model bears similarities to other GCN-based models used for diverse tasks | ||
involving molecular graphs. However, its distinctive design is centred around encoding the molecule | ||
solely based on node labelling, which allows for the generation of the complete adjacency matrix. | ||
This feature makes the considered architecture applicable to a broad range of atom adjacency reconstruction tasks. | ||
|
||
## Training | ||
|
||
Refer to the training procedure in the Jupyter notebook ```./training.ipynb```. | ||
Customize the procedure by adjusting the global variables in the second code cell. | ||
The main training function source code is in ```./training/train_model.py```. | ||
|
||
To train the model using Google Colab, extract the repository to Google Drive under ```./MyDrive```. | ||
|
||
## Evaluation | ||
|
||
For model evaluation, utilize ```./model_evaluation.ipynb``` with the pretrained model weights. | ||
Small example datasets for detailed model evaluation are provided in ```./example_dataset```. | ||
|
||
## Code examples | ||
|
||
Explore model usage and functionality in ```./structure_seer_code_examples.ipynb```, | ||
which includes illustrative examples. | ||
|
||
|
||
|
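
The "Unification of adjacency matrix representation" section of the README above can be illustrated with a small sketch. It assumes atoms are ordered by atomic number first and by shielding constant second, in ascending order; the exact key and direction used by the repository (for example behind the `shielding_sort` option) are not shown in this diff, and the function names and shielding values below are hypothetical.

```python
from typing import List, Sequence, Tuple


def canonical_order(
    atomic_numbers: Sequence[int], shieldings: Sequence[float]
) -> List[int]:
    """Return atom indices sorted by (atomic number, shielding constant)."""
    return sorted(
        range(len(atomic_numbers)),
        key=lambda i: (atomic_numbers[i], shieldings[i]),
    )


def permute_adjacency(
    adjacency: Sequence[Sequence[int]], order: Sequence[int]
) -> List[List[int]]:
    """Reorder rows and columns of an adjacency matrix to follow `order`."""
    return [[adjacency[i][j] for j in order] for i in order]


def unify(
    atomic_numbers: Sequence[int],
    shieldings: Sequence[float],
    adjacency: Sequence[Sequence[int]],
) -> Tuple[List[int], List[float], List[List[int]]]:
    """Bring node labels and adjacency matrix into the assumed canonical order."""
    order = canonical_order(atomic_numbers, shieldings)
    return (
        [atomic_numbers[i] for i in order],
        [shieldings[i] for i in order],
        permute_adjacency(adjacency, order),
    )


# The same C-C-O chain stored in two different atom orders
# (shielding values are made up for illustration).
first = unify([6, 6, 8], [170.0, 130.0, 300.0], [[0, 1, 0], [1, 0, 1], [0, 1, 0]])
second = unify([8, 6, 6], [300.0, 170.0, 130.0], [[0, 0, 1], [0, 0, 1], [1, 1, 0]])
print(first == second)
```

Both inputs describe the same connectivity, and after reordering they yield the same labels and the same matrix, which is the property the unified representation is meant to provide. Atoms with identical element and shielding would need an additional tie-breaking rule that this sketch does not address.
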
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
import logging | ||
import os | ||
|
||
from rdkit import Chem | ||
|
||
from utils import ( | ||
    is_successful_orca_run, | ||
    nmr_shielding_from_out_file, | ||
    orca_output_file_check, | ||
    read_sdf_compounds, | ||
) | ||
|
||
""" | ||
This script can be used to extract calculated | ||
shielding constants from .out files and append them to corresponding structures in the .sd file | ||
""" | ||
|
||
PATH_TO_SD = "../data/structures.sdf" | ||
INPUT_FOLDER = "../data" | ||
CALC_TYPE = "NMR" | ||
|
||
logging.basicConfig(format="%(levelname)s:%(message)s", level=logging.INFO) | ||
|
||
compounds = read_sdf_compounds(PATH_TO_SD) | ||
|
||
input_dir = os.listdir(INPUT_FOLDER) | ||
compound_nmrs = dict() | ||
|
||
# Check if folder contains corresponding .out file and ORCA terminated normally | ||
for folder in input_dir: | ||
    if orca_output_file_check( | ||
        path=INPUT_FOLDER, compound_id=folder, calc_type=CALC_TYPE | ||
    ): | ||
        logging.info(f"{input_dir.index(folder)} out of {len(input_dir)} processed") | ||
        # Parse NMR Shielding constants from .out file | ||
        nmr = nmr_shielding_from_out_file( | ||
            path_to_out_file=f"{INPUT_FOLDER}/{folder}/{folder}_{CALC_TYPE}.out" | ||
        ) | ||
        compound_nmrs[folder] = "; ".join([str(x) for x in nmr]) | ||
|
||
n = len(compound_nmrs.keys()) | ||
with Chem.SDWriter(f"{INPUT_FOLDER}/{n}_qm9_structures_HF-3c_shielding.sdf") as w: | ||
    for compound in compounds: | ||
        name = compound.GetProp("_Name") | ||
        if name in compound_nmrs.keys(): | ||
            compound.SetProp("Shielding", str(compound_nmrs[name])) | ||
            logging.info(f"Writing compound {compounds.index(compound)} of {n}") | ||
            w.write(compound) |
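
The script above stores the shielding constants of each compound as a single `"; "`-joined string in the `Shielding` property. A minimal sketch of that encoding and of a matching decoder follows. The helper names are hypothetical, and the sketch assumes the parsed values are plain floats; the return type of `nmr_shielding_from_out_file` is not shown in this diff.

```python
from typing import List


def encode_shielding(values: List[float]) -> str:
    """Join shielding constants with '; ', mirroring the script above."""
    return "; ".join(str(x) for x in values)


def decode_shielding(prop: str) -> List[float]:
    """Parse a ';'-separated property string back into a list of floats."""
    # float() tolerates the surrounding whitespace left by the '; ' separator;
    # empty tokens (for example from an empty property) are skipped.
    return [float(token) for token in prop.split(";") if token.strip()]


encoded = encode_shielding([31.5, 170.25, 0.0])
print(encoded)
print(decode_shielding(encoded))
```

Keeping the values in one text property makes the SD file easy to inspect by hand, at the cost of a parsing step whenever the data are loaded.
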
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,54 @@ | ||
import logging | ||
from collections import defaultdict | ||
|
||
import torch | ||
|
||
from utils import MolecularDataset | ||
|
||
""" | ||
This script was used to characterise the datasets | ||
""" | ||
|
||
logging.basicConfig(format="%(levelname)s:%(message)s", level=logging.INFO) | ||
|
||
dataset = MolecularDataset( | ||
    "./example_datasets/demo_compounds_qm9.sdf", | ||
    absolute_norm=True, | ||
    shielding_sort=True, | ||
) | ||
|
||
test_loader = torch.utils.data.DataLoader( | ||
    dataset, | ||
    batch_size=1, | ||
    shuffle=False, | ||
    drop_last=False, | ||
) | ||
|
||
atom_nums_count = defaultdict(int) | ||
bonds_count = defaultdict(int) | ||
|
||
for item in test_loader: | ||
    adj_matrix = item["adjacency_matrix"] | ||
|
||
    atoms_matrix = [ | ||
        item["atoms_matrix"][0], | ||
        item["atoms_matrix"][1], | ||
    ] | ||
|
||
    atoms_num = torch.count_nonzero(atoms_matrix[0], dim=-1).item() | ||
    num_bonds = torch.sum( | ||
        torch.count_nonzero(torch.argmax(adj_matrix, dim=3), dim=1), dim=1 | ||
    ).item() | ||
|
||
    if atoms_num in atom_nums_count.keys(): | ||
        atom_nums_count[atoms_num] += 1 | ||
    else: | ||
        atom_nums_count[atoms_num] = 1 | ||
|
||
    if num_bonds in bonds_count: | ||
        bonds_count[num_bonds] += 1 | ||
    else: | ||
        bonds_count[num_bonds] = 1 | ||
|
||
logging.info(f"Bonds \n {bonds_count}") | ||
logging.info(f"Atoms \n {atom_nums_count}") |
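
The counting logic of the characterisation script above can be sketched without PyTorch. The sketch assumes that a zero in the atom row marks padding and that a zero bond-class index means "no bond", which is what the `count_nonzero` and `argmax` calls in the script suggest but this diff does not confirm. All names below are hypothetical. Note that if the matrix is stored symmetrically, counting non-zero entries counts every bond twice.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple

# (atomic numbers padded with zeros, matrix of bond-class indices)
Molecule = Tuple[Sequence[int], Sequence[Sequence[int]]]


def count_atoms(atomic_numbers: Sequence[int]) -> int:
    """Count non-padding entries (zero is assumed to mean padding)."""
    return sum(1 for z in atomic_numbers if z != 0)


def count_nonzero_entries(bond_classes: Sequence[Sequence[int]]) -> int:
    """Count non-zero entries in a matrix of bond-class indices."""
    return sum(1 for row in bond_classes for value in row if value != 0)


def characterise(molecules: Iterable[Molecule]) -> Tuple[Counter, Counter]:
    """Build histograms of atom counts and non-zero adjacency entries."""
    atom_hist: Counter = Counter()
    entry_hist: Counter = Counter()
    for atomic_numbers, bond_classes in molecules:
        atom_hist[count_atoms(atomic_numbers)] += 1
        entry_hist[count_nonzero_entries(bond_classes)] += 1
    return atom_hist, entry_hist
```

A `Counter` increments missing keys from zero, so no membership check is needed before updating a histogram bin.
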
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
import logging | ||
|
||
from tqdm import tqdm | ||
|
||
from utils import batch, generate_input_file, read_sdf_compounds | ||
|
||
""" | ||
Generates a directory of folders, each containing the ORCA input file | ||
for one compound from the specified .sd file | ||
""" | ||
|
||
# Indicate path to the .sd file, containing compounds | ||
IN_BATCHES = True | ||
BATCH_SIZE = 45000 | ||
PATH_TO_STRUCTURES_SD = "./example_datasets/demo_compounds_pubchem.sdf" | ||
INPUT_FOLDER_PATH = "./Input_folder_path" | ||
CALCULATION_TYPE = "NMR" | ||
|
||
logging.basicConfig(format="%(levelname)s:%(message)s", level=logging.INFO) | ||
|
||
total_compounds = read_sdf_compounds(path=PATH_TO_STRUCTURES_SD) | ||
|
||
if IN_BATCHES: | ||
    batches = [] | ||
    for single_batch in batch(total_compounds, BATCH_SIZE): | ||
        batches.append(single_batch[1]) | ||
    for i in range(len(batches)): | ||
        logging.info(f"Preparing batch {i+1}") | ||
        for j in tqdm(range(len(batches[i]))): | ||
            compound = batches[i][j] | ||
            generate_input_file( | ||
                compound=compound, | ||
                calc_type=CALCULATION_TYPE, | ||
                save_path=f"{INPUT_FOLDER_PATH}/{len(batches[i])}_{CALCULATION_TYPE}_batch_{i+1}", | ||
            ) | ||
|
||
else: | ||
    for compound in total_compounds: | ||
        generate_input_file( | ||
            compound=compound, calc_type=CALCULATION_TYPE, save_path=INPUT_FOLDER_PATH | ||
        ) |
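
The script above takes element `[1]` of every item yielded by `batch`, which suggests that the helper yields pairs whose second element is the chunk of compounds. The implementation in `utils` is not part of this excerpt, so the following is only a sketch of one helper that would be compatible with that usage, assuming it yields `(batch_index, chunk)` tuples.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def batch(items: Sequence[T], size: int) -> Iterator[Tuple[int, List[T]]]:
    """Yield (batch_index, chunk) pairs with at most `size` items per chunk.

    Hypothetical stand-in for `utils.batch`; the real signature may differ.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    for index, start in enumerate(range(0, len(items), size)):
        yield index, list(items[start : start + size])


# Collect only the chunks, as the script does with single_batch[1].
chunks = [single_batch[1] for single_batch in batch(list(range(5)), 2)]
print(chunks)
```

With such a helper the last chunk may be shorter than `size`, which matches the script's use of `len(batches[i])` in the output folder name.
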