Merge pull request #1 from quantori/dev
Membrizard authored Jan 11, 2024
2 parents 462c5bc + 10ee238 commit 23b13d1
Showing 46 changed files with 131,738 additions and 2 deletions.
10 changes: 10 additions & 0 deletions .gitignore
@@ -0,0 +1,10 @@
data/*
.DS_Store
**/.DS_Store
.idea/
__pycache__/
wandb/*
colab_training/*
/.ipynb_checkpoints/
/venv
/training_log/
22 changes: 22 additions & 0 deletions CITATION.cff
@@ -0,0 +1,22 @@
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Sapegin"
    given-names: "Denis Andzheevich"
    orcid: "https://orcid.org/0000-0002-1446-6288"
title: "Structure Seer"
version: 1.0
date-released: 2022-07-18
url: "https://github.com/quantori/structure-seer"
preferred-citation:
  type: article
  authors:
    - family-names: "Sapegin"
      given-names: "Denis A"
      orcid: "https://orcid.org/0000-0002-1446-6288"
    - family-names: "Bear"
      given-names: "Joseph C"
  doi: "10.1039/D3DD00178D"
  journal: "Digital Discovery"
  title: "Structure Seer – a machine learning model for chemical structure elucidation from node labelling of a molecular graph"
  year: 2024
76 changes: 76 additions & 0 deletions CODE_OF_CONDUCT.md
@@ -0,0 +1,76 @@
# Contributor Covenant Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to make participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies within all project spaces, and it also applies when
an individual is representing the project or its community in public spaces.
Examples of representing a project or community include using an official
project e-mail address, posting via an official social media account, or acting
as an appointed representative at an online or offline event. Representation of
a project may be further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq
15 changes: 15 additions & 0 deletions DEVNOTES.md
@@ -0,0 +1,15 @@
## Installation
```
pip install -r requirements.txt
```
The SCF calculations were performed using [ORCA](https://orcaforum.kofo.mpg.de/app.php/portal).
To use quantum-chemistry calculations, install ORCA and specify its global path
in a corresponding environment variable in ```./data_preparation/parallel_dft_calculation.py```.
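
As a minimal sketch of the environment-variable approach described above, the snippet below reads the ORCA executable path from the environment and fails early when it is missing. The variable name `ORCA_PATH` and the helper `resolve_orca_path` are assumptions made for illustration; check `parallel_dft_calculation.py` for the name actually used.

```python
import os


def resolve_orca_path(env_var: str = "ORCA_PATH") -> str:
    """Return the ORCA executable path stored in an environment variable.

    `ORCA_PATH` is a hypothetical variable name used for illustration only.
    """
    path = os.environ.get(env_var)
    if not path:
        raise RuntimeError(
            f"ORCA not configured: set {env_var} to the full path of the executable"
        )
    return path
```

Failing early with an explicit message avoids launching a long batch of calculations that would all abort because the executable cannot be found.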

## Repository structure

- Source code for models is located in ```./models```.
- Model weights trained on QM9 and PubChem Datasets are stored in ```./weights```.
- Utility classes and functions are provided in ```./utils```.
- Scripts for data preparation and SCF calculations can be found in ```./data_preparation```.
- Parallel jobs for ORCA calculations can be run from ```./data_preparation/parallel_dft_calculation.py```.
2 changes: 1 addition & 1 deletion LICENSE
@@ -186,7 +186,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

-   Copyright [yyyy] [name of copyright owner]
+   Copyright 2023 Quantori, Denis Sapegin

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
14 changes: 14 additions & 0 deletions NOTICE.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
Quantori Implementation for Structure Seer model
Copyright (C) 2023 Quantori, Denis Sapegin
This product was designed for academic purposes by Denis Sapegin.
The quantum mechanics calculations necessary for dataset preparation depend on the
ORCA (https://orcaforum.kofo.mpg.de/app.php/portal) software, developed by the Max-Planck-Institut für Kohlenforschung,
under the End User License Agreement (EULA) for the ORCA software (https://orcaforum.kofo.mpg.de/app.php/portal).


In addition, this product contains dependencies on files licensed under:

Apache License, Version 2.0 http://www.apache.org/licenses/LICENSE-2.0
The MIT License https://opensource.org/licenses/MIT
The 3-Clause BSD License https://opensource.org/licenses/BSD-3-Clause
GNU Affero General Public License v3.0 https://www.gnu.org/licenses/agpl-3.0.en.html
69 changes: 68 additions & 1 deletion README.md
@@ -1 +1,68 @@
# structure-seer
[![DOI:10.1039/D3DD00178D](http://img.shields.io/badge/DOI-10.1039/D3DD00178D-ebe534.svg)](https://doi.org/10.1039/D3DD00178D)

![PyTorch](https://img.shields.io/badge/PyTorch-%23EE4C2C.svg?style=for-the-badge&logo=PyTorch&logoColor=white)
# Structure Seer

This repository contains the implementation, training, and evaluation code for the Structure Seer model,
which reconstructs the adjacency of a molecular graph from the labelling of its nodes.
The model architecture is described in detail in:
[Structure Seer - a machine learning model for chemical structure elucidation
from node labelling of a molecular graph, Digital Discovery, 2023](https://doi.org/10.1039/D3DD00178D)

## Datasets

The repository does not contain the original datasets used for training.
- Small example datasets for detailed model evaluation are provided in ```./example_datasets```
- Model weights trained on QM9 and PubChem Datasets are stored in ```./weights```

## Abstract

The repository contains the implementation of a novel graph-convolution-based machine-learning model that
provides a quantitative probabilistic prediction of the connectivity of atoms, based on the
elemental composition of the molecule and a list of atom-attributed isotropic shielding
constants. The approach has significant potential for scalability, as it can draw on the vast amount
of information on known chemical structures during training. The model architecture allows for
direct structure reconstruction through prediction of the molecular graph adjacency based solely on the
labelling of its nodes, which potentially allows it to handle molecules of any size and composition
(provided an appropriate training dataset is available) without a significant increase in the computational resources required.

## Key approaches

### Unification of adjacency matrix representation

The primary challenge in generating the adjacency matrix is that it is not invariant for a given graph: it depends on the order in which the nodes are listed.
For a graph with G nodes, there are G! node orderings and therefore up to G! adjacency matrices that describe the same connectivity.
To tackle this issue, the adjacency matrix representation needs to be unified. Typically, in the machine-readable
representation of a molecule, its atoms are stored in depth-first traversal order.
While this order contains information about the stored structure, it cannot be easily reconstructed when only
the elemental composition of the molecule and the isotropic shielding constant for each atom are known.
Since the shielding constant provides a unique characterization of an atom's chemical environment, it can be
employed to standardize the representation of the adjacency matrix in conjunction with element information.
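
The idea can be sketched as follows: atoms are sorted by element and then by shielding constant, and the rows and columns of the adjacency matrix are permuted accordingly, so that any input ordering of the same molecule yields the same matrix. The sort key (atomic number first, shielding ascending second), the function names, and the toy shielding values are assumptions for illustration, not necessarily the ordering implemented in the repository.

```python
from typing import List, Sequence


def canonical_order(
    atomic_numbers: Sequence[int], shieldings: Sequence[float]
) -> List[int]:
    """Return atom indices sorted by element, then by shielding constant."""
    return sorted(
        range(len(atomic_numbers)),
        key=lambda i: (atomic_numbers[i], shieldings[i]),
    )


def permute_adjacency(
    adjacency: Sequence[Sequence[int]], order: Sequence[int]
) -> List[List[int]]:
    """Reorder rows and columns of an adjacency matrix with the same permutation."""
    return [[adjacency[i][j] for j in order] for i in order]


# Toy example: atoms C, O, C with made-up shielding values;
# matrix entries are bond orders.
atomic_numbers = [6, 8, 6]
shieldings = [150.2, 300.5, 120.7]
adjacency = [
    [0, 1, 0],
    [1, 0, 2],
    [0, 2, 0],
]

order = canonical_order(atomic_numbers, shieldings)  # [2, 0, 1]
unified = permute_adjacency(adjacency, order)
```

Because the permutation is derived only from node labels (element and shielding), the same ordering can be recovered at prediction time, when the connectivity is still unknown.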

### Generic adjacency matrix

The architecture of the Structure Seer model bears similarities to other GCN-based models used for diverse tasks
involving molecular graphs. However, its distinctive design is centred around encoding the molecule
solely based on node labelling, which allows for the generation of the complete adjacency matrix.
This feature makes the considered architecture applicable to a broad range of atom adjacency reconstruction tasks.

## Training

Refer to the training procedure in the Jupyter notebook ```./training.ipynb```.
Customize the procedure by adjusting the global variables in the second code cell.
The main training function source code is in ```./training/train_model.py```.

To train the model using Google Colab, extract the repository to Google Drive under ```./MyDrive```.

## Evaluation

For model evaluation, use ```./model_evaluation.ipynb``` with the pretrained model weights.
Small example datasets for detailed model evaluation are provided in ```./example_datasets```.

## Code examples

Explore model usage and functionality in ```./structure_seer_code_examples.ipynb```,
which includes illustrative examples.



1 change: 1 addition & 0 deletions data_preparation/__init__.py
@@ -0,0 +1 @@

48 changes: 48 additions & 0 deletions data_preparation/append_shielding_to_sd.py
@@ -0,0 +1,48 @@
import logging
import os

from rdkit import Chem

from utils import (
    is_successful_orca_run,
    nmr_shielding_from_out_file,
    orca_output_file_check,
    read_sdf_compounds,
)

"""
This script can be used to extract calculated
shielding constants from .out files and append them to corresponding structures in the .sd file
"""

PATH_TO_SD = "../data/structures.sdf"
INPUT_FOLDER = "../data"
CALC_TYPE = "NMR"

logging.basicConfig(format="%(levelname)s:%(message)s", level=logging.INFO)

compounds = read_sdf_compounds(PATH_TO_SD)

input_dir = os.listdir(INPUT_FOLDER)
compound_nmrs = dict()

# Check if folder contains corresponding .out file and ORCA terminated normally
for i, folder in enumerate(input_dir):
    if orca_output_file_check(
        path=INPUT_FOLDER, compound_id=folder, calc_type=CALC_TYPE
    ):
        logging.info(f"{i} out of {len(input_dir)} processed")
        # Parse NMR Shielding constants from .out file
        nmr = nmr_shielding_from_out_file(
            path_to_out_file=f"{INPUT_FOLDER}/{folder}/{folder}_{CALC_TYPE}.out"
        )
        compound_nmrs[folder] = "; ".join([str(x) for x in nmr])

n = len(compound_nmrs)
with Chem.SDWriter(f"{INPUT_FOLDER}/{n}_qm9_structures_HF-3c_shielding.sdf") as w:
    for i, compound in enumerate(compounds):
        name = compound.GetProp("_Name")
        if name in compound_nmrs:
            # Attach the parsed shielding constants as an SD property
            compound.SetProp("Shielding", str(compound_nmrs[name]))
            logging.info(f"Writing compound {i} of {n}")
            w.write(compound)
54 changes: 54 additions & 0 deletions data_preparation/dataset_characterisation.py
@@ -0,0 +1,54 @@
import logging
from collections import defaultdict

import torch

from utils import MolecularDataset

"""
Current script was used to perform characterisation of the datasets
"""

logging.basicConfig(format="%(levelname)s:%(message)s", level=logging.INFO)

dataset = MolecularDataset(
    "./example_datasets/demo_compounds_qm9.sdf",
    absolute_norm=True,
    shielding_sort=True,
)

test_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=1,
    shuffle=False,
    drop_last=False,
)

atom_nums_count = defaultdict(int)
bonds_count = defaultdict(int)

for item in test_loader:
    adj_matrix = item["adjacency_matrix"]

    atoms_matrix = [
        item["atoms_matrix"][0],
        item["atoms_matrix"][1],
    ]

    atoms_num = torch.count_nonzero(atoms_matrix[0], dim=-1).item()
    num_bonds = torch.sum(
        torch.count_nonzero(torch.argmax(adj_matrix, dim=3), dim=1), dim=1
    ).item()

    # defaultdict(int) starts missing keys at 0, so no membership check is needed
    atom_nums_count[atoms_num] += 1
    bonds_count[num_bonds] += 1

logging.info(f"Bonds \n {bonds_count}")
logging.info(f"Atoms \n {atom_nums_count}")
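
The `num_bonds` expression above takes the argmax over the last (bond-class) dimension of the adjacency tensor and counts the entries whose class is nonzero. A plain-Python sketch of the same counting for a single molecule is given below, assuming that class 0 means "no bond" and that each entry is a one-hot list over bond classes. The function name is hypothetical. If the adjacency matrix is symmetric, every bond is counted twice, once per direction.

```python
from typing import Sequence


def count_bonded_entries(adjacency_one_hot: Sequence[Sequence[Sequence[int]]]) -> int:
    """Count matrix entries whose most likely bond class is not class 0."""
    count = 0
    for row in adjacency_one_hot:
        for one_hot in row:
            # Index of the largest value, i.e. argmax over bond classes
            bond_class = max(range(len(one_hot)), key=one_hot.__getitem__)
            if bond_class != 0:
                count += 1
    return count


# Two atoms joined by one single bond (class 1), stored symmetrically
example = [
    [[1, 0, 0], [0, 1, 0]],
    [[0, 1, 0], [1, 0, 0]],
]
entries = count_bonded_entries(example)
```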
41 changes: 41 additions & 0 deletions data_preparation/generate_dft_input.py
@@ -0,0 +1,41 @@
import logging

from tqdm import tqdm

from utils import batch, generate_input_file, read_sdf_compounds

"""
Generates a directory with folders containing corresponding ORCA input file
for each compound from .sd file specified
"""

# Indicate path to the .sd file, containing compounds
IN_BATCHES = True
BATCH_SIZE = 45000
PATH_TO_STRUCTURES_SD = "./example_datasets/demo_compounds_pubchem.sdf"
INPUT_FOLDER_PATH = "./Input_folder_path"
CALCULATION_TYPE = "NMR"

logging.basicConfig(format="%(levelname)s:%(message)s", level=logging.INFO)

total_compounds = read_sdf_compounds(path=PATH_TO_STRUCTURES_SD)

if IN_BATCHES:
    # batch() yields tuples; the compounds of each batch are at index 1
    batches = [single_batch[1] for single_batch in batch(total_compounds, BATCH_SIZE)]
    for i, compounds in enumerate(batches, start=1):
        logging.info(f"Preparing batch {i}")
        for compound in tqdm(compounds):
            generate_input_file(
                compound=compound,
                calc_type=CALCULATION_TYPE,
                save_path=f"{INPUT_FOLDER_PATH}/{len(compounds)}_{CALCULATION_TYPE}_batch_{i}",
            )

else:
    for compound in total_compounds:
        generate_input_file(
            compound=compound, calc_type=CALCULATION_TYPE, save_path=INPUT_FOLDER_PATH
        )
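
The `batch` helper from `utils` is not part of this excerpt. The script only shows that each yielded item is indexed with `[1]` to obtain the compounds. A minimal sketch of a helper compatible with that usage, assuming it yields `(batch_index, chunk)` pairs, might look like this; the real implementation may differ.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def batch(items: Sequence[T], size: int) -> Iterator[Tuple[int, List[T]]]:
    """Yield (batch_index, chunk) pairs with at most `size` items per chunk."""
    if size <= 0:
        raise ValueError("size must be positive")
    for index, start in enumerate(range(0, len(items), size)):
        yield index, list(items[start : start + size])
```

With `BATCH_SIZE = 45000`, a dataset of 100,000 compounds would be split into three chunks, the last one holding the remaining 10,000.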