Merge pull request #1 from quantori/dev

Raw code PR
quantori · Jan 11, 2024 · 23b13d1
2 parents 462c5bc + 10ee238
Showing 46 changed files with 131,738 additions and 2 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,10 @@
+data/*
+.DS_Store
+**/.DS_Store
+.idea/
+__pycache__/
+wandb/*
+colab_training/*
+/.ipynb_checkpoints/
+/venv
+/training_log/
diff --git a/CITATION.cff b/CITATION.cff
@@ -0,0 +1,22 @@
+cff-version: 1.2.0
+message: "If you use this software, please cite it as below."
+authors:
+- family-names: "Sapegin"
+  given-names: "Denis Andzheevich"
+  orcid: "https://orcid.org/0000-0002-1446-6288"
+title: "Structure Seer"
+version: 1.0
+date-released: 2022-07-18
+url: "https://github.com/quantori/structure-seer"
+preferred-citation:
+  type: article
+  authors:
+  - family-names: "Sapegin"
+    given-names: "Denis A"
+    orcid: "https://orcid.org/0000-0002-1446-6288"
+  - family-names: "Bear"
+    given-names: "Joseph C"
+  doi: "10.1039/D3DD00178D"
+  journal: "Digital Discovery"
+  title: "Structure Seer – a machine learning model for chemical structure elucidation from node labelling of a molecular graph"
+  year: 2024
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
@@ -0,0 +1,76 @@
+# Contributor Covenant Code of Conduct
+
+## Our Pledge
+
+In the interest of fostering an open and welcoming environment, we as
+contributors and maintainers pledge to make participation in our project and
+our community a harassment-free experience for everyone, regardless of age, body
+size, disability, ethnicity, sex characteristics, gender identity and expression,
+level of experience, education, socio-economic status, nationality, personal
+appearance, race, religion, or sexual identity and orientation.
+
+## Our Standards
+
+Examples of behavior that contributes to creating a positive environment
+include:
+
+* Using welcoming and inclusive language
+* Being respectful of differing viewpoints and experiences
+* Gracefully accepting constructive criticism
+* Focusing on what is best for the community
+* Showing empathy towards other community members
+
+Examples of unacceptable behavior by participants include:
+
+* The use of sexualized language or imagery and unwelcome sexual attention or
+  advances
+* Trolling, insulting/derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' private information, such as a physical or electronic
+  address, without explicit permission
+* Other conduct which could reasonably be considered inappropriate in a
+  professional setting
+
+## Our Responsibilities
+
+Project maintainers are responsible for clarifying the standards of acceptable
+behavior and are expected to take appropriate and fair corrective action in
+response to any instances of unacceptable behavior.
+
+Project maintainers have the right and responsibility to remove, edit, or
+reject comments, commits, code, wiki edits, issues, and other contributions
+that are not aligned to this Code of Conduct, or to ban temporarily or
+permanently any contributor for other behaviors that they deem inappropriate,
+threatening, offensive, or harmful.
+
+## Scope
+
+This Code of Conduct applies within all project spaces, and it also applies when
+an individual is representing the project or its community in public spaces.
+Examples of representing a project or community include using an official
+project e-mail address, posting via an official social media account, or acting
+as an appointed representative at an online or offline event. Representation of
+a project may be further defined and clarified by project maintainers.
+
+## Enforcement
+
+Instances of abusive, harassing, or otherwise unacceptable behavior may be
+reported by contacting the project team. All
+complaints will be reviewed and investigated and will result in a response that
+is deemed necessary and appropriate to the circumstances. The project team is
+obligated to maintain confidentiality with regard to the reporter of an incident.
+Further details of specific enforcement policies may be posted separately.
+
+Project maintainers who do not follow or enforce the Code of Conduct in good
+faith may face temporary or permanent repercussions as determined by other
+members of the project's leadership.
+
+## Attribution
+
+This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
+available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
+
+[homepage]: https://www.contributor-covenant.org
+
+For answers to common questions about this code of conduct, see
+https://www.contributor-covenant.org/faq
diff --git a/DEVNOTES.md b/DEVNOTES.md
@@ -0,0 +1,15 @@
+## Installation
+```
+pip install -r requirements.txt
+```
+The SCF calculations were performed using [ORCA](https://orcaforum.kofo.mpg.de/app.php/portal).
+To run the quantum-chemistry calculations, install ORCA and specify its global path
+in the corresponding environment variable in ```./data_preparation/parallel_dft_calculation.py```.
+
+## Repository structure
+
+- Source code for models is located in ```./models```.
+- Model weights trained on QM9 and PubChem Datasets are stored in ```./weights```.
+- Utility classes and functions are provided in ```./utils```.
+- Scripts for data preparation and SCF calculations can be found in ```./data_preparation```.
+- Parallel jobs for ORCA calculations can be run from ```./data_preparation/parallel_dft_calculation.py```.
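
The DEVNOTES above ask for the ORCA path to be set in a variable inside `parallel_dft_calculation.py`, a file not shown in this part of the diff. A minimal sketch of how such a lookup could work is given here; the variable name `ORCA_PATH` and the helper `resolve_orca_path` are hypothetical and are not taken from the repository.

```python
import os
import shutil
from typing import Mapping, Optional


def resolve_orca_path(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the ORCA executable path, or None if it cannot be found.

    ORCA_PATH is a hypothetical variable name used only for this sketch.
    """
    configured = env.get("ORCA_PATH")
    if configured:
        return configured
    # Fall back to searching the directories listed in PATH.
    return shutil.which("orca")


# Example with an explicit mapping instead of the real environment.
example_path = resolve_orca_path({"ORCA_PATH": "/opt/orca/orca"})
```

Passing the environment as a mapping keeps the lookup easy to exercise without touching the real process environment.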
diff --git a/LICENSE b/LICENSE
@@ -186,7 +186,7 @@
       same "printed page" as the copyright notice for easier
       identification within third-party archives.
 
-   Copyright [yyyy] [name of copyright owner]
+   Copyright 2023 Quantori, Denis Sapegin
 
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.

diff --git a/NOTICE.txt b/NOTICE.txt
@@ -0,0 +1,14 @@
+Quantori Implementation for Structure Seer model
+Copyright (C) 2023 Quantori, Denis Sapegin
+This product was designed for academic purposes by Denis Sapegin.
+The quantum mechanics calculations necessary for dataset preparation depend on the
+ORCA (https://orcaforum.kofo.mpg.de/app.php/portal) software, developed by the Max-Planck-Institut für Kohlenforschung,
+under the End User License Agreement (EULA) for the ORCA software (https://orcaforum.kofo.mpg.de/app.php/portal).
+
+
+In addition, this product contains dependencies on files licensed under:
+
+Apache License, Version 2.0 http://www.apache.org/licenses/LICENSE-2.0
+The MIT License https://opensource.org/licenses/MIT
+The 3-Clause BSD License https://opensource.org/licenses/BSD-3-Clause
+GNU Affero General Public License v3.0 https://www.gnu.org/licenses/agpl-3.0.en.html
diff --git a/README.md b/README.md
@@ -1 +1,68 @@
-# structure-seer
+[![DOI:10.1039/D3DD00178D](http://img.shields.io/badge/DOI-10.1039/D3DD00178D-ebe534.svg)](https://doi.org/10.1039/D3DD00178D)
+
+![PyTorch](https://img.shields.io/badge/PyTorch-%23EE4C2C.svg?style=for-the-badge&logo=PyTorch&logoColor=white)
+# Structure Seer 
+
+This repository contains the implementation, training, and evaluation code for the Structure Seer model, which
+reconstructs the adjacency of a molecular graph from the labelling of its nodes.
+The model architecture is described in detail in:
+[Structure Seer - a machine learning model for chemical structure elucidation
+from node labelling of a molecular graph, Digital Discovery, 2023](https://doi.org/10.1039/D3DD00178D)
+
+## Datasets
+
+The repository does not contain the original datasets used for training.
+- Small example datasets for detailed model evaluation are provided in ```./example_dataset```.
+- Model weights trained on QM9 and PubChem Datasets are stored in ```./weights```.
+
+## Abstract
+
+The repository contains the implementation of a novel graph-convolution-based machine-learning model that
+provides a quantitative probabilistic prediction of the connectivity of the atoms from the
+elemental composition of the molecule and a list of atom-attributed isotropic shielding
+constants. The approach holds significant potential for scalability, as it can harness vast amounts
+of information on known chemical structures for the model's learning process. The model architecture allows for
+direct structure reconstruction through prediction of the molecular graph adjacency based solely on the
+labelling of its nodes, which potentially allows dealing with molecules of any size and composition
+(provided an appropriate training dataset is available) without a significant increase in the computational resources required.
+
+## Key approaches
+
+### Unification of adjacency matrix representation
+
+The primary challenge in generating the adjacency matrix is that it is not an invariant of a given graph.
+For a graph with G nodes, there are up to G! adjacency matrices that describe its connectivity, one per ordering of the nodes.
+To tackle this issue, the adjacency matrix representation needs to be unified. Typically, in the machine-readable
+representation of a molecule, its atoms are stored in depth-first tree traversal order.
+While this order contains information about the stored structure, it cannot be easily reconstructed when only
+the elemental composition of the molecule and the isotropic shielding constant for each atom are known.
+Since the shielding constant provides a unique characterization of an atom's chemical environment, it can be
+employed to standardize the representation of the adjacency matrix in conjunction with element information.
+
+### Generic adjacency matrix
+
+The architecture of the Structure Seer model bears similarities to other GCN-based models used for diverse tasks
+involving molecular graphs. However, its distinctive design is centred around encoding the molecule
+solely based on node labelling, which allows for the generation of the complete adjacency matrix.
+This feature makes the considered architecture applicable to a broad range of atom adjacency reconstruction tasks.
+
+## Training
+
+The training procedure is demonstrated in the Jupyter notebook ```./training.ipynb```.
+Customize the procedure by adjusting the global variables in the second code cell.
+The main training function source code is in ```./training/train_model.py```.
+
+To train the model using Google Colab, extract the repository to Google Drive under ```./MyDrive```.
+
+## Evaluation
+
+For model evaluation, use ```./model_evaluation.ipynb``` with the pretrained model weights.
+Small example datasets for detailed model evaluation are provided in ```./example_dataset```.
+
+## Code examples
+
+Explore model usage and functionality in ```./structure_seer_code_examples.ipynb```,
+which includes illustrative examples.
+
+
+
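
The "Unification of adjacency matrix representation" section of the README above describes standardizing the atom order using element information together with shielding constants. A minimal sketch of that idea follows, assuming the atoms are sorted by atomic number first and by shielding constant second, both ascending. The exact sort key used by the model is not shown in this diff, the function names are hypothetical, and the shielding values are made up for illustration.

```python
from typing import List, Sequence


def canonical_order(elements: Sequence[int], shieldings: Sequence[float]) -> List[int]:
    """Return atom indices sorted by atomic number, then by shielding constant."""
    return sorted(range(len(elements)), key=lambda i: (elements[i], shieldings[i]))


def permute_adjacency(
    adjacency: Sequence[Sequence[int]], order: Sequence[int]
) -> List[List[int]]:
    """Reorder the rows and columns of an adjacency matrix to follow `order`."""
    return [[adjacency[i][j] for j in order] for i in order]


# Formaldehyde (H2C=O) stored in two different atom orders.
# Matrix entries are bond orders; shielding values are illustrative only.
elements_a = [6, 8, 1, 1]  # C, O, H, H
shieldings_a = [10.0, -300.0, 22.0, 22.5]
adjacency_a = [
    [0, 2, 1, 1],
    [2, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]

elements_b = [1, 8, 6, 1]  # H, O, C, H
shieldings_b = [22.5, -300.0, 10.0, 22.0]
adjacency_b = [
    [0, 0, 1, 0],
    [0, 0, 2, 0],
    [1, 2, 0, 1],
    [0, 0, 1, 0],
]

canonical_a = permute_adjacency(adjacency_a, canonical_order(elements_a, shieldings_a))
canonical_b = permute_adjacency(adjacency_b, canonical_order(elements_b, shieldings_b))
```

Both input orders yield the same matrix after reordering, which is the property the unified representation is meant to provide. Atoms with identical element and shielding values would remain interchangeable under this key.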
diff --git a/data_preparation/__init__.py b/data_preparation/__init__.py
@@ -0,0 +1 @@
+
diff --git a/data_preparation/append_shielding_to_sd.py b/data_preparation/append_shielding_to_sd.py
@@ -0,0 +1,48 @@
+import logging
+import os
+
+from rdkit import Chem
+
+from utils import (
+    is_successful_orca_run,
+    nmr_shielding_from_out_file,
+    orca_output_file_check,
+    read_sdf_compounds,
+)
+
+"""
+This script can be used to extract calculated
+shielding constants from .out files and append them to corresponding structures in the .sd file
+"""
+
+PATH_TO_SD = "../data/structures.sdf"
+INPUT_FOLDER = "../data"
+CALC_TYPE = "NMR"
+
+logging.basicConfig(format="%(levelname)s:%(message)s", level=logging.INFO)
+
+compounds = read_sdf_compounds(PATH_TO_SD)
+
+input_dir = os.listdir(INPUT_FOLDER)
+compound_nmrs = dict()
+
+# Check if folder contains corresponding .out file and ORCA terminated normally
+for i, folder in enumerate(input_dir, start=1):
+    if orca_output_file_check(
+        path=INPUT_FOLDER, compound_id=folder, calc_type=CALC_TYPE
+    ):
+        logging.info(f"{i} out of {len(input_dir)} processed")
+        # Parse NMR Shielding constants from .out file
+        nmr = nmr_shielding_from_out_file(
+            path_to_out_file=f"{INPUT_FOLDER}/{folder}/{folder}_{CALC_TYPE}.out"
+        )
+        compound_nmrs[folder] = "; ".join([str(x) for x in nmr])
+
+n = len(compound_nmrs)
+with Chem.SDWriter(f"{INPUT_FOLDER}/{n}_qm9_structures_HF-3c_shielding.sdf") as w:
+    written = 0
+    for compound in compounds:
+        name = compound.GetProp("_Name")
+        if name in compound_nmrs:
+            # Values in compound_nmrs are already "; "-joined strings
+            compound.SetProp("Shielding", compound_nmrs[name])
+            written += 1
+            logging.info(f"Writing compound {written} of {n}")
+            w.write(compound)
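
The script above stores the shielding constants as a single `"; "`-joined string in the `Shielding` property. A minimal sketch of the matching round trip is shown here, assuming the parsed values are plain numbers; `nmr_shielding_from_out_file` is not shown in this diff, and both helper names below are hypothetical.

```python
from typing import List


def format_shielding(values: List[float]) -> str:
    """Mirror the join used by the script: str() of each value, "; " between."""
    return "; ".join(str(x) for x in values)


def parse_shielding(prop: str) -> List[float]:
    """Split a "; "-joined shielding string back into floats."""
    # float() ignores surrounding whitespace, so splitting on ";" is enough.
    return [float(value) for value in prop.split(";") if value.strip()]


encoded = format_shielding([31.25, 180.5, -42.0])
decoded = parse_shielding(encoded)
```

A reader of the resulting .sdf file would apply something like `parse_shielding` to the value returned by `GetProp("Shielding")`.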
diff --git a/data_preparation/dataset_characterisation.py b/data_preparation/dataset_characterisation.py
@@ -0,0 +1,54 @@
+import logging
+from collections import defaultdict
+
+import torch
+
+from utils import MolecularDataset
+
+"""
+This script was used to characterise the datasets
+"""
+
+logging.basicConfig(format="%(levelname)s:%(message)s", level=logging.INFO)
+
+dataset = MolecularDataset(
+    "./example_datasets/demo_compounds_qm9.sdf",
+    absolute_norm=True,
+    shielding_sort=True,
+)
+
+test_loader = torch.utils.data.DataLoader(
+    dataset,
+    batch_size=1,
+    shuffle=False,
+    drop_last=False,
+)
+
+atom_nums_count = defaultdict(int)
+bonds_count = defaultdict(int)
+
+for item in test_loader:
+    adj_matrix = item["adjacency_matrix"]
+
+    atoms_matrix = [
+        item["atoms_matrix"][0],
+        item["atoms_matrix"][1],
+    ]
+
+    atoms_num = torch.count_nonzero(atoms_matrix[0], dim=-1).item()
+    num_bonds = torch.sum(
+        torch.count_nonzero(torch.argmax(adj_matrix, dim=3), dim=1), dim=1
+    ).item()
+
+    # defaultdict(int) starts missing keys at 0, so no membership check is needed
+    atom_nums_count[atoms_num] += 1
+    bonds_count[num_bonds] += 1
+
+logging.info(f"Bonds \n {bonds_count}")
+logging.info(f"Atoms \n {atom_nums_count}")
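
The script above derives `num_bonds` by counting the non-zero cells of the bond-type matrix obtained with `argmax`. The sketch below, in plain Python, illustrates what that count means, assuming that index 0 of the last axis encodes "no bond". Whether `MolecularDataset` stores a symmetric or a triangular matrix is not visible in this diff; if the matrix is symmetric, counting all non-zero cells gives twice the number of bonds. The helper names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence

BondMatrix = Sequence[Sequence[int]]


def count_nonzero_entries(bond_orders: BondMatrix) -> int:
    """Count every non-zero cell of the full matrix."""
    return sum(1 for row in bond_orders for value in row if value != 0)


def count_bonds(bond_orders: BondMatrix) -> int:
    """Count non-zero cells above the diagonal, so each bond is counted once."""
    n = len(bond_orders)
    return sum(
        1 for i in range(n) for j in range(i + 1, n) if bond_orders[i][j] != 0
    )


# Symmetric bond-order matrices for water (O, H, H) and carbon dioxide (C, O, O).
water = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
carbon_dioxide = [[0, 2, 2], [2, 0, 0], [2, 0, 0]]

molecules: List[BondMatrix] = [water, carbon_dioxide]
entries_histogram = Counter(count_nonzero_entries(m) for m in molecules)
bonds_histogram = Counter(count_bonds(m) for m in molecules)
```

For these symmetric examples the full-matrix count is 4 per molecule while the number of bonds is 2, so a histogram built from the full-matrix count would need halving before being read as bond counts.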
diff --git a/data_preparation/generate_dft_input.py b/data_preparation/generate_dft_input.py
@@ -0,0 +1,41 @@
+import logging
+
+from tqdm import tqdm
+
+from utils import batch, generate_input_file, read_sdf_compounds
+
+"""
+Generates a directory with one folder per compound from the specified .sd file,
+each containing the corresponding ORCA input file
+"""
+
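+# Set IN_BATCHES to split the compounds into batches of BATCH_SIZE
+IN_BATCHES = True
+BATCH_SIZE = 45000
+# Path to the .sd file containing the compounds
+PATH_TO_STRUCTURES_SD = "./example_datasets/demo_compounds_pubchem.sdf"
+# Directory in which the ORCA input folders are created
+INPUT_FOLDER_PATH = "./Input_folder_path"
+CALCULATION_TYPE = "NMR"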
+
+logging.basicConfig(format="%(levelname)s:%(message)s", level=logging.INFO)
+
+total_compounds = read_sdf_compounds(path=PATH_TO_STRUCTURES_SD)
+
+if IN_BATCHES:
+    # Element [1] of each item yielded by batch() is used as the list of compounds
+    batches = [single_batch[1] for single_batch in batch(total_compounds, BATCH_SIZE)]
+    for i, compounds_in_batch in enumerate(batches, start=1):
+        logging.info(f"Preparing batch {i}")
+        save_path = f"{INPUT_FOLDER_PATH}/{len(compounds_in_batch)}_{CALCULATION_TYPE}_batch_{i}"
+        for compound in tqdm(compounds_in_batch):
+            generate_input_file(
+                compound=compound,
+                calc_type=CALCULATION_TYPE,
+                save_path=save_path,
+            )
+
+else:
+    for compound in total_compounds:
+        generate_input_file(
+            compound=compound, calc_type=CALCULATION_TYPE, save_path=INPUT_FOLDER_PATH
+        )
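
The script above reads element `[1]` of every item yielded by `utils.batch`, which suggests that the helper yields pairs whose second element is the list of compounds. The implementation of `utils.batch` is not part of this diff, so the generator below is only a sketch of a helper with that shape; the name `batch_sketch` and the `(index, items)` layout are assumptions.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def batch_sketch(items: Sequence[T], size: int) -> Iterator[Tuple[int, List[T]]]:
    """Yield (batch_index, batch_items) pairs of at most `size` items each."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield start // size, list(items[start:start + size])


# Seven items split into batches of three; the last batch is shorter.
chunks = [chunk for _, chunk in batch_sketch(list(range(7)), 3)]
```

With `BATCH_SIZE = 45000`, a helper of this kind would place any remainder in a final, smaller batch, which matches the script's use of `len(batches[i])` in the folder name.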