You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
A comment was added after creating this issue: There was a relevant issue before. See #1027.
1. Bug Description
The bug was identified by the witness of @jonwzheng and @hwpang when searching thermo for C1=CC2C=CC=1C=C2, the result was non-deterministic (try refreshing the page). The issue was first mentioned in #2490. However, this is not actually a decision tree issue, so I decided to make it a separate PR.
2. Reproducing the issue with RMG APIs
import os.path as osp
import rmgpy
from rmgpy.molecule.molecule import Molecule
from rmgpy.data.thermo import ThermoDatabase
tmdb = ThermoDatabase()
tmdb.load(osp.join(rmgpy.settings["database.directory"], "thermo"))
smi = "C1=CC2C=CC=1C=C2"
mol = Molecule().from_smiles(smi)
print(tmdb.compute_group_additivity_thermo(spc.molecule[0]).comment.split(":")[2])
The above block should be the minimal code example to reproduce the issue. You may see different thermo estimation comments from the execution. I found that after import is done, the result is deterministic. So In order to get different results, one needs to restart the kernel and import the module before actually calling the function.
3. Investigate get_deterministic_sssr
Then, I would like to briefly explain the results and the potential cause of its undeterminism.
I trace down the function calls and believe the real cause is that get_deterministic_sssr is not deterministic, at least for the investigating molecule.
I typically see two kinds of results out of get_deterministic_sssr:
SSSR result 1: [[0, 4, 5, 7, 6, 2], [2, 1, 3, 5, 7, 6]]
The corresponding molecule graph:
(note, atom index is 1-indexed in the graph, while SSSR result is 0-indexed)
So both of the two rings identified involve three double bonds ([0,4],[5,7], [7,6]) and ([1,3],[5,7],[7,6])
SSSR result 2: [[2, 1, 4, 3, 0, 5], [0, 3, 4, 1, 7, 6]]
The corresponding molecule graph:
(note, atom index is 1-indexed in the graph, while SSSR result is 0-indexed)
Only the first ring has three double bonds ([2,1],[4,3], [5,2]). The second ring only has two double bonds ([3,4],[7,6]).
Within get_deterministic_sssr, the different results are due to nondeterminism when choosing the so-called root_vertex:
Basically, the root vertexes are iteratively picked according to the connectivity value (get_vertex_connectivity_value), and connectivity values do not take bond orders into account (basically convolutionally counting the number of neighbors). Like the molecule in this example, the atoms on the bridge share the same values for a symmetric skeleton. Therefore, the sorting does not distinguish the atom with/without double bonds...
Then, In the case where the first picked root vertex is on the bridge involving two double bonds (e.g., carbon 7 or 8, see figure below), the smallest ring involving this bridge has 3 double bonds. However, this bridge will be removed after finding the smallest ring, causing the second ring to only have 2 double bonds instead of 3; instead, if the first picked root vertex is on the bridge involving only one double bond (e.g., carbon 1,2,4,5), then both smallest ring will have 3 double bonds, as the algorithm will maximize the total bond order of the ring
In the docstring, get_smallest_set_of_smallest_rings is recommended in "new" applications. However, a preliminary test shows it still suffers from the same issue.
I've also considered adding more sorting keys when choosing root vertex, e.g., considering bond orders. But couldn't figure out a good way to really avoid determinism...
It is possible to use get_relevant_cycles, which is deterministic, and choose a set of rings from the relevant cycles. It involves a more complicated rewrite and possibly redesigning the algorithm, and we still need to be cautious when selecting the rings.
The text was updated successfully, but these errors were encountered:
A comment was added after creating this issue: There was a relevant issue before. See #1027.
1. Bug Description
The bug was identified by the witness of @jonwzheng and @hwpang when searching thermo for
C1=CC2C=CC=1C=C2
, the result was non-deterministic (try refreshing the page). The issue was first mentioned in #2490. However, this is not actually a decision tree issue, so I decided to make it a separate PR.2. Reproducing the issue with RMG APIs
The above block should be the minimal code example to reproduce the issue. You may see different thermo estimation comments from the execution. I found that after
import
is done, the result is deterministic. So In order to get different results, one needs to restart the kernel and import the module before actually calling the function.3. Investigate
get_deterministic_sssr
Then, I would like to briefly explain the results and the potential cause of its undeterminism.
I trace down the function calls and believe the real cause is that
get_deterministic_sssr
is not deterministic, at least for the investigating molecule.The call stack is:
within
ThermoDatabase
get_all_thermo_data
>get_thermo_data_from_groups
>estimate_thermo_via_group_additivity
>compute_group_additivity_thermo
>_add_polycyclic_correction_thermo_data
>get_bicyclic_correction_thermo_data_from_heuristic
>split_bicyclic_into_single_rings
>bicyclic_submol.get_deterministic_sssr()
I typically see two kinds of results out of
get_deterministic_sssr
:SSSR result 1:
[[0, 4, 5, 7, 6, 2], [2, 1, 3, 5, 7, 6]]
The corresponding molecule graph:
(note, atom index is 1-indexed in the graph, while SSSR result is 0-indexed)
So both of the two rings identified involve three double bonds ([0,4],[5,7], [7,6]) and ([1,3],[5,7],[7,6])
SSSR result 2:
[[2, 1, 4, 3, 0, 5], [0, 3, 4, 1, 7, 6]]
The corresponding molecule graph:
(note, atom index is 1-indexed in the graph, while SSSR result is 0-indexed)
Only the first ring has three double bonds ([2,1],[4,3], [5,2]). The second ring only has two double bonds ([3,4],[7,6]).
Within
get_deterministic_sssr
, the different results are due to nondeterminism when choosing the so-calledroot_vertex
:RMG-Py/rmgpy/molecule/molecule.py
Lines 2608 to 2613 in 3b0b82f
The returned value by
sorted(root_candidates_tups, key=lambda tup0: tup0[1:], reverse=True)
Basically, the root vertexes are iteratively picked according to the connectivity value (
get_vertex_connectivity_value
), and connectivity values do not take bond orders into account (basically convolutionally counting the number of neighbors). Like the molecule in this example, the atoms on the bridge share the same values for a symmetric skeleton. Therefore, the sorting does not distinguish the atom with/without double bonds...Then, In the case where the first picked root vertex is on the bridge involving two double bonds (e.g., carbon 7 or 8, see figure below), the smallest ring involving this bridge has 3 double bonds. However, this bridge will be removed after finding the smallest ring, causing the second ring to only have 2 double bonds instead of 3; instead, if the first picked root vertex is on the bridge involving only one double bond (e.g., carbon 1,2,4,5), then both smallest ring will have 3 double bonds, as the algorithm will maximize the total bond order of the ring
RMG-Py/rmgpy/molecule/molecule.py
Lines 2622 to 2632 in 3b0b82f
4. Workaround
In the docstring,
get_smallest_set_of_smallest_rings
is recommended in "new" applications. However, a preliminary test shows it still suffers from the same issue.I've also considered adding more sorting keys when choosing root vertex, e.g., considering bond orders. But couldn't figure out a good way to really avoid determinism...
It is possible to use
get_relevant_cycles
, which is deterministic, and choose a set of rings from the relevant cycles. It involves a more complicated rewrite and possibly redesigning the algorithm, and we still need to be cautious when selecting the rings.The text was updated successfully, but these errors were encountered: