Template cache searches are too slow #302

mark-mackey-cresset · 2023-09-12T12:53:19Z

Currently, when a molecule is passed to a template generator it's looked for in the cache (if one is provided). The algorithm for doing this involves creating a Molecule object for each SMILES string in the cache, and then comparing its topology to the molecule being parameterised. In our benchmarks, this means that only around 10 or so cache entries are checked per second, as Molecule creation is expensive. If the cache has been in use for a while, it may well have several thousand entries in it, which means that checking the cache can take a lot more time than just doing the parameterisation (even including the AM1-BCC compute time).

Given that this is just an identity check, I can't see any reason why at least the initial check on the cache isn't just a SMILES string comparison, which would speed things up by several orders of magnitude.

jchodera · 2023-10-26T09:24:25Z

This is a great idea! Apologies it's taken us so long to address.

@mikemhenry : Are you able to take a stab at this?

mark-mackey-cresset · 2023-10-26T10:45:55Z

The only possible fly in the ointment is going to be choosing whose unique SMILES string to use. I imagine most users are either using RDkit or OE exclusively, but if anyone's switching you could end up with cache keys that are a mixture of SMILES from the two, which is unlikely to be useful :). I think you'll have to store which toolkit a given key was generated with, and for entries that don't have a key for the current toolkit fall back to the current slow compare-the-Molecule-object code. Alternatively use InCHI for the keys, but that's might require brining in an extra dependency?

jchodera · 2023-10-26T14:42:47Z

Great point.

We should use the OpenFF Molecule.to_smiles() to generate the SMILES string for lookup, which will produce a canonical isomeric SMILES representation that we can cache internally and reproducibly---at least as long as they are still using the same toolkit backend.

Even if someone ends up using the same cache with different toolkit backends, at worst, you'll get a separate copy of each molecule for each toolkit backend, both both should be correct, which will only slow things down a bit.

mikemhenry · 2023-10-31T21:26:29Z

Can do, I think using Molecule.to_smiles() does make the most sense for the key. I've ran into issue juggling different chem informatic toolkits so I think sticking to OpenFF makes the most sense.

mikemhenry · 2023-11-14T23:04:35Z

Just wanted to add that this hasn't fallen off my radar, I am working on a benchmark script so I can measure the speed-up

mattwthompson · 2023-11-17T16:46:56Z

You might want to consider mapped SMILES or even CMILES as the hash, especially if you end up using @functools.lru_cache. We recently ran into a critical-yet-esoteric bug that mapped SMILES would have safeguarded against.

jchodera · 2023-11-19T18:58:05Z

Great suggestion, @mattwthompson !

jchodera added enhancement high-priority labels Oct 26, 2023

mikemhenry self-assigned this Oct 31, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Template cache searches are too slow #302

Template cache searches are too slow #302

mark-mackey-cresset commented Sep 12, 2023

jchodera commented Oct 26, 2023

mark-mackey-cresset commented Oct 26, 2023

jchodera commented Oct 26, 2023 •

edited

Loading

mikemhenry commented Oct 31, 2023

mikemhenry commented Nov 14, 2023

mattwthompson commented Nov 17, 2023

jchodera commented Nov 19, 2023

Template cache searches are too slow #302

Template cache searches are too slow #302

Comments

mark-mackey-cresset commented Sep 12, 2023

jchodera commented Oct 26, 2023

mark-mackey-cresset commented Oct 26, 2023

jchodera commented Oct 26, 2023 • edited Loading

mikemhenry commented Oct 31, 2023

mikemhenry commented Nov 14, 2023

mattwthompson commented Nov 17, 2023

jchodera commented Nov 19, 2023

jchodera commented Oct 26, 2023 •

edited

Loading