-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Template cache searches are too slow #302
Comments
This is a great idea! Apologies it's taken us so long to address. @mikemhenry : Are you able to take a stab at this? |
The only possible fly in the ointment is going to be choosing whose unique SMILES string to use. I imagine most users are either using RDkit or OE exclusively, but if anyone's switching you could end up with cache keys that are a mixture of SMILES from the two, which is unlikely to be useful :). I think you'll have to store which toolkit a given key was generated with, and for entries that don't have a key for the current toolkit fall back to the current slow compare-the-Molecule-object code. Alternatively use InCHI for the keys, but that's might require brining in an extra dependency? |
Great point. We should use the OpenFF Even if someone ends up using the same cache with different toolkit backends, at worst, you'll get a separate copy of each molecule for each toolkit backend, both both should be correct, which will only slow things down a bit. |
Can do, I think using |
Just wanted to add that this hasn't fallen off my radar, I am working on a benchmark script so I can measure the speed-up |
You might want to consider mapped SMILES or even CMILES as the hash, especially if you end up using |
Great suggestion, @mattwthompson ! |
Currently, when a molecule is passed to a template generator it's looked for in the cache (if one is provided). The algorithm for doing this involves creating a Molecule object for each SMILES string in the cache, and then comparing its topology to the molecule being parameterised. In our benchmarks, this means that only around 10 or so cache entries are checked per second, as Molecule creation is expensive. If the cache has been in use for a while, it may well have several thousand entries in it, which means that checking the cache can take a lot more time than just doing the parameterisation (even including the AM1-BCC compute time).
Given that this is just an identity check, I can't see any reason why at least the initial check on the cache isn't just a SMILES string comparison, which would speed things up by several orders of magnitude.
The text was updated successfully, but these errors were encountered: