NaNs when using torchani #50
Comments
Things I would try first:
If you can provide an example I am happy to take a look.
Here's a bit more information.
It might be related to the pytorch version. I have two environments. In the first environment, all packages are installed with conda. In that environment, the problem always happens. Here are the versions of the most relevant packages.
In the other environment, all the OpenMM related packages are installed from source. In that environment the problem usually does not happen. I have occasionally seen NaNs in it, but they're much less frequent. Here are the package versions.
I tried to downgrade pytorch to 1.11 in the first environment to see if that would fix the problem, but I'm getting version conflicts.
I don’t know if this is related to your problem, but when using …
That doesn't sound related. In my case, the error only happens if we use TorchANI instead of NNPOps to implement the TorchForce.
Are you using NNPOps before or after this PR: openmm/NNPOps#83 ?
The problem isn't in NNPOps. It works correctly. The error happens when using torchani instead. I'm making progress toward narrowing it down. Here is my current simplest code for reproducing it. It creates a mixed system, evaluates the forces, and then immediately evaluates the forces again. When using torchani 2.2.2 and pytorch 1.13.1, they come out different. The error requires that only a small part of the system be modeled with ML. In this script I have 2000 atoms, with only 50 being ML. If I reduce it to 1000 atoms, it works correctly. The error also requires there to be a NonbondedForce in the system, and for it to use a cutoff. It does not need to use periodic boundary conditions, though.

```python
from openmm import *
from openmm.app import *
from openmm.unit import *
from openmmml import MLPotential
import numpy as np

potential = MLPotential('ani2x')
numParticles = 2000
topology = Topology()
chain = topology.addChain()
residue = topology.addResidue('UNK', chain)
system = System()
nb = NonbondedForce()
nb.setNonbondedMethod(NonbondedForce.CutoffNonPeriodic)
system.addForce(nb)
elements = np.random.choice([element.hydrogen, element.carbon, element.nitrogen, element.oxygen], numParticles)
for i in range(numParticles):
    system.addParticle(1.0)
    nb.addParticle(0, 1, 0)
    topology.addAtom(f'{i}', elements[i], residue)
pos = np.random.random((numParticles, 3))
ml_atoms = list(range(numParticles-50, numParticles))
system2 = potential.createMixedSystem(topology, system, ml_atoms, implementation='torchani')
integrator = LangevinMiddleIntegrator(300*kelvin, 1/picosecond, 0.002*picoseconds)
context = Context(system2, integrator)
context.setPositions(pos)
f1 = context.getState(getForces=True).getForces(asNumpy=True)._value
f2 = context.getState(getForces=True).getForces(asNumpy=True)._value
for i in ml_atoms:
    print(f1[i], f2[i])
```
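Eyeballing printed force components gets noisy with 50 atoms. A small helper along these lines (a sketch using only numpy; `compare_forces` is a name introduced here, not part of OpenMM) can quantify how far apart the two evaluations actually are:

```python
import numpy as np

def compare_forces(f1, f2, rtol=1e-3, atol=1e-6):
    """Return (agree, max_dev): whether two force arrays match within
    tolerance, and the largest absolute component-wise deviation."""
    f1 = np.asarray(f1)
    f2 = np.asarray(f2)
    max_dev = float(np.abs(f1 - f2).max())
    return bool(np.allclose(f1, f2, rtol=rtol, atol=atol)), max_dev

# synthetic check: identical arrays agree, a perturbed copy does not
f_ref = np.ones((50, 3))
f_bad = f_ref.copy()
f_bad[0, 0] += 1.0
print(compare_forces(f_ref, f_ref))  # (True, 0.0)
print(compare_forces(f_ref, f_bad))  # (False, 1.0)
```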
This seems to be a bug in PyTorch 1.13. The JIT profile-guided optimisation does something to the torchani model:

```python
from openmm import *
from openmm.app import *
from openmm.unit import *
from openmmml import MLPotential
import numpy as np
import torch

# make the test system
potential = MLPotential('ani2x')
numParticles = 2000
topology = Topology()
chain = topology.addChain()
residue = topology.addResidue('UNK', chain)
system = System()
nb = NonbondedForce()
nb.setNonbondedMethod(NonbondedForce.CutoffNonPeriodic)
system.addForce(nb)
elements = np.random.choice([element.hydrogen, element.carbon, element.nitrogen, element.oxygen], numParticles)
for i in range(numParticles):
    system.addParticle(1.0)
    nb.addParticle(0, 1, 0)
    topology.addAtom(f'{i}', elements[i], residue)
pos = np.random.random((numParticles, 3))
ml_atoms = list(range(numParticles-50, numParticles))
system2 = potential.createMixedSystem(topology, system, ml_atoms, implementation='torchani')

# load the pytorch model
# CPU version for reference forces
model_cpu = torch.jit.load("animodel.pt", map_location="cpu")
pos_cpu = torch.tensor(pos, requires_grad=True, dtype=torch.float32, device="cpu")
e_cpu = model_cpu(pos_cpu)
e_cpu.backward()
f_cpu = -pos_cpu.grad

# load in a CUDA version of the model
model_cuda = torch.jit.load("animodel.pt", map_location="cuda")
# turn on JIT profile-guided optimizations (these will be on by default, I think)
torch._C._jit_set_profiling_executor(True)
torch._C._jit_set_profiling_mode(True)
# later calls (the optimised ones) will fail to compute correct forces
forces_cuda_jitopt = []
N = 5  # num reps
for n in range(N):
    pos_cuda = torch.tensor(pos, requires_grad=True, dtype=torch.float32, device="cuda")
    e_cuda = model_cuda(pos_cuda)
    e_cuda.backward()
    f_cuda = -pos_cuda.grad
    forces_cuda_jitopt.append(f_cuda.cpu().numpy())

# compare
print("compare cuda forces with JIT profile guided optimization enabled")
for n in range(N):
    if np.allclose(f_cpu, forces_cuda_jitopt[n], rtol=1e-3):
        print("n =", n, "forces are correct")
    else:
        print("n =", n, "forces are wrong!")

# now do the same but disable JIT profile-guided optimisations
# load in a CUDA version of the model
model_cuda = torch.jit.load("animodel.pt", map_location="cuda")
torch._C._jit_set_profiling_executor(False)
torch._C._jit_set_profiling_mode(False)
forces_cuda = []
N = 5  # num reps
for n in range(N):
    pos_cuda = torch.tensor(pos, requires_grad=True, dtype=torch.float32, device="cuda")
    e_cuda = model_cuda(pos_cuda)
    e_cuda.backward()
    f_cuda = -pos_cuda.grad
    forces_cuda.append(f_cuda.cpu().numpy())

# compare
print("compare cuda forces with JIT profile guided optimization disabled")
for n in range(N):
    if np.allclose(f_cpu, forces_cuda[n], rtol=1e-3):
        print("n =", n, "forces are correct")
    else:
        print("n =", n, "forces are wrong!")
```
The output I get on an RTX 3090 with PyTorch 1.13.1 and CUDA 11.7 is this:
This seems to be fixable for me by changing a … @peastman, do you get correct forces if you use my fork of torchani? To install in an existing environment:
Your fix works for me. Fantastic work tracking this down! Hopefully they'll release an update soon.
The recommended workaround is turning off NVFuser: aiqm/torchani#628 (comment). (I don't know why changing the `**` to a `float_power` seems to fix it.)
This is the relevant pytorch issue: pytorch/pytorch#84510
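For what it's worth, `float_power` computes the same values as `**` but always in double precision, so the change should be numerically benign; the difference presumably only matters to what code the fuser generates. A quick sketch of the equivalence, using numpy's `float_power` (which mirrors the torch op) rather than the torchani source itself:

```python
import numpy as np

x = np.array([0.5, 1.0, 2.0], dtype=np.float32)
a = x ** 1.5                  # computed in float32
b = np.float_power(x, 1.5)    # promoted to float64 internally
print(np.allclose(a, b, rtol=1e-6))  # True
```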
According to this comment, pytorch/pytorch#84510 (comment), NVFuser is being replaced by NNC. This means that in future PyTorch releases the default TorchScript setting will be to use NNC, but for the current PyTorch 2.0 we will need to tell people to switch from NVFuser to NNC if they want to use TorchANI without NNPOps.
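One way to make that switch locally, sketched below: `torch.jit.fuser` is a context manager where `"fuser1"` selects NNC and `"fuser2"` selects NVFuser (PyTorch ≥ 1.12). The scripted function here is a toy stand-in for the torchani model, not the real potential:

```python
import torch

@torch.jit.script
def toy_energy(x: torch.Tensor) -> torch.Tensor:
    # placeholder for an ML potential: sum of squared coordinates
    return (x ** 2).sum()

x = torch.randn(10, 3, requires_grad=True)
# run the scripted model under the NNC fuser instead of NVFuser
with torch.jit.fuser("fuser1"):
    e = toy_energy(x)
e.backward()
# gradient of sum(x**2) is 2x
print(torch.allclose(x.grad, 2 * x.detach()))  # True
```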
I'm running simulations of mixed ML/MM systems where part is computed with ANI-2x and part with Amber. As long as I specify `implementation='nnpops'` in the call to `createMixedSystem()`, it works well. But if I specify `implementation='torchani'`, the simulation immediately blows up with NaN coordinates. I tried a few molecules and the result is the same for all of them. Does anyone have an idea what could be causing this? I can put together a test case to reproduce the problem, if that's helpful. My current system is too big to post here. Here are the relevant packages from my conda environment.