
allegro_pair style and empty partitions #45

Open
nec4 opened this issue Jun 3, 2024 · 7 comments
Labels: bug (Something isn't working)


nec4 commented Jun 3, 2024

Hello all,

Very happy with the tools - thank you for maintaining them and integrating with LAMMPS. I am running some vacuum simulations, and while increasing the number of GPUs (and MPI ranks), I ran into the following issue:

terminate called after throwing an instance of 'std::runtime_error'
  what():  The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/nequip/nn/_grad_output.py", line 32, in forward
        _6 = torch.append(wrt_tensors, data[k])
      func0 = self.func
      data0 = (func0).forward(data, )
               ~~~~~~~~~~~~~~ <--- HERE
      of = self.of
      _7 = [torch.sum(data0[of])]
  File "code/__torch__/nequip/nn/_graph_mixin.py", line 30, in AD_logsumexp_backward
    input1 = (radial_basis).forward(input0, )
    input2 = (spharm).forward(input1, )
    input3 = (allegro).forward(input2, )
              ~~~~~~~~~~~~~~~~ <--- HERE
    input4 = (edge_eng).forward(input3, )
    input5 = (edge_eng_sum).forward(input4, )
  File "code/__torch__/allegro/nn/_allegro.py", line 168, in mean_0
    _n_scalar_outs = self._n_scalar_outs
    _38 = torch.slice(_37, 2, None, _n_scalar_outs[0])
    scalars = torch.reshape(_38, [(torch.size(features1))[0], -1])
              ~~~~~~~~~~~~~ <--- HERE
    features2 = (_04).forward(features1, )
    _39 = annotate(List[Optional[Tensor]], [active_edges0])

Traceback of TorchScript, original code (most recent call last):
  File "/home/bepnickc/miniconda3/envs/allegro_and_lammps/lib/python3.9/site-packages/nequip/nn/_grad_output.py", line 84, in forward
            wrt_tensors.append(data[k])
        # run func
        data = self.func(data)
               ~~~~~~~~~ <--- HERE
        # Get grads
        grads = torch.autograd.grad(
  File "/home/bepnickc/miniconda3/envs/allegro_and_lammps/lib/python3.9/site-packages/nequip/nn/_graph_mixin.py", line 366, in AD_logsumexp_backward
    def forward(self, input: AtomicDataDict.Type) -> AtomicDataDict.Type:
        for module in self:
            input = module(input)
                    ~~~~~~ <--- HERE
        return input
  File "/home/bepnickc/miniconda3/envs/allegro_and_lammps/lib/python3.9/site-packages/allegro/nn/_allegro.py", line 585, in mean_0
            # features has shape [z][mul][k]
            # we know scalars are first
            scalars = features[:, :, : self._n_scalar_outs[layer_index]].reshape(
                      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
                features.shape[0], -1
            )
RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] because the unspecified dimension size -1 can be any value and is ambiguous

It seems that it's related to the way LAMMPS partitions the simulation box as the number of processes increases:

https://docs.lammps.org/Developer_par_part.html

If the rank/partition running the model contains no atoms, the network potential cannot accept the resulting zero-size tensor. Is this the intended behaviour of Allegro models in LAMMPS?
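
For reference, the failure is easy to reproduce outside of LAMMPS. Below is a minimal Python sketch of my understanding of the failure mode (the shapes are illustrative, and the explicit reshape at the end just demonstrates the ambiguity; it is not the actual pair_allegro code):

import torch

# A rank with no atoms produces zero edges, so the per-edge feature tensor
# arrives with shape [0, mul, k]; reshaping 0 elements with -1 is ambiguous.
features = torch.zeros(0, 8, 4)  # [edges, mul, k] with zero edges
try:
    features[:, :, :2].reshape(features.shape[0], -1)
except RuntimeError as e:
    print(e)  # "cannot reshape tensor of 0 elements into shape [0, -1] ..."

# Spelling out the trailing dimension keeps the zero-edge case well-defined:
scalars = features[:, :, :2].reshape(features.shape[0], 8 * 2)
print(scalars.shape)  # torch.Size([0, 16])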

Happy to run some more tests/provide more information if needed.

anjohan added the bug label Jun 4, 2024

anjohan (Collaborator) commented Jun 4, 2024

Hi,

This is certainly not the intended behavior and a bug that we'll fix. I think it should work already if using Kokkos.

You may, however, also want to avoid having completely empty domains in your simulations. LAMMPS has functionality for resizing domains using balance (statically) or fix balance (periodically), see the LAMMPS documentation. If you're brave, you can try this in combination with comm_style tiled.

anjohan added a commit that referenced this issue Jun 4, 2024
…also force C++17 because of insert_or_assign. Compiles but untested.
nec4 (Author) commented Jun 5, 2024

Thanks for the response! Of course, it was just something I noticed happening rarely in a vacuum simulation :). I will check out those LAMMPS options.

samueldyoung29ctr commented

I'm experiencing this behavior with the latest multicut branch of pair_allegro on CPU LAMMPS (2 August 2023, Update 3). I did a couple of epochs of training on the aspirin dataset (https://github.com/mir-group/allegro/blob/main/configs/example.yaml) and was trying to test that force field on a geometry built from the first frame of the aspirin dataset.

Even when running with a single MPI process (which I would think would prevent empty domains), I'm still getting this "cannot reshape tensor" error. Any ideas on what might be happening?

Packages used to build pair_allegro
  • slurm (pmi)
  • scl/gcc-toolset-10 (GCC 10 compilers)
  • intel/mkl/2024.1 (to be able to compile against libtorch, but not used for FFTs)
  • intel/compiler-rt/2024.1.0 (also seems to be needed to use libtorch)
  • mpich/gnu/3.3.2 (known working MPI implementation)
  • libtorch/2.0.0+cpu-cxx11-abi (known working version of libtorch on NERSC Perlmutter's lammps_allegro image)
  • amd/aocl/gcc/4.0 (Using AOCL-provided FFTW3 libraries for FFTs, built with GCC)
cmake command to build LAMMPS after patching with pair_allegro "multicut" branch
cmake ../cmake \
    -D CMAKE_BUILD_TYPE=Debug \
    -D LAMMPS_EXCEPTIONS=ON \
    -D BUILD_SHARED_LIBS=ON \
    -D BUILD_MPI=yes \
    -D BUILD_OMP=yes \
    -C ../cmake/presets/kokkos-openmp.cmake \
    -D PKG_KOKKOS=yes \
    -D Kokkos_ARCH_ZEN3=yes \
    -D BUILD_TOOLS=no \
    -D FFT=FFTW3 \
    -D FFT_KOKKOS=FFTW3 \
    -D FFTW3_INCLUDE_DIR=$AOCL_ROOT/include \
    -D FFTW3_LIBRARY=$AOCL_LIB/libfftw3.so \
    -D FFTW3_OMP_LIBRARY=$AOCL_LIB/libfftw3_omp.so \
    -D CMAKE_INSTALL_PREFIX="$LAMMPS_ROOT" \
    -D PKG_MANYBODY=yes \
    -D PKG_MOLECULE=yes \
    -D PKG_KSPACE=yes \
    -D PKG_REPLICA=yes \
    -D PKG_ASPHERE=yes \
    -D PKG_RIGID=yes \
    -D PKG_MPIIO=yes \
    -D PKG_COMPRESS=yes \
    -D PKG_H5MD=no \
    -D PKG_OPENMP=yes \
    -D CMAKE_POSITION_INDEPENDENT_CODE=yes \
    -D CMAKE_EXE_FLAGS="-dynamic" \
    -D FFT_FFTW_THREADS=on

$LAMMPS_ROOT is the custom prefix where I run make install for LAMMPS after compilation. $AOCL_ROOT and $AOCL_LIB are the installation location and library location of AOCL 4.0 on my system.

Input geometry (aspirin-with-topo.data)
LAMMPS data file. CGCMM style. atom_style full generated by VMD/TopoTools v1.8 on Wed Jul 10 16:22:01 -0400 2024
 21 atoms
 21 bonds
 0 angles
 0 dihedrals
 0 impropers
 3 atom types
 1 bond types
 0 angle types
 0 dihedral types
 0 improper types
 -5.163345 4.836655  xlo xhi
 -5.155887 4.844113  ylo yhi
 -5.014088 4.985912  zlo zhi

# Pair Coeffs
#
# 1  C
# 2  H
# 3  O

# Bond Coeffs
#
# 1  

 Masses

 1 12.010700 # C
 2 1.007940 # H
 3 15.999400 # O

 Atoms # full

1 1 1 0.000000 2.134489 -0.984361 -0.195218 # C 
2 1 1 0.000000 0.762644 0.959414 -1.679929 # C 
3 1 1 0.000000 2.660345 -0.407926 -1.307304 # C 
4 1 1 0.000000 1.910317 0.393966 -2.147020 # C 
5 1 1 0.000000 -3.030190 1.495405 0.719662 # C 
6 1 1 0.000000 0.849425 -0.550871 0.284375 # C 
7 1 1 0.000000 0.238447 0.473506 -0.404422 # C 
8 1 3 0.000000 0.897896 -2.276432 1.730061 # O 
9 1 3 0.000000 -2.383452 0.417779 -1.462857 # O 
10 1 3 0.000000 -0.476201 -0.529087 2.339259 # O 
11 1 1 0.000000 0.392992 -1.190237 1.537982 # C 
12 1 1 0.000000 -2.122985 0.951760 -0.397705 # C 
13 1 3 0.000000 -0.804666 1.286246 0.110509 # O 
14 1 2 0.000000 -0.493803 -1.186792 3.095974 # H 
15 1 2 0.000000 2.554735 -1.802497 0.392131 # H 
16 1 2 0.000000 0.330690 1.855711 -2.345264 # H 
17 1 2 0.000000 3.803794 -0.493719 -1.456203 # H 
18 1 2 0.000000 2.231141 0.557186 -3.124150 # H 
19 1 2 0.000000 -2.708930 2.484658 0.926928 # H 
20 1 2 0.000000 -4.130483 1.482167 0.431266 # H 
21 1 2 0.000000 -2.874148 1.003209 1.699485 # H 

 Bonds

1 1 1 3
2 1 1 15
3 1 1 6
4 1 2 4
5 1 2 16
6 1 2 7
7 1 3 4
8 1 3 17
9 1 4 18
10 1 5 12
11 1 5 20
12 1 5 19
13 1 5 21
14 1 6 7
15 1 6 11
16 1 7 13
17 1 8 11
18 1 9 12
19 1 10 11
20 1 10 14
21 1 12 13


LAMMPS input file (input.lammps)
# PART A - ENERGY MINIMIZATION
# 1) Initialization
units metal
dimension 3
atom_style full

boundary p p p

# 2) System definition
read_data aspirin-with-topo.data

# 3) Simulation settings
pair_style allegro3232 # Accidentally trained with float32 dtypes
pair_coeff * * aspirin-model_11-jul-2024.pth C H O

# 4) Visualization
thermo 1
thermo_style custom step time temp pe ke etotal epair ebond econserve fmax

# Also want to dump the CG minimization trajectory
dump mintraj all atom 1 minimization.lammpstrj

# 5) Run
minimize 1.0e-4 1.0e-6 1000 10000
undump mintraj

# PART B - MOLECULAR DYNAMICS
delete_atoms overlap 0.1 all all

# Logging
thermo 10

# Try to rebuild neighbor lists more often
neigh_modify every 1 delay 0 check yes

# Run MD
fix mynve all nve
fix mylgv all langevin 1.0 1.0 0.1 1530917

# Be sure to dump the MD trajectory
dump mdtraj all atom 1 mdtraj.lammpstrj
dump mdforces all custom 1 mdforces.lammpstrj x y z vx vy vz fx fy fz 

timestep 0.5
run 1000
undump mdtraj
Command to run LAMMPS
srun lmp -k on -sf kk -pk kokkos neigh full -in input.lammps
LAMMPS output
LAMMPS (2 Aug 2023 - Update 3)
KOKKOS mode with Kokkos version 3.7.2 is enabled (src/KOKKOS/kokkos.cpp:108)
  using 1 OpenMP thread(s) per MPI task
package kokkos
package kokkos neigh full
# PART A - ENERGY MINIMIZATION
# 1) Initialization
units metal
dimension 3
atom_style full

boundary p p p

# 2) System definition
read_data aspirin-with-topo.data
Reading data file ...
  orthogonal box = (-5.163345 -5.155887 -5.014088) to (4.836655 4.844113 4.985912)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  21 atoms
  scanning bonds ...
  4 = max bonds/atom
  reading bonds ...
  21 bonds
Finding 1-2 1-3 1-4 neighbors ...
  special bond factors lj:    0        0        0       
  special bond factors coul:  0        0        0       
     4 = max # of 1-2 neighbors
     6 = max # of 1-3 neighbors
    13 = max # of 1-4 neighbors
    15 = max # of special neighbors
  special bonds CPU = 0.004 seconds
  read_data CPU = 0.018 seconds

# 3) Simulation settings
pair_style allegro3232
pair_coeff * * aspirin-model_11-jul-2024.pth C H O

# 4) Visualization
thermo 1
thermo_style custom step time temp pe ke etotal epair ebond econserve fmax

# Also want to dump the CG minimization trajectory
dump mintraj all atom 1 minimization.lammpstrj

# 5) Run
minimize 1.0e-4 1.0e-6 1000 10000
WARNING: Using a manybody potential with bonds/angles/dihedrals and special_bond exclusions (src/pair.cpp:242)
WARNING: Bonds are defined but no bond style is set (src/force.cpp:193)
WARNING: Likewise 1-2 special neighbor interactions != 1.0 (src/force.cpp:195)
Neighbor list info ...
  update: every = 1 steps, delay = 0 steps, check = yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 8
  ghost atom cutoff = 8
  binsize = 4, bins = 3 3 3
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair allegro3232/kk, perpetual
      attributes: full, newton on, kokkos_device
      pair build: full/bin/kk/device
      stencil: full/bin/3d
      bin: kk/device
Per MPI rank memory allocation (min/avg/max) = 8.928 | 8.928 | 8.928 Mbytes
   Step          Time           Temp          PotEng         KinEng         TotEng         E_pair         E_bond       Econserve         Fmax     
         0   0              0             -405675.22      0             -405675.22     -405675.22      0             -405675.22      13.409355    
         1   0.001          0             -405679.39      0             -405679.39     -405679.39      0             -405679.39      6.1880095    
Loop time of 4.11094 on 1 procs for 1 steps with 21 atoms

96.3% CPU use with 1 MPI tasks x 1 OpenMP threads

Minimization stats:
  Stopping criterion = energy tolerance
  Energy initial, next-to-last, final = 
         -405675.21875      -405675.21875  -405679.388671875
  Force two-norm initial, final = 27.207462 16.492158
  Force max component initial, final = 13.409355 6.1880095
  Final line search alpha, max atom move = 0.0074574801 0.046146958
  Iterations, force evaluations = 1 1

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 4.1102     | 4.1102     | 4.1102     |   0.0 | 99.98
Bond    | 7.06e-06   | 7.06e-06   | 7.06e-06   |   0.0 |  0.00
Neigh   | 0          | 0          | 0          |   0.0 |  0.00
Comm    | 4.4729e-05 | 4.4729e-05 | 4.4729e-05 |   0.0 |  0.00
Output  | 0          | 0          | 0          |   0.0 |  0.00
Modify  | 0          | 0          | 0          |   0.0 |  0.00
Other   |            | 0.0006423  |            |       |  0.02

Nlocal:             21 ave          21 max          21 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost:            510 ave         510 max         510 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs:              0 ave           0 max           0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs:          596 ave         596 max         596 min
Histogram: 1 0 0 0 0 0 0 0 0 0

Total # of neighbors = 596
Ave neighs/atom = 28.380952
Ave special neighs/atom = 8.5714286
Neighbor list builds = 0
Dangerous builds = 0
LAMMPS stderr
Exception: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/nequip/nn/_graph_model.py", line 29, in forward
        pass
    model = self.model
    return (model).forward(new_data, )
            ~~~~~~~~~~~~~~ <--- HERE
  File "code/__torch__/nequip/nn/_rescale.py", line 21, in batch_norm
    data: Dict[str, Tensor]) -> Dict[str, Tensor]:
    model = self.model
    data0 = (model).forward(data, )
             ~~~~~~~~~~~~~~ <--- HERE
    training = self.training
    if training:
  File "code/__torch__/nequip/nn/_grad_output.py", line 71, in layer_norm
      pass
    func = self.func
    data0 = (func).forward(data, )
             ~~~~~~~~~~~~~ <--- HERE
    _17 = [torch.sum(data0["total_energy"])]
    _18 = [pos, data0["_displacement"]]
  File "code/__torch__/nequip/nn/_graph_mixin.py", line 28, in AD_sum_backward
    input1 = (radial_basis).forward(input0, )
    input2 = (spharm).forward(input1, )
    input3 = (allegro).forward(input2, )
              ~~~~~~~~~~~~~~~~ <--- HERE
    input4 = (edge_eng).forward(input3, )
    input5 = (edge_eng_sum).forward(input4, )
  File "code/__torch__/allegro/nn/_allegro.py", line 168, in mean_0
    _n_scalar_outs = self._n_scalar_outs
    _38 = torch.slice(_37, 2, None, _n_scalar_outs[0])
    scalars = torch.reshape(_38, [(torch.size(features1))[0], -1])
              ~~~~~~~~~~~~~ <--- HERE
    features2 = (_04).forward(features1, )
    _39 = annotate(List[Optional[Tensor]], [active_edges0])

Traceback of TorchScript, original code (most recent call last):
  File "<path_to_nequip_conda_env>/lib/python3.10/site-packages/nequip/nn/_graph_model.py", line 112, in forward
                new_data[k] = v
        # run the model
        data = self.model(new_data)
               ~~~~~~~~~~ <--- HERE
        return data
  File "<path_to_nequip_conda_env>/lib/python3.10/site-packages/nequip/nn/_rescale.py", line 144, in batch_norm
    def forward(self, data: AtomicDataDict.Type) -> AtomicDataDict.Type:
        data = self.model(data)
               ~~~~~~~~~~ <--- HERE
        if self.training:
            # no scaling, but still need to promote for consistent dtype behavior
  File "<path_to_nequip_conda_env>/lib/python3.10/site-packages/nequip/nn/_grad_output.py", line 305, in layer_norm
    
        # Call model and get gradients
        data = self.func(data)
               ~~~~~~~~~ <--- HERE
    
        grads = torch.autograd.grad(
  File "<path_to_nequip_conda_env>/lib/python3.10/site-packages/nequip/nn/_graph_mixin.py", line 366, in AD_sum_backward
    def forward(self, input: AtomicDataDict.Type) -> AtomicDataDict.Type:
        for module in self:
            input = module(input)
                    ~~~~~~ <--- HERE
        return input
  File "<path_to_nequip_conda_env>/lib/python3.10/site-packages/allegro/nn/_allegro.py", line 585, in mean_0
            # features has shape [z][mul][k]
            # we know scalars are first
            scalars = features[:, :, : self._n_scalar_outs[layer_index]].reshape(
                      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
                features.shape[0], -1
            )
RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] because the unspecified dimension size -1 can be any value and is ambiguous

application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
slurmstepd: error: *** STEP 3194616.0 ON <nodeid> CANCELLED AT 2024-07-11T20:47:53 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: <nodeid>: task 0: Killed

Linux-cpp-lisp (Collaborator) commented

Hi @samueldyoung29ctr,

This should be fixed now on main; we've since cleaned everything up and merged it down. If you can confirm, I will close this issue. Thanks!

samueldyoung29ctr commented

@Linux-cpp-lisp, thanks for the tip. I think I actually had a bad timestep and/or starting geometry. After fixing those, my simple Allegro model appears to evaluate correctly with pair_allegro compiled from both the multicut and main branches.

Linux-cpp-lisp (Collaborator) commented

Great! Glad it's resolved. (You can probably stay on main then, since it is the latest.)


samueldyoung29ctr commented Aug 9, 2024

Hi @Linux-cpp-lisp, I'm running into this error again and am looking for some more guidance. This time, I tried a simpler system: water in a box. I trained an Allegro model on the dataset_1593.xyz dataset (which I converted to ASE .traj format) from Cheng et al., "Ab initio thermodynamics of liquid and solid water". I used a ~90/10 train/validation split, l_max = 2, and altered some of the default numbers of MLP layers. This dataset has some water molecules outside the simulation box, so I trained both with and without first wrapping the atoms into the box.
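
For completeness, the conversion and wrapping were done along these lines (a sketch with illustrative filenames):

from ase.io import read, write

# Read every frame of the extxyz dataset, optionally wrap the positions back
# into the periodic cell, and write the .traj file used as dataset_file_name.
frames = read("dataset_1593.xyz", index=":")
for atoms in frames:
    atoms.wrap()  # skip this loop for the unwrapped variant
write("data.traj", frames)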

NequIP configuration for unwrapped dataset.

The configuration for the wrapped dataset differs only in the dataset_file_name and run_name, which reflect the wrapped version of the dataset I used.

BesselBasis_trainable: false
PolynomialCutoff_p: 48
append: true
ase_args:
  format: traj
avg_num_neighbors: auto
batch_size: 1
chemical_symbols:
- H
- O
dataset: ase
dataset_file_name: ./data.traj
dataset_seed: 123456
default_dtype: float32
early_stopping_lower_bounds:
  LR: 1.0e-05
early_stopping_patiences:
  validation_loss: 1000
early_stopping_upper_bounds:
  cumulative_wall: 604800.0
edge_eng_mlp_initialization: uniform
edge_eng_mlp_latent_dimensions:
- 32
edge_eng_mlp_nonlinearity: null
ema_decay: 0.99
ema_use_num_updates: true
embed_initial_edge: true
env_embed_mlp_initialization: uniform
env_embed_mlp_latent_dimensions: []
env_embed_mlp_nonlinearity: null
env_embed_multiplicity: 64
l_max: 2
latent_mlp_initialization: uniform
latent_mlp_latent_dimensions:
- 64
- 64
- 64
- 64
latent_mlp_nonlinearity: silu
latent_resnet: true
learning_rate: 0.001
loss_coeffs:
  forces: 1.0
  total_energy:
  - 1.0
  - PerAtomMSELoss
lr_scheduler_factor: 0.5
lr_scheduler_name: ReduceLROnPlateau
lr_scheduler_patience: 25
max_epochs: 1000000
metrics_key: validation_loss
model_builders:
- allegro.model.Allegro
- PerSpeciesRescale
- StressForceOutput
- RescaleEnergyEtc
n_train: 1434
n_val: 159
num_layers: 1
optimizer_name: Adam
optimizer_params:
  amsgrad: false
  betas: !!python/tuple
  - 0.9
  - 0.999
  eps: 1.0e-08
  weight_decay: 0.0
parity: o3_full
r_max: 4.0
root: <root name>
run_name: <run name>
seed: 123456
shuffle: true
train_val_split: random
two_body_latent_mlp_initialization: uniform
two_body_latent_mlp_latent_dimensions:
- 32
- 64
two_body_latent_mlp_nonlinearity: silu
use_ema: true
verbose: debug
wandb: true
wandb_project: <project name>

The Conda environment I am using to train has NequIP 0.6.0 and Allegro symlinked as a development package at commit 22f673c.

Conda environment used to train Allegro model
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                  2_kmp_llvm    conda-forge
appdirs                   1.4.4              pyh9f0ad1d_0    conda-forge
ase                       3.23.0             pyhd8ed1ab_0    conda-forge
asttokens                 2.4.1              pyhd8ed1ab_0    conda-forge
blinker                   1.8.2              pyhd8ed1ab_0    conda-forge
brotli                    1.1.0                hd590300_1    conda-forge
brotli-bin                1.1.0                hd590300_1    conda-forge
brotli-python             1.1.0           py310hc6cd4ac_1    conda-forge
bzip2                     1.0.8                h4bc722e_7    conda-forge
ca-certificates           2024.7.4             hbcca054_0    conda-forge
certifi                   2024.7.4           pyhd8ed1ab_0    conda-forge
cffi                      1.16.0          py310h2fee648_0    conda-forge
charset-normalizer        3.3.2              pyhd8ed1ab_0    conda-forge
click                     8.1.7           unix_pyh707e725_0    conda-forge
colorama                  0.4.6              pyhd8ed1ab_0    conda-forge
comm                      0.2.2              pyhd8ed1ab_0    conda-forge
contourpy                 1.2.1           py310hd41b1e2_0    conda-forge
cuda-version              11.8                 h70ddcb2_3    conda-forge
cudatoolkit               11.8.0              h4ba93d1_13    conda-forge
cudnn                     8.9.7.29             hbc23b4c_3    conda-forge
cycler                    0.12.1             pyhd8ed1ab_0    conda-forge
debugpy                   1.8.2           py310h76e45a6_0    conda-forge
decorator                 5.1.1              pyhd8ed1ab_0    conda-forge
docker-pycreds            0.4.0                      py_0    conda-forge
e3nn                      0.4.4              pyhd8ed1ab_1    conda-forge
exceptiongroup            1.2.0              pyhd8ed1ab_2    conda-forge
executing                 2.0.1              pyhd8ed1ab_0    conda-forge
flask                     3.0.3              pyhd8ed1ab_0    conda-forge
fonttools                 4.53.0          py310hc51659f_0    conda-forge
freetype                  2.12.1               h267a509_2    conda-forge
gitdb                     4.0.11             pyhd8ed1ab_0    conda-forge
gitpython                 3.1.43             pyhd8ed1ab_0    conda-forge
gmp                       6.3.0                hac33072_2    conda-forge
gmpy2                     2.1.5           py310hc7909c9_1    conda-forge
h2                        4.1.0              pyhd8ed1ab_0    conda-forge
hpack                     4.0.0              pyh9f0ad1d_0    conda-forge
hyperframe                6.0.1              pyhd8ed1ab_0    conda-forge
icu                       73.2                 h59595ed_0    conda-forge
idna                      3.7                pyhd8ed1ab_0    conda-forge
importlib-metadata        8.0.0              pyha770c72_0    conda-forge
importlib_metadata        8.0.0                hd8ed1ab_0    conda-forge
ipykernel                 6.29.5             pyh3099207_0    conda-forge
ipython                   8.26.0             pyh707e725_0    conda-forge
itsdangerous              2.2.0              pyhd8ed1ab_0    conda-forge
jedi                      0.19.1             pyhd8ed1ab_0    conda-forge
jinja2                    3.1.4              pyhd8ed1ab_0    conda-forge
joblib                    1.4.2              pyhd8ed1ab_0    conda-forge
jupyter_client            8.6.2              pyhd8ed1ab_0    conda-forge
jupyter_core              5.7.2           py310hff52083_0    conda-forge
keyutils                  1.6.1                h166bdaf_0    conda-forge
kiwisolver                1.4.5           py310hd41b1e2_1    conda-forge
krb5                      1.21.3               h659f571_0    conda-forge
lcms2                     2.16                 hb7c19ff_0    conda-forge
ld_impl_linux-64          2.40                 hf3520f5_7    conda-forge
lerc                      4.0.0                h27087fc_0    conda-forge
libblas                   3.9.0            16_linux64_mkl    conda-forge
libbrotlicommon           1.1.0                hd590300_1    conda-forge
libbrotlidec              1.1.0                hd590300_1    conda-forge
libbrotlienc              1.1.0                hd590300_1    conda-forge
libcblas                  3.9.0            16_linux64_mkl    conda-forge
libdeflate                1.20                 hd590300_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 14.1.0               h77fa898_0    conda-forge
libgfortran-ng            14.1.0               h69a702a_0    conda-forge
libgfortran5              14.1.0               hc5f4f2c_0    conda-forge
libhwloc                  2.11.0          default_h5622ce7_1000    conda-forge
libiconv                  1.17                 hd590300_2    conda-forge
libjpeg-turbo             3.0.0                hd590300_1    conda-forge
liblapack                 3.9.0            16_linux64_mkl    conda-forge
libnsl                    2.0.1                hd590300_0    conda-forge
libopenblas               0.3.27          pthreads_hac2b453_1    conda-forge
libpng                    1.6.43               h2797004_0    conda-forge
libprotobuf               3.20.3               h3eb15da_0    conda-forge
libsodium                 1.0.18               h36c2ea0_1    conda-forge
libsqlite                 3.46.0               hde9e2c9_0    conda-forge
libstdcxx-ng              14.1.0               hc0a3c3a_0    conda-forge
libtiff                   4.6.0                h1dd3fc0_3    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libwebp-base              1.4.0                hd590300_0    conda-forge
libxcb                    1.16                 hd590300_0    conda-forge
libxcrypt                 4.4.36               hd590300_1    conda-forge
libxml2                   2.12.7               h4c95cb1_3    conda-forge
libzlib                   1.3.1                h4ab18f5_1    conda-forge
llvm-openmp               18.1.8               hf5423f3_0    conda-forge
magma                     2.5.4                hc72dce7_4    conda-forge
markupsafe                2.1.5           py310h2372a71_0    conda-forge
matplotlib-base           3.8.4           py310hef631a5_2    conda-forge
matplotlib-inline         0.1.7              pyhd8ed1ab_0    conda-forge
mir-allegro               0.2.0                     dev_0    <develop>
mkl                       2022.2.1         h6508926_16999    conda-forge
mpc                       1.3.1                hfe3b2da_0    conda-forge
mpfr                      4.2.1                h9458935_1    conda-forge
mpmath                    1.3.0              pyhd8ed1ab_0    conda-forge
munkres                   1.1.4              pyh9f0ad1d_0    conda-forge
nccl                      2.22.3.1             hee583db_0    conda-forge
ncurses                   6.5                  h59595ed_0    conda-forge
nequip                    0.6.0                     dev_0    <develop>
nest-asyncio              1.6.0              pyhd8ed1ab_0    conda-forge
ninja                     1.12.1               h297d8ca_0    conda-forge
numpy                     1.26.4          py310hb13e2d6_0    conda-forge
openjpeg                  2.5.2                h488ebb8_0    conda-forge
openssl                   3.3.1                h4ab18f5_1    conda-forge
opt-einsum                3.3.0                hd8ed1ab_2    conda-forge
opt_einsum                3.3.0              pyhc1e730c_2    conda-forge
opt_einsum_fx             0.1.4              pyhd8ed1ab_0    conda-forge
packaging                 24.1               pyhd8ed1ab_0    conda-forge
parso                     0.8.4              pyhd8ed1ab_0    conda-forge
pathtools                 0.1.2                      py_1    conda-forge
pexpect                   4.9.0              pyhd8ed1ab_0    conda-forge
pickleshare               0.7.5                   py_1003    conda-forge
pillow                    10.4.0          py310hebfe307_0    conda-forge
pip                       24.0               pyhd8ed1ab_0    conda-forge
platformdirs              4.2.2              pyhd8ed1ab_0    conda-forge
prompt-toolkit            3.0.47             pyha770c72_0    conda-forge
protobuf                  3.20.3          py310heca2aa9_1    conda-forge
psutil                    6.0.0           py310hc51659f_0    conda-forge
pthread-stubs             0.4               h36c2ea0_1001    conda-forge
ptyprocess                0.7.0              pyhd3deb0d_0    conda-forge
pure_eval                 0.2.2              pyhd8ed1ab_0    conda-forge
pycparser                 2.22               pyhd8ed1ab_0    conda-forge
pygments                  2.18.0             pyhd8ed1ab_0    conda-forge
pyparsing                 3.1.2              pyhd8ed1ab_0    conda-forge
pysocks                   1.7.1           py310hff52083_5    conda-forge
python                    3.10.14         hd12c33a_0_cpython    conda-forge
python-dateutil           2.9.0post0      py310h06a4308_2  
python_abi                3.10                    4_cp310    conda-forge
pytorch                   1.11.0          cuda112py310h51fe464_202    conda-forge
pyyaml                    6.0.1           py310h2372a71_1    conda-forge
pyzmq                     26.0.3          py310h6883aea_0    conda-forge
readline                  8.2                  h8228510_1    conda-forge
requests                  2.32.3             pyhd8ed1ab_0    conda-forge
scikit-learn              1.0.1           py310h1246948_3    conda-forge
scipy                     1.14.0          py310h93e2701_1    conda-forge
sentry-sdk                2.10.0             pyhd8ed1ab_0    conda-forge
setproctitle              1.3.3           py310h2372a71_0    conda-forge
setuptools                70.1.1             pyhd8ed1ab_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
sleef                     3.5.1                h9b69904_2    conda-forge
smmap                     5.0.0              pyhd8ed1ab_0    conda-forge
stack_data                0.6.2              pyhd8ed1ab_0    conda-forge
sympy                     1.12.1          pypyh2585a3b_103    conda-forge
tbb                       2021.12.0            h434a139_2    conda-forge
threadpoolctl             3.5.0              pyhc1e730c_0    conda-forge
tk                        8.6.13          noxft_h4845f30_101    conda-forge
torch-ema                 0.3                pyhd8ed1ab_0    conda-forge
torch-runstats            0.2.0              pyhd8ed1ab_0    conda-forge
tornado                   6.4.1           py310hc51659f_0    conda-forge
tqdm                      4.66.4             pyhd8ed1ab_0    conda-forge
traitlets                 5.14.3             pyhd8ed1ab_0    conda-forge
typing-extensions         4.12.2               hd8ed1ab_0    conda-forge
typing_extensions         4.12.2             pyha770c72_0    conda-forge
tzdata                    2024a                h0c530f3_0    conda-forge
unicodedata2              15.1.0          py310h2372a71_0    conda-forge
urllib3                   2.2.2              pyhd8ed1ab_1    conda-forge
wandb                     0.16.6             pyhd8ed1ab_0    conda-forge
wcwidth                   0.2.13             pyhd8ed1ab_0    conda-forge
werkzeug                  3.0.3              pyhd8ed1ab_0    conda-forge
wheel                     0.43.0             pyhd8ed1ab_1    conda-forge
xorg-libxau               1.0.11               hd590300_0    conda-forge
xorg-libxdmcp             1.1.3                h7f98852_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
yaml                      0.2.5                h7f98852_2    conda-forge
zeromq                    4.3.5                h75354e8_4    conda-forge
zipp                      3.19.2             pyhd8ed1ab_0    conda-forge
zstandard                 0.23.0          py310h64cae3c_0    conda-forge
zstd                      1.5.6                ha6fb4c9_0    conda-forge

Training was done on NVIDIA A100 GPUs. The training converges quickly, since this dataset doesn't have very large forces on the atoms. I then deployed the models to standalone format:

nequip-deploy build --train-dir "<path to train dir of model for unwrapped data>" model-unwrapped_08-aug-2024.pth

nequip-deploy build --train-dir "<path to train dir of model for wrapped data>" model-wrapped_08-aug-2024.pth
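
Before handing these files to LAMMPS, one can load them back in Python as a sanity check. A sketch (load_deployed_model and the metadata key names follow nequip 0.6.x, so check them against your installed version):

from nequip.scripts.deploy import load_deployed_model

# Load the deployed TorchScript model on CPU and inspect the metadata baked
# in at deploy time (cutoff, type names, dtype), to confirm it matches the
# pair_coeff setup in the LAMMPS input.
model, metadata = load_deployed_model("model-unwrapped_08-aug-2024.pth", device="cpu")
print(metadata.get("r_max"), metadata.get("type_names"), metadata.get("default_dtype"))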

I then used the standalone model files to run LAMMPS jobs on a different cluster where I have more compute time. I compiled LAMMPS for CPU with pair_allegro and Kokkos (a combination which apparently is not yet available on NERSC). I used the 02Aug2023 version of LAMMPS and, based on your previous advice, patched it with pair_allegro commit 20538c9, which is the current main and one commit later than the patch (89e3ce1) that was supposed to fix empty domains in non-Kokkos simulations. I compiled LAMMPS with GCC 10 compilers, FFTW3 from AMD AOCL 4.0 (built with GCC compilers), and GNU MPICH 3.3.2, and also exposed Intel oneAPI 2024.1 MKL libraries to satisfy the CMake configuration step. I linked against libtorch 2.0.0 (CPU, CXX11 ABI, available here) based on the advice of NERSC consultants.

My CMake generation command
cmake ../cmake \
    -D CMAKE_BUILD_TYPE=Debug \
    -D LAMMPS_EXCEPTIONS=ON \
    -D BUILD_SHARED_LIBS=ON \
    -D BUILD_MPI=yes \
    -D BUILD_OMP=yes \
    -C ../cmake/presets/kokkos-openmp.cmake \
    -D PKG_KOKKOS=yes \
    -D Kokkos_ARCH_ZEN3=yes \
    -D BUILD_TOOLS=no \
    -D FFT=FFTW3 \
    -D FFT_KOKKOS=FFTW3 \
    -D FFTW3_INCLUDE_DIR=$AOCL_ROOT/include \
    -D FFTW3_LIBRARY=$AOCL_LIB/libfftw3.so \
    -D FFTW3_OMP_LIBRARY=$AOCL_LIB/libfftw3_omp.so \
    -D CMAKE_INSTALL_PREFIX="$LAMMPS_ROOT" \
    -D PKG_MANYBODY=yes \
    -D PKG_MOLECULE=yes \
    -D PKG_KSPACE=yes \
    -D PKG_REPLICA=yes \
    -D PKG_ASPHERE=yes \
    -D PKG_RIGID=yes \
    -D PKG_MPIIO=yes \
    -D PKG_COMPRESS=yes \
    -D PKG_H5MD=no \
    -D PKG_OPENMP=yes \
    -D CMAKE_POSITION_INDEPENDENT_CODE=yes \
    -D CMAKE_EXE_FLAGS="-dynamic" \
    -D FFT_FFTW_THREADS=on

I set up several LAMMPS jobs using four randomly selected frames of the dataset_1593.xyz dataset as initial geometries, together with my two trained Allegro models (model-wrapped_08-aug-2024.pth and model-unwrapped_08-aug-2024.pth).

Example LAMMPS input script (input.lammps)
# PART A - ENERGY MINIMIZATION
# 1) Initialization
units metal
dimension 3
atom_style atomic

boundary p p p

# 2) System definition

# initial_frame.data will be written into the working directory where this
# script is located.
read_data initial_frame.data

# 3) Simulation settings
# pair_style lj/cut 2.5
mass 1 2.016
mass 2 15.999

pair_style allegro3232
pair_coeff * * ../../h2o-behlerdataset-allegro-train-layerhpo-fe76_07-aug-2024/model-unwrapped_08-aug-2024.pth H O
# Or "model-wrapped_08-aug-2024.pth" if using frames from the wrapped dataset.

# 4) Visualization
thermo 1
thermo_style custom step time temp pe ke etotal epair ebond econserve fmax

# Also want to dump the CG minimization trajectory
dump mintraj all atom 1 minimization.lammpstrj

# 5) Run CG minimization, doing a single static balance first to print out subdomain cut locations.
balance 1.0 shift xyz 100 1.0
minimize 1.0e-8 1.0e-8 1000 1000000000
undump mintraj
reset_timestep 0 time 0.0

# PART B - MOLECULAR DYNAMICS
delete_atoms overlap 0.1 all all

# Logging
thermo 1

# Try to rebuild neighbor lists more often
neigh_modify every 1 delay 0 check yes binsize 10.0

# Also try to specify larger cutoff for ghost atoms to avoid losing atoms.
comm_modify mode single cutoff 10.0 vel yes

# Try specifying initial velocities for all atoms
velocity all create 298.0 4928459 dist gaussian

# Run MD in the NVT ensemble, with a Nosé-Hoover thermostat starting at 298.0 K.
fix mynose all nvt &
    temp 298.0 298.0 0.011

# Be sure to dump the MD trajectory
dump mdtraj all atom 1 mdtraj.lammpstrj
dump mdforces all custom 40 mdforces.lammpstrj id x y z vx vy vz fx fy fz 

timestep 0.0005

# Normal run, with a single balance first
balance 1.0 shift xyz 100 1.0
run 20000

undump mdtraj
undump mdforces
Example input geometry (frame 596 of the original dataset)
model-unwrapped_08-aug-2024.pth-frame596-kokkos/initial_frame.data (written by ASE) 

192 atoms
2 atom types
0.0                23.14461  xlo xhi
0.0                23.14461  ylo yhi
0.0                23.14461  zlo zhi


Atoms 

     1   2      5.9091500000000003      9.6715599999999995                 3.23007
     2   1      6.0198799999999997                 11.3194      2.4536199999999999
     3   1      5.4012099999999998      9.8348099999999992      4.9311199999999999
     4   2      22.080200000000001      1.6895899999999999                 13.5389
     5   1                 21.2834     0.52559500000000003      14.782999999999999
     6   1      20.552199999999999                 2.33968                 12.6881
     7   2                 11.1053                 14.1869      6.9613100000000001
     8   1      10.151300000000001                 13.8668      5.6778700000000004
     9   1                 12.3499                 12.8649                 6.86449
    10   2      16.049700000000001                 19.7988                 10.1698
    11   1      17.095400000000001                 18.6266      9.2250499999999995
    12   1      16.745999999999999      19.773900000000001                 11.8696
    13   2      7.2639100000000001                 14.3575      22.238099999999999
    14   1      6.4026899999999998                 14.9102      20.775099999999998
    15   1      8.9776199999999999                 14.8261      21.689800000000002
    16   2                 14.0701      11.291600000000001                 13.6586
    17   1      13.208399999999999                 12.4938                 12.5162
    18   1      13.355499999999999      11.535299999999999                 15.2211
    19   2      2.2179700000000002      6.5658799999999999      11.150600000000001
    20   1      2.0281199999999999                 7.61151                 12.1075
    21   1      2.2052399999999999                 5.03423                 12.0829
    22   2                 23.1968      17.582100000000001                 12.8947
    23   1      1.1530100000000001                 16.1249                 13.3102
    24   1      1.0125599999999999      18.971399999999999      12.323700000000001
    25   2      13.076599999999999      14.747400000000001     0.96582699999999999
    26   1                 14.6065      15.859400000000001      1.2677700000000001
    27   1                 13.7464      13.523099999999999      2.1084399999999999
    28   2      22.289999999999999      23.165099999999999      7.9897999999999998
    29   1      22.927099999999999                0.161326      9.6950800000000008
    30   1     0.51338300000000003      22.870999999999999      6.6900599999999999
    31   2                 18.6097                 13.7941      11.648199999999999
    32   1      18.463799999999999                 15.2698                 10.5924
    33   1      17.077100000000002                 13.6417                  12.535
    34   2      8.8639799999999997      22.649000000000001      3.5996299999999999
    35   1      10.118399999999999      21.665299999999998      4.2162199999999999
    36   1      9.3582199999999993    0.085460999999999995      1.9719199999999999
    37   2                 2.05017      16.757200000000001      4.8366499999999997
    38   1                 1.44848                 15.0244      5.2370700000000001
    39   1      3.0690900000000001      17.333600000000001      6.0353300000000001
    40   2      22.596299999999999                 10.3612                 14.7997
    41   1      21.734500000000001      8.8241099999999992                 14.8378
    42   1      21.906500000000001                 11.1448                 13.2875
    43   2      4.3652499999999996      18.384399999999999                 19.2408
    44   1      3.2898800000000001      16.997699999999998      18.443200000000001
    45   1      4.6726099999999997      19.560600000000001      17.828199999999999
    46   2      5.5318699999999996      9.4371899999999993                 7.85025
    47   1      6.7868599999999999      8.3062900000000006      8.1544899999999991
    48   1      3.9088400000000001      8.6284299999999998      8.6374399999999998
    49   2                 20.0229      4.7849899999999996      20.125499999999999
    50   1                 21.5364      5.3525200000000002      19.273299999999999
    51   1      18.720700000000001      5.8096899999999998      19.124700000000001
    52   2      13.950200000000001                 22.8736      21.118300000000001
    53   1      14.283899999999999      21.484200000000001      22.181999999999999
    54   1      12.941000000000001      22.204699999999999                  19.863
    55   2      20.478899999999999      6.6597200000000001      7.2229599999999996
    56   1      20.589700000000001      8.1250999999999998      8.3511699999999998
    57   1                  19.509      7.0342799999999999      5.7630600000000003
    58   2                 11.3894      3.5159799999999999     0.69348500000000002
    59   1      12.741899999999999      3.6259199999999998                  1.8147
    60   1                 12.0502      2.2654100000000001      22.613800000000001
    61   2                 4.29528      8.6966599999999996      18.260999999999999
    62   1      5.0197099999999999      7.1608200000000002                 19.0062
    63   1      5.0883900000000004      9.8124300000000009      17.133199999999999
    64   2      19.012699999999999                 15.0678                 21.8294
    65   1      20.098199999999999                  13.677                 21.0869
    66   1      19.755400000000002      16.159800000000001      22.837800000000001
    67   2                 18.5093                 18.2121                 14.6373
    68   1      18.847100000000001      19.651299999999999                 15.6972
    69   1      20.258400000000002      17.533999999999999                 14.4245
    70   2      20.737200000000001      6.5509399999999998      1.7896799999999999
    71   1      20.214300000000001      6.2219499999999996     0.10664999999999999
    72   1      20.418800000000001      5.1312300000000004                 2.89337
    73   2                 11.9665                 15.8748      12.234299999999999
    74   1                  11.057                 15.8642      10.595700000000001
    75   1      12.898199999999999                  17.384                 12.1503
    76   2      7.3605999999999998      15.648300000000001      3.7366000000000001
    77   1      6.3703000000000003                 15.7403      2.1144099999999999
    78   1      6.0848300000000002      15.904400000000001      5.0577399999999999
    79   2                 11.4945                 12.6968      18.411100000000001
    80   1                 11.9612      11.419499999999999                 19.6524
    81   1                 12.6492      14.269299999999999      18.424700000000001
    82   2      6.2639100000000001      4.2544399999999998                 21.3217
    83   1      4.6653799999999999      3.6882799999999998      22.102699999999999
    84   1      7.4899199999999997      4.5268499999999996      22.556100000000001
    85   2                 1.97488      2.7933599999999998     0.38653500000000002
    86   1      1.5985100000000001      2.8812000000000002      2.2288800000000002
    87   1     0.46284500000000001                 3.76661      22.937000000000001
    88   2      15.752800000000001                 1.73661      16.740600000000001
    89   1      15.033899999999999      2.6949800000000002                 15.4696
    90   1      14.998799999999999     0.69597399999999998                 18.0061
    91   2      1.0868500000000001      4.8048799999999998      17.550000000000001
    92   1      1.2302599999999999      6.5176400000000001      17.189800000000002
    93   1                0.442882      3.9655900000000002      16.058299999999999
    94   2      4.2452399999999999      23.068000000000001      6.1464999999999996
    95   1      4.5604500000000003      22.587299999999999      4.4141199999999996
    96   1      4.2884799999999998      1.8068200000000001      6.0832800000000002
    97   2      8.8760899999999996      20.350000000000001                 21.2438
    98   1      8.4095899999999997      18.556999999999999      21.336300000000001
    99   1      9.3341899999999995      20.768999999999998    -0.27107100000000001
   100   2      1.2605500000000001                 11.5184                 3.75576
   101   1  -0.0060600000000000003                 11.8523      2.5569799999999998
   102   1      2.7166100000000002      10.385400000000001                 2.84429
   103   2      21.119399999999999      18.853000000000002      1.2561500000000001
   104   1                 22.4404      18.017399999999999      2.3485900000000002
   105   1                 22.2773      19.497399999999999    0.028347000000000001
   106   2      13.842000000000001                   10.01      4.5879599999999998
   107   1      13.560499999999999      9.4886400000000002      2.7970199999999998
   108   1                 15.4298                 10.9476      4.7816700000000001
   109   2      20.032299999999999      18.190999999999999      7.6835699999999996
   110   1                 20.9389      20.223199999999999      7.5477999999999996
   111   1      21.233899999999998                 17.8385      9.0412300000000005
   112   2      9.7168899999999994      6.8385300000000004      8.7584900000000001
   113   1      9.2112200000000009                  5.1589      7.9785599999999999
   114   1      9.6004199999999997      6.3682999999999996      10.559699999999999
   115   2      7.6963800000000004                 12.7331                 14.7662
   116   1      7.3576100000000002      13.795999999999999      13.268700000000001
   117   1      9.4222699999999993                 12.6839                 15.1022
   118   2      20.122800000000002      22.088200000000001      17.957899999999999
   119   1                 19.1126                0.304678      18.636299999999999
   120   1      21.934100000000001      21.968599999999999      18.249700000000001
   121   2                 12.8415      19.258099999999999      5.8808800000000003
   122   1      13.239100000000001                 19.8367      7.6356799999999998
   123   1                 11.4961      18.447900000000001      6.2407700000000004
   124   2                 14.9815      16.554600000000001      18.867599999999999
   125   1      15.806100000000001      16.940799999999999      17.265499999999999
   126   1      17.521599999999999                 15.5898      21.355899999999998
   127   2                0.834036                 11.5718      21.333100000000002
   128   1      1.8740600000000001                 12.8818                 22.0655
   129   1                 1.97116      10.260199999999999      20.686599999999999
   130   2      9.2284900000000007      7.4919900000000004                 14.4091
   131   1      9.9167799999999993      7.5134299999999996                 15.6699
   132   1      8.8360599999999998      9.2631099999999993                 14.0289
   133   2                 11.2926      9.4404199999999996      23.383500000000002
   134   1                 11.2235      7.7517800000000001      23.410699999999999
   135   1      9.5579999999999998      9.8135600000000007     0.48632599999999998
   136   2      17.805299999999999      7.6026400000000001                 14.9404
   137   1      17.738900000000001      8.6731300000000005      13.872199999999999
   138   1      18.512699999999999      6.5323500000000001                 13.7552
   139   2                 13.8725      6.8589799999999999                 18.3369
   140   1                 14.7568      7.7961799999999997      16.747599999999998
   141   1                 14.2013      4.9680799999999996                  18.401
   142   2                 23.0228                 11.8605      8.8937100000000004
   143   1      23.116099999999999                 11.4161                 7.14086
   144   1                 21.6477                 13.0159      8.9233600000000006
   145   2      8.7789699999999993                 1.75627      9.8980899999999998
   146   1                 10.4735      1.3146899999999999                 10.7338
   147   1      8.6431199999999997     0.58794000000000002      8.4552099999999992
   148   2                 16.2333                 10.4991      21.402000000000001
   149   1      15.495200000000001      9.1909200000000002      20.715499999999999
   150   1                 14.9786      11.777100000000001                 21.8293
   151   2      19.203800000000001                 2.22892      4.6112099999999998
   152   1      18.985499999999998                 1.41039                 3.15361
   153   1                 19.7928      1.4284600000000001      6.0571200000000003
   154   2      5.6809399999999997      3.0451199999999998                 13.1189
   155   1      7.1078200000000002      2.3069199999999999      12.010300000000001
   156   1      6.3725800000000001                 4.65273                 13.7621
   157   2      4.4194199999999997      19.215199999999999                 10.3683
   158   1      5.1798700000000002                 20.2973      9.0558099999999992
   159   1      5.5698600000000003      17.738399999999999                 10.7018
   160   2                 14.1355      1.4335500000000001      10.208299999999999
   161   1                 13.9558      2.3745500000000002      8.6041399999999992
   162   1                 15.8993      23.871400000000001                 10.3802
   163   2      2.6869399999999999                 4.59931      5.1904700000000004
   164   1      3.9838100000000001      5.8780299999999999      4.4879899999999999
   165   1                  1.6192      5.6871299999999998      6.0968900000000001
   166   2                 18.5716      4.4131600000000004                 11.1807
   167   1      18.793900000000001      4.3240100000000004      9.4553799999999999
   168   1      16.422000000000001      3.7141500000000001                 11.2049
   169   2      14.390499999999999      4.1494999999999997                 5.31778
   170   1      16.127500000000001      4.6077500000000002      4.9381399999999998
   171   1                 13.5379      5.9167800000000002      5.6488399999999999
   172   2      11.933199999999999      20.552900000000001      14.781700000000001
   173   1      13.100099999999999      20.525500000000001      16.201599999999999
   174   1                 10.1973                 20.4255      15.465400000000001
   175   2                 18.6493      12.014699999999999                 3.00224
   176   1      19.017399999999999      12.237500000000001                 1.32315
   177   1      19.223800000000001      10.126899999999999      2.9812400000000001
   178   2      6.4923799999999998                 14.4697                 10.1922
   179   1      5.3334700000000002      14.067399999999999      8.8764900000000004
   180   1      7.8080600000000002      13.923400000000001      9.6611799999999999
   181   2                 6.55314                 21.3005      15.272399999999999
   182   1      6.0889600000000002      20.022300000000001      13.979799999999999
   183   1      5.2370400000000004      22.694900000000001      15.271599999999999
   184   2      16.850000000000001      21.171299999999999      2.4119799999999998
   185   1                 17.9634      19.734100000000002                 2.57178
   186   1                 15.6921      20.852900000000002      3.6812200000000002
   187   2                 1.12944      21.249300000000002      21.300999999999998
   188   1      2.2810199999999998      19.896999999999998      20.537099999999999
   189   1      1.8933500000000001                 22.6784                 22.1388
   190   2      2.6025700000000001      14.353199999999999      16.236699999999999
   191   1      4.3327799999999996                  13.993      16.651299999999999
   192   1      1.8016700000000001      12.796900000000001                 15.3421

I ran these 8 jobs both with and without Kokkos, like this:

Example job script
#!/bin/bash
#SBATCH --job-name=model-unwrapped_08-aug-2024.pth-frame607
#SBATCH --account=...
#SBATCH --qos=debug
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=128
#SBATCH --exclusive
#SBATCH --time=10:00
#SBATCH --error=vt_lammps%j.err
#SBATCH --output=vt_lammps%j.out
#SBATCH --mail-user=
#SBATCH --mail-type=ALL
#
#SBATCH --open-mode=append

# OpenMP parallelization
export OMP_NUM_THREADS=1
export OMP_PLACES=cores
export OMP_PROC_BIND=spread

# By default, prefer the GCC10 build of LAMMPS + pair_allegro
module load lammps-tpc/2Aug23/gcc10-allegro-cpu

# Ensure that the stack size is unlimited, or you may get a segfault when
# attempting to run an MPI job.
ulimit -s unlimited
ulimit -S unlimited
ulimit -H unlimited

srun lmp -in input.lammps
# or "srun lmp -k on -sf kk -pk kokkos neigh full -in input.lammps", if running with Kokkos.

Nearly all cases result in the Torch reshape error after the simulation has proceeded for some number of steps. (The one case that does not instead ends in a segfault.)

Job                                                 Last MD or minimization step  Torch reshape error?
model-wrapped_08-aug-2024.pth-frame607-kokkos       321                           True
model-wrapped_08-aug-2024.pth-frame607              1741                          True
model-wrapped_08-aug-2024.pth-frame596-kokkos       31                            True
model-wrapped_08-aug-2024.pth-frame596              454                           True
model-wrapped_08-aug-2024.pth-frame1351-kokkos      270                           True
model-wrapped_08-aug-2024.pth-frame1351             616                           True
model-wrapped_08-aug-2024.pth-frame1252-kokkos      179                           True
model-wrapped_08-aug-2024.pth-frame1252             1497                          True
model-unwrapped_08-aug-2024.pth-frame607-kokkos     330                           True
model-unwrapped_08-aug-2024.pth-frame607            271                           True
model-unwrapped_08-aug-2024.pth-frame596-kokkos     31                            True
model-unwrapped_08-aug-2024.pth-frame596            1744                          True
model-unwrapped_08-aug-2024.pth-frame1351-kokkos    290                           False
model-unwrapped_08-aug-2024.pth-frame1351           754                           True
model-unwrapped_08-aug-2024.pth-frame1252-kokkos    182                           True
model-unwrapped_08-aug-2024.pth-frame1252           1734                          True

Additionally, I examined the domain decomposition chosen for one typical job, using the X, Y, and Z cut locations printed to the screen by the LAMMPS balance command to count how many atoms were in each domain at each frame of the simulation. While the limited precision of these cut points undoubtedly causes some rounding error, I was surprised to find that 32 of the 128 domains were already empty at the first frame of the MD simulation. So it may be more complex than a domain simply going empty at a later point in the simulation.
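
The counting itself was along these lines (a sketch; the positions file and cut locations below are placeholders, not the actual values):

import numpy as np

# Assign each atom to a brick-shaped subdomain using the x/y/z cut locations
# printed by the LAMMPS balance command (including the box bounds), then
# count atoms per subdomain and report how many subdomains are empty.
positions = np.loadtxt("frame_positions.txt")       # placeholder (N, 3) array
xcuts = np.array([0.0, 5.79, 11.57, 17.36, 23.14])  # placeholder cut locations
ycuts = np.array([0.0, 11.57, 23.14])
zcuts = np.array([0.0, 11.57, 23.14])

ix = np.digitize(positions[:, 0], xcuts[1:-1])      # subdomain index per axis
iy = np.digitize(positions[:, 1], ycuts[1:-1])
iz = np.digitize(positions[:, 2], zcuts[1:-1])

counts = np.zeros((len(xcuts) - 1, len(ycuts) - 1, len(zcuts) - 1), dtype=int)
np.add.at(counts, (ix, iy, iz), 1)
print("empty subdomains:", int((counts == 0).sum()))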

However, my limited testing does support the idea that empty domains make the simulation more likely to crash with a Torch reshape error. On another system I'm researching, with ~500 atoms (water-solvated transition metal atoms), the number of steps my MD simulation completes before a Torch reshape crash is roughly inversely proportional to the number of domains:

  • 64 MPI tasks: 845 steps
  • 32 MPI tasks: 1644 steps
  • 16 MPI tasks: O(5000) steps by the time the job times out at 10 min.

This holds even though using fewer domains, for some reason, tends to produce different results for the pre-MD conjugate gradient minimization: 16 MPI tasks produce a minimized geometry with the atoms concentrated in one half of the box, while larger numbers of MPI tasks produce a more uniformly distributed geometry. The fact that the 16-task job does not hit a Torch reshape error even at O(5000) steps makes empty domains seem a more likely cause of this error.

I'm not sure what else to try. I've tried things like forcing a domain rebalance after each MD step and increasing the neighbor list and ghost atom communication cutoffs, but I'm still encountering Torch reshape errors for all but the smallest number of domains.

Do you have any guidance on what to try, or tests you'd like me to run?

Thanks!


Edit: running larger systems with the same domain decomposition seems to work. The water system above is 64 waters in a box; if I instead replicate it to 3 copies in each dimension (1728 waters), I can run 20k steps on 128 MPI tasks with no Torch reshape error, both with and without Kokkos.
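
For the record, the replicated box can be built along these lines (a sketch; the frame index and output filename are illustrative). LAMMPS's replicate command achieves the same from within the input script.

from ase.io import read, write

# Make a 3x3x3 supercell of one dataset frame (64 -> 1728 waters) and write
# it as a LAMMPS data file for the same input script.
atoms = read("dataset_1593.xyz", index=596)
supercell = atoms.repeat((3, 3, 3))
write("initial_frame_3x3x3.data", supercell, format="lammps-data")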
