allegro_pair style and empty partitions #45
Hi, this is certainly not the intended behavior, and it is a bug that we'll fix. I think it should already work if using Kokkos. You may, however, also want to avoid having completely empty domains in your simulations. LAMMPS has functionality for resizing domains using
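A hedged sketch of the kind of LAMMPS commands this likely refers to for keeping subdomains populated, namely the standard processors, balance, and fix balance commands; the thresholds and intervals below are illustrative assumptions, not maintainer recommendations:
# Sketch only: ways to keep MPI subdomains populated in a sparse/vacuum system.
# All numeric values are illustrative.
# Constrain the processor grid so the box is not cut along a sparse direction,
# e.g. force a single layer of subdomains along z.
processors * * 1
# One-time static rebalance: shift cuts along x, y, and z (up to 20 iterations)
# until the max per-processor atom count is within 10% of the average.
balance 1.1 shift xyz 20 1.1
# Dynamic rebalancing during MD: re-check every 100 steps and shift cuts
# whenever the imbalance factor exceeds 1.1.
fix lb all balance 100 1.1 shift xyz 20 1.1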
…also force C++17 because of insert_or_assign. Compiles but untested.
Thanks for the response! Of course, it was just something I noticed that happened rarely in a vacuum simulation :). I will check out those LAMMPS options.
I'm experiencing this behavior with the latest "multicut" branch. Even when running with a single MPI process (which I would think would prevent empty domains), I'm still getting this "cannot reshape tensor" error. Any ideas on what might be happening?
Packages used to build pair_allegro
cmake command to build LAMMPS after patching with the pair_allegro "multicut" branch
cmake ../cmake \
-D CMAKE_BUILD_TYPE=Debug \
-D LAMMPS_EXCEPTIONS=ON \
-D BUILD_SHARED_LIBS=ON \
-D BUILD_MPI=yes \
-D BUILD_OMP=yes \
-C ../cmake/presets/kokkos-openmp.cmake \
-D PKG_KOKKOS=yes \
-D Kokkos_ARCH_ZEN3=yes \
-D BUILD_TOOLS=no \
-D FFT=FFTW3 \
-D FFT_KOKKOS=FFT3W \
-D FFTW3_INCLUDE_DIR=$AOCL_ROOT/include \
-D FFTW3_LIBRARY=$AOCL_LIB/libfftw3.so \
-D FFTW3_OMP_LIBRARY=$AOCL_LIB/libfftw3_omp.so \
-D CMAKE_INSTALL_PREFIX="$LAMMPS_ROOT" \
-D PKG_MANYBODY=yes \
-D PKG_MOLECULE=yes \
-D PKG_KSPACE=yes \
-D PKG_REPLICA=yes \
-D PKG_ASPHERE=yes \
-D PKG_RIGID=yes \
-D PKG_MPIIO=yes \
-D PKG_COMPRESS=yes \
-D PKG_H5MD=no \
-D PKG_OPENMP=yes \
-D CMAKE_POSITION_INDEPENDENT_CODE=yes \
-D CMAKE_EXE_FLAGS="-dynamic" \
-D FFT_FFTW_THREADS=on
Input geometry (aspirin-with-topo.data)
LAMMPS input file (input.lammps)
# PART A - ENERGY MINIMIZATION
# 1) Initialization
units metal
dimension 3
atom_style full
boundary p p p
# 2) System definition
read_data aspirin-with-topo.data
# 3) Simulation settings
pair_style allegro3232 # Accidentally trained with float32 dtypes
pair_coeff * * aspirin-model_11-jul-2024.pth C H O
# 4) Visualization
thermo 1
thermo_style custom step time temp pe ke etotal epair ebond econserve fmax
# Also want to dump the CG minimization trajectory
dump mintraj all atom 1 minimization.lammpstrj
# 5) Run
minimize 1.0e-4 1.0e-6 1000 10000
undump mintraj
# PART B - MOLECULAR DYNAMICS
delete_atoms overlap 0.1 all all
# Logging
thermo 10
# Try to rebuild neighbor lists more often
neigh_modify every 1 delay 0 check yes
# Run MD
fix mynve all nve
fix mylgv all langevin 1.0 1.0 0.1 1530917
# Be sure to dump the MD trajectory
dump mdtraj all atom 1 mdtraj.lammpstrj
dump mdforces all custom 1 mdforces.lammpstrj x y z vx vy vz fx fy fz
timestep 0.5
run 1000
undump mdtraj
Command to run LAMMPS
srun lmp -k on -sf kk -pk kokkos neigh full -in input.lammps
LAMMPS output
LAMMPS stderr
Hi @samueldyoung29ctr, this should be fixed now but on
@Linux-cpp-lisp, thanks for the tip. I think I actually had a bad timestep and/or starting geometry. After fixing things, my simple Allegro model appears to evaluate for
Great! Glad it's resolved. (You can probably stay on main then, since it is the latest.)
Hi @Linux-cpp-lisp, I'm running into this error again and am looking for some more guidance. This time, I tried a simpler system of water in a box. I trained an Allegro model on this dataset (
NequIP configuration for unwrapped dataset. The configuration for the wrapped dataset differs only in the
BesselBasis_trainable: false
PolynomialCutoff_p: 48
append: true
ase_args:
format: traj
avg_num_neighbors: auto
batch_size: 1
chemical_symbols:
- H
- O
dataset: ase
dataset_file_name: ./data.traj
dataset_seed: 123456
default_dtype: float32
early_stopping_lower_bounds:
LR: 1.0e-05
early_stopping_patiences:
validation_loss: 1000
early_stopping_upper_bounds:
cumulative_wall: 604800.0
edge_eng_mlp_initialization: uniform
edge_eng_mlp_latent_dimensions:
- 32
edge_eng_mlp_nonlinearity: null
ema_decay: 0.99
ema_use_num_updates: true
embed_initial_edge: true
env_embed_mlp_initialization: uniform
env_embed_mlp_latent_dimensions: []
env_embed_mlp_nonlinearity: null
env_embed_multiplicity: 64
l_max: 2
latent_mlp_initialization: uniform
latent_mlp_latent_dimensions:
- 64
- 64
- 64
- 64
latent_mlp_nonlinearity: silu
latent_resnet: true
learning_rate: 0.001
loss_coeffs:
forces: 1.0
total_energy:
- 1.0
- PerAtomMSELoss
lr_scheduler_factor: 0.5
lr_scheduler_name: ReduceLROnPlateau
lr_scheduler_patience: 25
max_epochs: 1000000
metrics_key: validation_loss
model_builders:
- allegro.model.Allegro
- PerSpeciesRescale
- StressForceOutput
- RescaleEnergyEtc
n_train: 1434
n_val: 159
num_layers: 1
optimizer_name: Adam
optimizer_params:
amsgrad: false
betas: !!python/tuple
- 0.9
- 0.999
eps: 1.0e-08
weight_decay: 0.0
parity: o3_full
r_max: 4.0
root: <root name>
run_name: <run name>
seed: 123456
shuffle: true
train_val_split: random
two_body_latent_mlp_initialization: uniform
two_body_latent_mlp_latent_dimensions:
- 32
- 64
two_body_latent_mlp_nonlinearity: silu
use_ema: true
verbose: debug
wandb: true
wandb_project: <project name>
The Conda environment I am using to train has NequIP 0.6.0 and Allegro symlinked as a development package at commit 22f673c.
Conda environment used to train Allegro model
Training was done on Nvidia A100 GPUs. The training converges quickly since this dataset doesn't have very large forces on the atoms. I then deployed the models to standalone format:
nequip-deploy build --train-dir "<path to train dir of model for unwrapped data>" model-unwrapped_08-aug-2024.pth
nequip-deploy build --train-dir "<path to train dir of model for wrapped data>" model-wrapped_08-aug-2024.pth
I then used the standalone model files to run LAMMPS jobs on a different cluster where I have more compute time. I compiled LAMMPS for CPU with pair_allegro and Kokkos (a combination which apparently is not yet available on NERSC). I used the 2Aug2023 version of LAMMPS and, based on your previous advice, patched it with pair_allegro commit 20538c9, which is the current
My CMake generation command
cmake ../cmake \
-D CMAKE_BUILD_TYPE=Debug \
-D LAMMPS_EXCEPTIONS=ON \
-D BUILD_SHARED_LIBS=ON \
-D BUILD_MPI=yes \
-D BUILD_OMP=yes \
-C ../cmake/presets/kokkos-openmp.cmake \
-D PKG_KOKKOS=yes \
-D Kokkos_ARCH_ZEN3=yes \
-D BUILD_TOOLS=no \
-D FFT=FFTW3 \
-D FFT_KOKKOS=FFT3W \
-D FFTW3_INCLUDE_DIR=$AOCL_ROOT/include \
-D FFTW3_LIBRARY=$AOCL_LIB/libfftw3.so \
-D FFTW3_OMP_LIBRARY=$AOCL_LIB/libfftw3_omp.so \
-D CMAKE_INSTALL_PREFIX="$LAMMPS_ROOT" \
-D PKG_MANYBODY=yes \
-D PKG_MOLECULE=yes \
-D PKG_KSPACE=yes \
-D PKG_REPLICA=yes \
-D PKG_ASPHERE=yes \
-D PKG_RIGID=yes \
-D PKG_MPIIO=yes \
-D PKG_COMPRESS=yes \
-D PKG_H5MD=no \
-D PKG_OPENMP=yes \
-D CMAKE_POSITION_INDEPENDENT_CODE=yes \
-D CMAKE_EXE_FLAGS="-dynamic" \
-D FFT_FFTW_THREADS=on
I set up several LAMMPS jobs, using four randomly selected frames of that dataset as initial geometries.
Example LAMMPS input script (input.lammps)
# PART A - ENERGY MINIMIZATION
# 1) Initialization
units metal
dimension 3
atom_style atomic
boundary p p p
# 2) System definition
# initial_frame.data will be written into the working directory where this
# script is located.
read_data initial_frame.data
# 3) Simulation settings
# pair_style lj/cut 2.5
mass 1 2.016
mass 2 15.999
pair_style allegro3232
pair_coeff * * ../../h2o-behlerdataset-allegro-train-layerhpo-fe76_07-aug-2024/model-unwrapped_08-aug-2024.pth H O
# Or "model-wrapped_08-aug-2024.pth" if using frames from the wrapped dataset.
# 4) Visualization
thermo 1
thermo_style custom step time temp pe ke etotal epair ebond econserve fmax
# Also want to dump the CG minimization trajectory
dump mintraj all atom 1 minimization.lammpstrj
# 5) Run CG minimization, doing a single static balance first to print out subdomain cut locations.
balance 1.0 shift xyz 100 1.0
minimize 1.0e-8 1.0e-8 1000 1000000000
undump mintraj
reset_timestep 0 time 0.0
# PART B - MOLECULAR DYNAMICS
delete_atoms overlap 0.1 all all
# Logging
thermo 1
# Try to rebuild neighbor lists more often
neigh_modify every 1 delay 0 check yes binsize 10.0
# Also try to specify larger cutoff for ghost atoms to avoid losing atoms.
comm_modify mode single cutoff 10.0 vel yes
# Try specifying initial velocities for all atoms
velocity all create 298.0 4928459 dist gaussian
# Run MD in the NVT ensemble, with a Nosé-Hoover thermostat starting at 298.0 K.
fix mynose all nvt &
temp 298.0 298.0 0.011
# Be sure to dump the MD trajectory
dump mdtraj all atom 1 mdtraj.lammpstrj
dump mdforces all custom 40 mdforces.lammpstrj id x y z vx vy vz fx fy fz
timestep 0.0005
# Normal run, with a single balance first
balance 1.0 shift xyz 100 1.0
run 20000
undump mdtraj
undump mdforces
Example input geometry (frame 596 of the original dataset)
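As a side note on inspecting the decomposition produced by the static balance calls in the script above: the LAMMPS balance (and fix balance) commands accept an out keyword that writes each processor's subdomain to a file, which can be easier to work with than the cut locations printed to the screen. A hedged sketch; the output filename is an arbitrary assumption:
# Sketch: same static rebalance as in the script above, but also write the
# resulting processor subdomain boundaries to a file for inspection.
balance 1.0 shift xyz 100 1.0 out subdomains.balance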
I ran these 8 jobs both with and without Kokkos, like this:
Example job script
#!/bin/bash
#SBATCH --job-name=model-unwrapped_08-aug-2024.pth-frame607
#SBATCH --account=...
#SBATCH --qos=debug
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=128
#SBATCH --exclusive
#SBATCH --time=10:00
#SBATCH --error=vt_lammps%j.err
#SBATCH --output=vt_lammps%j.out
#SBATCH --mail-user=
#SBATCH --mail-type=ALL
#
#SBATCH --open-mode=append
# OpenMP parallelization
export OMP_NUM_THREADS=1
export OMP_PLACES=cores
export OMP_PROC_BIND=spread
# By default, prefer the GCC10 build of LAMMPS + pair_allegro
module load lammps-tpc/2Aug23/gcc10-allegro-cpu
# Ensure that stack size is unlimited, or you may get a segfault error when
# attempting to run an MPI job.
ulimit -s unlimited
ulimit -S unlimited
ulimit -H unlimited
srun lmp -in input.lammps
# or "srun lmp -k on -sf kk -pk kokkos neigh full -in input.lammps", if running with Kokkos. Nearly all cases results in the Torch reshape error, after the simulation has proceeded for some number of steps. (The one case where it has not is a segfault.)
Additionally, I examined the domain decomposition chosen for one typical job, using the X, Y, and Z cut locations printed to screen by the LAMMPS balance command.
However, my limited testing does support the idea that empty domains make the simulation more likely to crash with a Torch reshape error. On another system I'm researching with ~500 atoms (water-solvated transition-metal atoms), I see approximate inverse proportionality between the number of steps my MD simulation completes before a Torch reshape crash and the number of domains:
This happens even though using fewer domains, for some reason, tends to produce different results for the pre-MD conjugate gradient minimization. 16 MPI tasks produce a minimized geometry with atoms concentrated in one half of the square box, while larger numbers of MPI tasks produce a more uniformly distributed geometry; the fact that the 16-task job does not hit a Torch reshape error even at O(5000) steps makes empty domains seem a more likely cause of this error.
I'm not sure what else to try. I've tried things like forcing a domain rebalance after each MD step and increasing the neighbor-list and ghost-atom communication cutoffs, but I'm still encountering Torch reshape errors for all but the smallest numbers of domains. Do you have any guidance on what to try, or tests you'd like me to run? Thanks!
Edit: running larger systems with the same domain decomposition seems to work. The water system above is 64 waters in a box. If I instead replicate to 3 copies in each dimension (1728 waters), I can run 20k steps on 128 MPI tasks with no Torch reshape error, both with and without Kokkos.
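For reference, a minimal sketch of the replication workaround described in the edit above, using the standard LAMMPS replicate command; the 3 3 3 factors match the 64-water to 1728-water example, and the data file name is taken from the input script above:
# Sketch: enlarge the 64-water cell so every MPI subdomain stays populated.
# replicate must come after the box is defined (read_data) and before any run.
read_data initial_frame.data
replicate 3 3 3    # 3 copies per dimension -> 27x the atoms (1728 waters)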
Hello all,
Very happy with the tools - thank you for maintaining them and integrating with LAMMPS. I am running some vacuum simulations and, while increasing the number of GPUs (and MPI ranks), I ran into the following issue:
It seems that it's related to the way LAMMPS partitions the simulation box with an increasing number of processes:
https://docs.lammps.org/Developer_par_part.html
If the rank/partition running the model contains no atoms, the network potential cannot accept the zero-size tensor. Is this the intended behaviour of Allegro models in LAMMPS?
Happy to run some more tests/provide more information if needed.
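For anyone wanting to see the empty-partition situation without a trained model, here is a minimal hedged sketch: a small cluster of atoms at the center of a much larger vacuum box, run on enough MPI ranks that some subdomains contain no atoms. The pair_style zero here is only a placeholder for the Allegro pair style, and all values are illustrative:
# Sketch: a small cluster in a large vacuum box. With LAMMPS's brick
# decomposition across many MPI ranks, most subdomains contain zero atoms.
units        lj
dimension    3
boundary     p p p
region       box block -50 50 -50 50 -50 50
create_box   1 box
# All atoms sit in a small sphere at the center of the otherwise empty box.
region       cluster sphere 0.0 0.0 0.0 5.0
create_atoms 1 random 200 12345 cluster
mass         1 1.0
# Placeholder interaction; the actual issue uses pair_style allegro with a
# deployed model, which is where the zero-size tensor error appears.
pair_style   zero 2.5
pair_coeff   * *
# Run across many ranks, e.g.: mpirun -np 64 lmp -in empty_partitions.in
run          0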