Simulated annealing calculation error using pair-allegro #40

Closed
walker9564 opened this issue May 4, 2024 · 0 comments

OS: CentOS Linux release 7.9.2009 (Core)
Compiler: GCC 13.2.0
CPU: Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz
NUMA node(s): 2
PyTorch: 1.12.0
LAMMPS version: 2021.09 release
MPI: Intel Parallel Studio XE 2019

When I executed the simulated annealing algorithm on small clusters, I got the following error.

LAMMPS (29 Sep 2021)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
units metal
atom_style atomic
boundary p p p

newton on

read_data in.data
Reading data file ...
orthogonal box = (0.0000000 0.0000000 0.0000000) to (20.000000 20.000000 20.000000)
1 by 1 by 1 MPI processor grid
reading atoms ...
12 atoms
read_data CPU = 0.003 seconds
#read_restart file.restart.100000

pair_style allegro
pair_coeff * * fe-total.pth Fe

timestep 0.001 # ps

thermo_style custom step dt time temp ke pe etotal press vol
thermo 20
dump 1 all custom 200 dump.lammpstrj id type x y z
restart 100000 file.restart
fix s1 all nvt temp 0.01 1000 $(100.0*dt)
fix s1 all nvt temp 0.01 1000 0.10000000000000000555
run 30000
Neighbor list info ...
update every 1 steps, delay 10 steps, check yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 8
ghost atom cutoff = 8
binsize = 4, bins = 5 5 5
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair allegro, perpetual
attributes: full, newton on, ghost
pair build: full/bin/ghost
stencil: full/ghost/bin/3d
bin: standard
Per MPI rank memory allocation (min/avg/max) = 4.315 | 4.315 | 4.315 Mbytes
Step Dt Time Temp KinEng PotEng TotEng Press Volume
0 0.001 0 0 0 -77.797695 -77.797695 0 8000
.......
.......
.......
470920 0.001 470.92 676.16539 0.9614136 -83.998843 -83.03743 128.36286 8000
470940 0.001 470.94 668.32156 0.95026076 -83.998562 -83.048301 126.87379 8000
470960 0.001 470.96 676.39779 0.96174404 -83.99844 -83.036696 128.40698 8000

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 18750 RUNNING AT node02
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 18750 RUNNING AT node02
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

Intel(R) MPI Library troubleshooting guide:
https://software.intel.com/node/561764

The input file is as follows.
units metal
atom_style atomic
boundary p p p
newton on
read_data in.data
#read_restart file.restart.100000

pair_style allegro
pair_coeff * * fe-total.pth Fe

timestep 0.001 # ps
thermo_style custom step dt time temp ke pe etotal press vol
thermo 20
dump 1 all custom 200 dump.lammpstrj id type x y z
restart 100000 file.restart
fix s1 all nvt temp 0.01 1000 $(100.0*dt)
run 30000
unfix s1
fix s2 all nvt temp 1000 1000 $(100.0*dt)
run 100000
unfix s2
fix s3 all nvt temp 1000 50 $(100.0*dt)
run 6000000
unfix s3
write_data out.data

The run did not complete. The script calls for 6130000 time steps in total (30000 + 100000 + 6000000), but the job terminates at around step 470000 with the error message shown above.
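As a sanity check on the step counts, here is a minimal Python sketch that adds up the three stages and works out which stage the last logged step (470960) falls in. The stage values are copied from the input file above; the `locate` helper is only for illustration, and the target temperature assumes that fix nvt ramps linearly from its start to its stop temperature over each run.

```python
# Stages copied from the input file: (fix ID, T_start in K, T_stop in K, steps)
STAGES = [
    ("s1", 0.01, 1000.0, 30_000),
    ("s2", 1000.0, 1000.0, 100_000),
    ("s3", 1000.0, 50.0, 6_000_000),
]


def locate(step):
    """Return (fix ID, steps into that stage, target temperature) for a global step."""
    start = 0
    for name, t_start, t_stop, nsteps in STAGES:
        if step <= start + nsteps:
            local = step - start
            # Assumption: the thermostat target is ramped linearly over the stage.
            target = t_start + (t_stop - t_start) * local / nsteps
            return name, local, target
        start += nsteps
    raise ValueError(f"step {step} is beyond the end of the schedule")


total_steps = sum(stage[3] for stage in STAGES)
print("total steps:", total_steps)

# Last step printed in the log before the crash.
stage, local, target = locate(470_960)
print(f"step 470960 -> fix {stage}, {local} steps into the stage, target T ~ {target:.1f} K")
```

This places the crash inside the third stage (the cooling run), which agrees with the `n=6000000` argument of `LAMMPS_NS::Verlet::run` in the backtrace.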
I then tried to analyze the crash with GDB, but I am not very familiar with it.

The backtrace is as follows.

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) where
#0 0x0000000000000000 in ?? ()
#1 0x00007fffe0ff25ad in torch::jit::InterpreterStateImpl::callstack() const () from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#2 0x00007fffe0ff3e8e in torch::jit::InterpreterStateImpl::handleError(std::exception const&, bool, c10::NotImplementedError*, c10::optional<std::string>) ()
from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#3 0x00007fffe1000fd0 in torch::jit::InterpreterStateImpl::runImpl(std::vector<c10::IValue, std::allocator<c10::IValue> >&) () from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#4 0x00007fffe0fee44f in torch::jit::InterpreterState::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) () from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#5 0x00007fffe0fe167a in torch::jit::GraphExecutorImplBase::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) () from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#6 0x00007fffe0c90ade in torch::jit::Method::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::string, c10::IValue, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, c10::IValue> > > const&) const () from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#7 0x00000000006f3496 in torch::jit::Module::forward (this=this@entry=0x2c83a38, inputs=..., kwargs=...) at /opt/software/python3/lib/python3.7/site-packages/torch/include/torch/csrc/jit/api/module.h:114
#8 0x00000000006ef443 in LAMMPS_NS::PairAllegro::compute (this=0x2c836c0, eflag=<optimized out>, vflag=<optimized out>) at /opt/source/lammps-stable_29Sep2021/src/pair_allegro.cpp:426
#9 0x00000000005379fb in LAMMPS_NS::Verlet::run (this=0x2c82c60, n=6000000) at /opt/source/lammps-stable_29Sep2021/src/verlet.cpp:312
#10 0x00000000004f291b in LAMMPS_NS::Run::command (this=<optimized out>, narg=<optimized out>, arg=<optimized out>) at /opt/source/lammps-stable_29Sep2021/src/run.cpp:180
#11 0x0000000000448614 in LAMMPS_NS::Input::execute_command (this=0x2c68cd0) at /opt/source/lammps-stable_29Sep2021/src/input.cpp:794
#12 0x0000000000448c2c in LAMMPS_NS::Input::file (this=0x2c68cd0) at /opt/source/lammps-stable_29Sep2021/src/input.cpp:273
#13 0x00000000004235a8 in main (argc=<optimized out>, argv=<optimized out>) at /opt/source/lammps-stable_29Sep2021/src/main.cpp:98

The output reports a segmentation fault, but I'm not sure how to solve this problem. I hope you can help. Thanks!
