AssertionError thermostat.shape #405

rsdmse · 2024-07-12T13:18:57Z

I'm reaching out on behalf of a user on our cluster. Halfway through an OTF training run with SGP_Wrapper, the job terminates with this error:

Traceback (most recent call last):
  File "/apps/software/standard/mpi/gcc/11.4.0/openmpi/4.1.4-nofabric/lammps_flare/20220623_1.3.0/bin/flare-otf", line 8, in <module>
    sys.exit(main())
  File "/apps/software/standard/mpi/gcc/11.4.0/openmpi/4.1.4-nofabric/lammps_flare/20220623_1.3.0/lib/python3.8/site-packages/flare/scripts/otf_train.py", line 372, in main
    fresh_start_otf(config)
  File "/apps/software/standard/mpi/gcc/11.4.0/openmpi/4.1.4-nofabric/lammps_flare/20220623_1.3.0/lib/python3.8/site-packages/flare/scripts/otf_train.py", line 339, in fresh_start_otf
    otf.run()
  File "/apps/software/standard/mpi/gcc/11.4.0/openmpi/4.1.4-nofabric/lammps_flare/20220623_1.3.0/lib/python3.8/site-packages/flare/learners/otf.py", line 433, in run
    self.md_step()  # update positions by Verlet
  File "/apps/software/standard/mpi/gcc/11.4.0/openmpi/4.1.4-nofabric/lammps_flare/20220623_1.3.0/lib/python3.8/site-packages/flare/learners/otf.py", line 532, in md_step
    self.md.step(tol, self.number_of_steps)
  File "/apps/software/standard/mpi/gcc/11.4.0/openmpi/4.1.4-nofabric/lammps_flare/20220623_1.3.0/lib/python3.8/site-packages/flare/md/lammps.py", line 289, in step
    self.backup(trj)
  File "/apps/software/standard/mpi/gcc/11.4.0/openmpi/4.1.4-nofabric/lammps_flare/20220623_1.3.0/lib/python3.8/site-packages/flare/md/lammps.py", line 315, in backup
    assert thermostat.shape[0] == 2 * len(curr_trj) - 2 * n_iters
AssertionError

The tmp/log_<DATE> file looks normal; it ends with:

if '$(c_MaxUnc) > 0.05' then quit
quit

What could be causing this, and what should we be looking out for? (If you need to see the input files, I'll have to ask the user for permission.)
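To make the failure easier to diagnose, here is a minimal sketch that reports both sides of the comparison from the last traceback frame instead of a bare AssertionError. The function name and the three integer arguments are hypothetical; the only thing taken from the traceback is the expression `2 * len(curr_trj) - 2 * n_iters`, and I am assuming `thermostat.shape[0]` is the number of thermo rows read back from the LAMMPS run.

```python
def check_thermo_rows(n_thermo_rows, n_frames, n_iters):
    """Mirror the check from flare/md/lammps.py (backup) on plain integers.

    n_thermo_rows stands in for thermostat.shape[0], n_frames for
    len(curr_trj); both mappings are assumptions, not FLARE's API.
    """
    expected = 2 * n_frames - 2 * n_iters
    return expected, n_thermo_rows == expected


# Hypothetical numbers, only to show the call.
expected, ok = check_thermo_rows(n_thermo_rows=18, n_frames=11, n_iters=1)
print(f"expected {expected} thermo rows, got 18, match: {ok}")
```

Knowing which side is larger (extra thermo rows versus missing trajectory frames) would tell us whether to look at the log output or at the dump files.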

I also have a general question about OTF's alternating MD (LAMMPS) and DFT (VASP) workflow under Slurm. Because the DFT step is the most expensive, the user has to request far more resources than the MD step needs. For instance, the failing job contains 100 atoms and is submitted to run on a few hundred cores. From what I've read (e.g. in this issue the developer recommended 40 cores for 62k atoms), having too many cores can be problematic. We are not seeing any hangs, but the performance is very poor (17 timesteps/s) for such a small system. Do you have any suggestions for improving the performance and overall efficiency of the OTF workflow?
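One approach we could try is to launch each step as its own Slurm job step with a different rank count inside the same allocation. Below is a minimal sketch of building such commands, assuming the OTF configuration lets the MD and DFT launch commands be set separately (I have not confirmed this). The helper name, the executables `lmp` and `vasp_std`, the input file name, and the rank counts are all placeholders.

```python
import shlex


def build_srun_command(executable, ntasks, extra_args=()):
    """Return an srun job-step command restricted to `ntasks` MPI ranks."""
    if ntasks < 1:
        raise ValueError("ntasks must be at least 1")
    return ["srun", "--ntasks", str(ntasks), executable, *extra_args]


# Placeholder values: a handful of ranks for the 100-atom MD step,
# the bulk of the allocation for the DFT step.
md_cmd = build_srun_command("lmp", 8, ["-in", "in.lammps"])
dft_cmd = build_srun_command("vasp_std", 256)

print(shlex.join(md_cmd))
print(shlex.join(dft_cmd))
```

The idea is only that the MD step uses a small subset of the allocated cores while the DFT step uses all of them; whether this helps would need to be measured on the actual system.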

rsdmse · 2024-07-12T13:47:24Z

I forgot to mention that we're using Flare 1.3.0 and LAMMPS 23Jun2022. Should we upgrade to the latest versions of Flare and LAMMPS?
