As an update to this issue, the binary files should be written in units of Reals, not bytes, both to increase the maximum size that can be written in a single call and to get rid of what looks like an unnecessary memcpy. This should be done as part of an overall update of the IOWrapper class to read/write any type, which was started when particle outputs were added.
In GitLab by @c-white on Dec 2, 2023, 10:57
I just want to check that there isn't a subtle possibility of a hanging bug. Consider the code for outputting full dumps:
https://gitlab.com/theias/hpc/jmstone/athena-parthenon/athenak/-/blob/master/src/outputs/binary.cpp#L226-261
If the rank has more than 2^31 bytes of data to write, it will break up the write and do 1 MeshBlock at a time, but this decision is based on a local value. What if rank 0 has 3 MeshBlocks, pushing it over the 2^31 threshold, but rank 1 has 2 MeshBlocks, keeping it under? It seems rank 1 will take the first branch and trigger a single MPI_file_write_at_all, while rank 0 will take the second branch. Rank 0 will then trigger 2 MPI_file_write_at_all calls followed by a single MPI_file_write_at for its 3 MeshBlocks. It seems this will hang with an unbalanced number of collective writes.