Fix deadlock and message file retention on Frontier. #67

Merged 8 commits on Oct 27, 2023

@joaander (Member) commented Oct 10, 2023

Description

  • Remove simulation.walltime evaluations from inside rank == 0 checks.
  • Fix other miscellaneous issues on Frontier.

Motivation and context

  • communicator.walltime calls MPI_Bcast and therefore must be called collectively. I opted to remove the walltime message rather than call communicator.walltime collectively and print the result.
  • The message_file was overwritten on job restart. Tag the file with the Slurm job ID to keep old logs.
  • Prevent the new patchy particle test from running when HOOMD is built without LLVM.
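
The collective-call problem described in the first point above can be illustrated with a minimal sketch. It does not use HOOMD or MPI: `FakeCommunicator`, `run_ranks`, `buggy`, and `fixed` are hypothetical names, threads stand in for MPI ranks, and a `threading.Barrier` with a timeout stands in for the blocking `MPI_Bcast`, so a hang is reported as an error instead of blocking forever. The `fixed` variant shows the collective alternative that this PR chose not to take; the PR removes the walltime message instead.

```python
import threading


class FakeCommunicator:
    """Hypothetical stand-in for an MPI communicator (not HOOMD's API)."""

    def __init__(self, num_ranks, timeout=0.5):
        # Every rank must reach this barrier, mimicking a collective MPI_Bcast.
        self._barrier = threading.Barrier(num_ranks)
        self._timeout = timeout

    def walltime(self):
        # Blocks until all ranks have called it. The timeout converts what
        # would be a permanent hang into a BrokenBarrierError.
        self._barrier.wait(timeout=self._timeout)
        return 42.0  # placeholder value, identical on every rank


def run_ranks(body, num_ranks=4):
    """Run body(comm, rank) on one thread per simulated rank."""
    comm = FakeCommunicator(num_ranks)
    results = [None] * num_ranks

    def worker(rank):
        try:
            results[rank] = body(comm, rank)
        except threading.BrokenBarrierError:
            results[rank] = "deadlock"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_ranks)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def buggy(comm, rank):
    # Only rank 0 enters the collective call, so it waits for ranks
    # that never arrive.
    if rank == 0:
        return comm.walltime()
    return "skipped"


def fixed(comm, rank):
    # Every rank enters the collective call; only rank 0 uses the result.
    elapsed = comm.walltime()
    if rank == 0:
        return elapsed
    return "done"


if __name__ == "__main__":
    print(run_ranks(buggy))
    print(run_ranks(fixed))
```

In this sketch, rank 0 in the `buggy` case reports `"deadlock"` after the timeout, while the `fixed` case returns the walltime on rank 0. A real MPI job has no such timeout, which is why the failure shows up as a hung job.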

How has this been tested?

For unknown reasons, the original code does not always deadlock as I would expect. It deadlocks only in very rare circumstances in job submissions on OLCF Frontier. I have tested the create_initial_state operations on Frontier, and they now work.

TODO:

  • Test full workflow.

Checklist:

@joaander mentioned this pull request Oct 10, 2023 (3 tasks)
@joaander changed the title from "Fix deadlock." to "Fix deadlock and message file retention on Frontier." Oct 24, 2023
@joaander (Member, Author) commented

The entire workflow runs on Frontier.

@joaander marked this pull request as ready for review October 27, 2023 14:00
@joaander requested review from a team and tommy-waltmann and removed request for a team October 27, 2023 14:00
@joaander merged commit 164a106 into trunk Oct 27, 2023
1 check passed
@joaander deleted the fix-deadlock branch October 27, 2023 17:53