[RFC] Implement gloo abort for graceful shutdown #388
Sorry for the delay. Are you able to add a test for this change?
Ignore the CI breakage for now. I'm trying to revive the CI for this repository.
Sure, I will add a test and resolve the merge conflicts soon.
Hey @c-p-i-o, how does the PR look to you? Do you think it is ready to merge? Please let me know if you have any comments.
@c-p-i-o has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Hey @c-p-i-o, can you please let me know what tests are failing?
In pytorch/pytorch#130345 it was requested to implement a `ProcessGroupGloo.shutdown()` for faster recovery from distributed rank failures. This PR is a first step toward a proper shutdown. The second step would be implementing `gloo::abort()` within PyTorch's `ProcessGroupGloo`.