[RFC] Implement gloo abort for graceful shutdown #388

Open
wants to merge 5 commits into
base: main

Aidyn-A
Contributor

Aidyn-A commented Sep 25, 2024

pytorch/pytorch#130345 requested a ProcessGroupGloo.shutdown() for faster recovery from distributed rank failures. This PR is a first step toward a proper shutdown. The second step would be implementing gloo::abort() within PyTorch's ProcessGroupGloo.

@c-p-i-o
Contributor

c-p-i-o commented Nov 1, 2024

Sorry for the delay. Are you able to add a test for this change?

@c-p-i-o
Contributor

c-p-i-o commented Nov 1, 2024

Ignore the CI breakage for now. I'm trying to revive the CI for this repository.

@Aidyn-A
Contributor Author

Aidyn-A commented Nov 5, 2024

> Sorry for the delay. Are you able to add a test for this change?

Sure, I will add a test and resolve the merge conflicts soon.

@Aidyn-A
Contributor Author

Aidyn-A commented Nov 15, 2024

Hey @c-p-i-o, how does the PR look to you? Do you think it is ready to merge? Please let me know if you have any comments.

@Aidyn-A Aidyn-A requested a review from c-p-i-o November 15, 2024 14:37
@facebook-github-bot

@c-p-i-o has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@Aidyn-A
Contributor Author

Aidyn-A commented Dec 5, 2024

Hey @c-p-i-o, can you please let me know which tests are failing?
Also, which linter is used? Would clang-format alone be enough to resolve the lint errors?
