[ADAG]Enable NPU (hccl) communication for CG #47658
Conversation
Signed-off-by: zhilong <[email protected]>
cc @ruisearch42
Having a round of review since I was tagged.
Overall looks good. Do you plan to add a test?
Let me know when this is ready to review.
    from ray.experimental.channel.nccl_group import _NcclGroup
else:
    from ray.experimental.channel.hccl_group import _HcclGroup as _NcclGroup
hmm, this looks like a hack. Do you plan to change to a cleaner approach?
OK, I just removed this hack and left a comment in the test. After we refactor the channel we can have a better solution.
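For illustration, one cleaner direction could be a small selector that resolves the communicator class at runtime instead of aliasing at import time. The helper below is a hypothetical sketch, not part of this PR.

```python
# Hypothetical sketch of a runtime selector instead of the import-time alias.
# The name _get_communicator_cls is made up for illustration.
def _get_communicator_cls(use_npu: bool):
    """Return the channel communicator class for the detected accelerator."""
    if use_npu:
        # Lazy import so environments without torch_npu are unaffected.
        from ray.experimental.channel.hccl_group import _HcclGroup
        return _HcclGroup
    from ray.experimental.channel.nccl_group import _NcclGroup
    return _NcclGroup
```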
@@ -16,7 +16,7 @@
 @DeveloperAPI
 class GPUCommunicator(ABC):
     """
-    Communicator for a group of aDAG actors on Nvidia GPU.
+    Communicator for a group of aDAG actors on Nvidia GPU or other XPUs.
We should probably change the class name to a more general one if this is to support other XPUs. This is not yet used externally so backward compatibility is not an issue.
I agree. As a next step I would prefer to change it to `AcceleratorCommunicator`, or just `Communicator`, for all backends. Currently this `GPUCommunicator` is also called from some top-level code, so I am keeping the name for now.
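As a rough illustration of what such a rename might look like: the `AcceleratorCommunicator` name and the reduced method set below are assumptions for discussion, not code from this PR or from Ray.

```python
from abc import ABC, abstractmethod

import torch


class AcceleratorCommunicator(ABC):
    """Hypothetical backend-agnostic communicator for a group of aDAG actors."""

    @abstractmethod
    def get_rank(self, actor) -> int:
        """Return the rank of the given actor within the group."""

    @abstractmethod
    def send(self, tensor: "torch.Tensor", peer_rank: int) -> None:
        """Send a tensor to the actor with rank ``peer_rank``."""

    @abstractmethod
    def recv(self, shape, dtype, peer_rank: int) -> "torch.Tensor":
        """Receive a tensor of ``shape``/``dtype`` from rank ``peer_rank``."""
```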
        self._device_id = device_id

        if rank is not None:
            assert ray.get_gpu_ids(), "HCCL actor has no NPUs assigned"
`ray.get_gpu_ids()` seems to only get GPU IDs?
True, I just changed it. Also, I think there should be an API to get all accelerator IDs?
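A minimal sketch of what such a lookup could do today, assuming the runtime-context method `get_accelerator_ids()` exists in the installed Ray version; the `hasattr` fallback keeps it safe if it does not.

```python
import ray


def _assigned_npu_ids():
    """Best-effort lookup of the NPU IDs assigned to the current actor."""
    ctx = ray.get_runtime_context()
    # get_accelerator_ids() is assumed to return a dict keyed by resource
    # name (e.g. "GPU", "NPU"); guard with hasattr in case the installed
    # Ray version does not provide it.
    if hasattr(ctx, "get_accelerator_ids"):
        return ctx.get_accelerator_ids().get("NPU", [])
    return []


# Illustrative use inside the HCCL group setup:
# assert _assigned_npu_ids(), "HCCL actor has no NPUs assigned"
```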
        if self._closed:
            raise RayChannelError("HCCL group has been destroyed.")

        self._comm.send(tensor=value, dst=peer_rank)
One question I have is how this is different from `nccl_collective_group` send/recv. It seems `nccl_collective_group` just abstracts it at a higher level as `_point2point`, but is otherwise identical to `nccl_group`.
If it's supposed to be channel-only, then we can merge this `hccl_group` and later open another PR for `hccl_collective_group`.
I think collective is a more general module that can be used by all other Ray modules, while here we need a module specific to the aDAG channel. I think we can have another PR for `hccl_collective_group` so it can be used as a utility that makes NPUs easier to use. In collective we can try to solve the double import and the other problems that we have met.
So in yesterday's aDAG meeting someone mentioned that `nccl_collective_group` is actually old code, and `nccl_group` send/recv is what's currently used. We can discuss more to see how to extend it to support collectives as part of the refactor proposal.
There is another PR to support collective fns as a node type: #47621. I see they implemented collective/allreduce.py, which calls `allreduce` of the `GPUCommunicator` in nccl_group.py.
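For context on extending the HCCL side to collectives: a collective could reuse the same process group that the send/recv path initializes. The snippet below is a hypothetical illustration built on torch.distributed, not code from either PR.

```python
import torch
import torch.distributed as dist


def hccl_allreduce(tensor: torch.Tensor) -> torch.Tensor:
    """Hypothetical all-reduce over an already-initialized HCCL process group."""
    # Assumes dist.init_process_group(backend="hccl", ...) was called and the
    # tensor already lives on the local NPU device.
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    return tensor
```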
Signed-off-by: zhilong <[email protected]>
Hi @ruisearch42, thanks for your suggestions! I just rewrote some of them and added a test here. The test is runnable on NPU but cannot run on GPU yet, so it is an example of how to run it.
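One common way to keep an NPU-only test in-tree is to skip it when torch_npu is unavailable; this is just a suggestion sketch, not what the PR's test currently does.

```python
import importlib.util

import pytest

# Skip the whole module when torch_npu (and thus NPU support) is absent, so
# the test can live in-tree without breaking GPU/CPU-only CI.
pytestmark = pytest.mark.skipif(
    importlib.util.find_spec("torch_npu") is None,
    reason="torch_npu is required to run HCCL channel tests",
)
```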
Signed-off-by: zhilong <[email protected]>
        )

        torch_npu.npu.set_device(rank)  # Set the NPU device according to the rank
        self.ctx = dist.init_process_group(
Should we call this process_group?
Aha, this is different from `process_group`. The Ascend torch_npu handles the distributed setup a little differently, while the other parts are the same: https://github.com/Ascend/pytorch/blob/868b6f8e00eb0fb179fe719a81e13d8ec1860873/test/distributed/test_send_recv.py#L25
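For reference, minimal HCCL setup with torch_npu usually looks like the sketch below; the master address/port and world size are placeholder assumptions. Note that `dist.init_process_group()` returns None and installs global state, which is why the stored `ctx` is not a ProcessGroup handle.

```python
import os

import torch.distributed as dist
import torch_npu  # noqa: F401  # registers the "hccl" backend and NPU device support


def init_hccl(rank: int, world_size: int) -> None:
    """Minimal sketch of HCCL process-group setup (placeholder address/port)."""
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    torch_npu.npu.set_device(rank)  # bind this process to its local NPU
    dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)
```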
Signed-off-by: zhilong <[email protected]>
Sorry, I will review it tomorrow! I have been off for some time.
Signed-off-by: zhilong <[email protected]>
import torch
import torch.distributed as dist
import torch_npu  # torch_npu is required for NPU communication
does Ray install this package?
No, but this will only be imported if the package is installed, which is checked via `NPU_TORCH_PACKAGE_AVAILABLE`.
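A guarded import for such a flag could look like the sketch below; the flag name matches the one mentioned above, but the surrounding code is an assumption, not the PR's implementation.

```python
import importlib.util

# Hypothetical guard: only touch torch_npu when it is actually installed.
NPU_TORCH_PACKAGE_AVAILABLE = importlib.util.find_spec("torch_npu") is not None

if NPU_TORCH_PACKAGE_AVAILABLE:
    import torch_npu  # noqa: F401
```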
        tensor = torch.zeros(*shape, dtype=dtype).to(f"npu:{self._rank}")
        dist.recv(tensor, src=peer_rank)
        # torch.npu.synchronize(self._rank)
        if self._closed:
`self._closed` will not be updated between L175 and L178. Do we need to check it again?
Just fixed! Previously there were some issues when tearing down the aDAG, but they are fixed now, so I can remove this check. Thanks for your suggestions!
Co-authored-by: Kai-Hsun Chen <[email protected]> Signed-off-by: zhilong <[email protected]>
Signed-off-by: zhilong <[email protected]>
Why are these changes needed?

Related issue number

Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.