[ADAG]Enable NPU (hccl) communication for CG #47658
Conversation
Signed-off-by: zhilong <[email protected]>
cc @ruisearch42
Having a round of review since I was tagged.
Overall looks good. Do you plan to add a test?
Let me know when this is ready to review.
    from ray.experimental.channel.nccl_group import _NcclGroup
else:
    from ray.experimental.channel.hccl_group import _HcclGroup as _NcclGroup
hmm, this looks like a hack. Do you plan to change to a cleaner approach?
OK, I just removed this hack and left a comment in the test. After we refactor the channel we can have a better solution.
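For illustration, one cleaner direction could be a small selector that resolves the communicator class at runtime instead of aliasing at import time. The helper below is a hypothetical sketch, not part of this PR.

```python
# Hypothetical sketch of a runtime selector instead of the import-time alias.
# The name _get_communicator_cls is made up for illustration.
def _get_communicator_cls(use_npu: bool):
    """Return the channel communicator class for the detected accelerator."""
    if use_npu:
        # Lazy import so environments without torch_npu are unaffected.
        from ray.experimental.channel.hccl_group import _HcclGroup
        return _HcclGroup
    from ray.experimental.channel.nccl_group import _NcclGroup
    return _NcclGroup
```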
@@ -16,7 +16,7 @@
 @DeveloperAPI
 class GPUCommunicator(ABC):
     """
-    Communicator for a group of aDAG actors on Nvidia GPU.
+    Communicator for a group of aDAG actors on Nvidia GPU or other XPUs.
We should probably change the class name to a more general one if this is to support other XPUs. This is not yet used externally so backward compatibility is not an issue.
I agree. As a next step I would prefer to change it to `AcceleratorCommunicator`, or just `Communicator`, for all backends. Currently this `GPUCommunicator` is also called from some top-level code, so I am keeping the name for now.
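As a rough illustration of what such a rename might look like: the `AcceleratorCommunicator` name and the reduced method set below are assumptions for discussion, not code from this PR or from Ray.

```python
from abc import ABC, abstractmethod

import torch


class AcceleratorCommunicator(ABC):
    """Hypothetical backend-agnostic communicator for a group of aDAG actors."""

    @abstractmethod
    def get_rank(self, actor) -> int:
        """Return the rank of the given actor within the group."""

    @abstractmethod
    def send(self, tensor: "torch.Tensor", peer_rank: int) -> None:
        """Send a tensor to the actor with rank ``peer_rank``."""

    @abstractmethod
    def recv(self, shape, dtype, peer_rank: int) -> "torch.Tensor":
        """Receive a tensor of ``shape``/``dtype`` from rank ``peer_rank``."""
```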
        self._device_id = device_id

        if rank is not None:
            assert ray.get_gpu_ids(), "HCCL actor has no NPUs assigned"
`ray.get_gpu_ids()` seems to only get GPU IDs?
True, I just changed it. Also, I think there should be an API to get all accelerator IDs?
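A minimal sketch of what such a lookup could do today, assuming the runtime-context method `get_accelerator_ids()` exists in the installed Ray version; the `hasattr` fallback keeps it safe if it does not.

```python
import ray


def _assigned_npu_ids():
    """Best-effort lookup of the NPU IDs assigned to the current actor."""
    ctx = ray.get_runtime_context()
    # get_accelerator_ids() is assumed to return a dict keyed by resource
    # name (e.g. "GPU", "NPU"); guard with hasattr in case the installed
    # Ray version does not provide it.
    if hasattr(ctx, "get_accelerator_ids"):
        return ctx.get_accelerator_ids().get("NPU", [])
    return []


# Illustrative use inside the HCCL group setup:
# assert _assigned_npu_ids(), "HCCL actor has no NPUs assigned"
```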
        if self._closed:
            raise RayChannelError("HCCL group has been destroyed.")

        self._comm.send(tensor=value, dst=peer_rank)
One question I have is how this is different from `nccl_collective_group` send/recv. It seems `nccl_collective_group` just abstracts it at a higher level as `_point2point`, but is otherwise identical to `nccl_group`.
If it's supposed to be channel-only, then we can merge this `hccl_group` and later open another PR for `hccl_collective_group`.
I think collective is a more general module that can be used by all other Ray modules, while here we need a module specific to the aDAG channel. I think we can have another PR for `hccl_collective_group` so it can be used as a utility that makes NPUs easier to use. In collective we can try to solve the double import and the other problems that we have met.
So in yesterday's aDAG meeting someone mentioned that `nccl_collective_group` is actually old code, and `nccl_group` send/recv is what's currently used. We can discuss more to see how to extend it to support collectives as part of the refactor proposal.
There is another PR to support collective fns as a node type: #47621. I see they implemented collective/allreduce.py, which calls `allreduce` of the `GPUCommunicator` in nccl_group.py.
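For context on extending the HCCL side to collectives: a collective could reuse the same process group that the send/recv path initializes. The snippet below is a hypothetical illustration built on torch.distributed, not code from either PR.

```python
import torch
import torch.distributed as dist


def hccl_allreduce(tensor: torch.Tensor) -> torch.Tensor:
    """Hypothetical all-reduce over an already-initialized HCCL process group."""
    # Assumes dist.init_process_group(backend="hccl", ...) was called and the
    # tensor already lives on the local NPU device.
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    return tensor
```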
Signed-off-by: zhilong <[email protected]>
Hi @ruisearch42, thanks for your suggestions! I just rewrote some of them and added a test here. The test is runnable on NPU but cannot run on GPU yet, so it is an example of how to run it.
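One common way to keep an NPU-only test in-tree is to skip it when torch_npu is unavailable; this is just a suggestion sketch, not what the PR's test currently does.

```python
import importlib.util

import pytest

# Skip the whole module when torch_npu (and thus NPU support) is absent, so
# the test can live in-tree without breaking GPU/CPU-only CI.
pytestmark = pytest.mark.skipif(
    importlib.util.find_spec("torch_npu") is None,
    reason="torch_npu is required to run HCCL channel tests",
)
```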
Signed-off-by: zhilong <[email protected]>
        )

        torch_npu.npu.set_device(rank)  # Set the NPU device according to the rank
        self.ctx = dist.init_process_group(
Should we call this process_group?
Aha, this is different from `process_group`. The Ascend torch_npu handles the distributed setup a little differently, while the other parts are the same: https://github.com/Ascend/pytorch/blob/868b6f8e00eb0fb179fe719a81e13d8ec1860873/test/distributed/test_send_recv.py#L25
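For reference, minimal HCCL setup with torch_npu usually looks like the sketch below; the master address/port and world size are placeholder assumptions. Note that `dist.init_process_group()` returns None and installs global state, which is why the stored `ctx` is not a ProcessGroup handle.

```python
import os

import torch.distributed as dist
import torch_npu  # noqa: F401  # registers the "hccl" backend and NPU device support


def init_hccl(rank: int, world_size: int) -> None:
    """Minimal sketch of HCCL process-group setup (placeholder address/port)."""
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    torch_npu.npu.set_device(rank)  # bind this process to its local NPU
    dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)
```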
Signed-off-by: zhilong <[email protected]>
Sorry, I will review it tomorrow! I have been off for some time.
Signed-off-by: zhilong <[email protected]>
import torch
import torch.distributed as dist
import torch_npu  # torch_npu is required for NPU communication
does Ray install this package?
No, but this will only be imported if the package is installed, which is checked via `NPU_TORCH_PACKAGE_AVAILABLE`.
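A guarded import for such a flag could look like the sketch below; the flag name matches the one mentioned above, but the surrounding code is an assumption, not the PR's implementation.

```python
import importlib.util

# Hypothetical guard: only touch torch_npu when it is actually installed.
NPU_TORCH_PACKAGE_AVAILABLE = importlib.util.find_spec("torch_npu") is not None

if NPU_TORCH_PACKAGE_AVAILABLE:
    import torch_npu  # noqa: F401
```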
        tensor = torch.zeros(*shape, dtype=dtype).to(f"npu:{self._rank}")
        dist.recv(tensor, src=peer_rank)
        # torch.npu.synchronize(self._rank)
        if self._closed:
`self._closed` will not be updated between L175 and L178. Do we need to check it again?
Just fixed! Previously there were some issues when tearing down the aDAG, but they are fixed now, so I can remove this check. Thanks for your suggestions!
Co-authored-by: Kai-Hsun Chen <[email protected]> Signed-off-by: zhilong <[email protected]>
Signed-off-by: zhilong <[email protected]>
Why are these changes needed?

Related issue number

Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.