
[ADAG]Enable NPU (hccl) communication for CG #47658

Open · wants to merge 32 commits into master
Conversation

@Bye-legumes (Contributor) commented Sep 13, 2024

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: zhilong <[email protected]>
@Bye-legumes Bye-legumes changed the title [WIP] Enable NPU communication for aDAG [WIP][ADAG]Enable NPU (hccl) communication for aDAG Sep 13, 2024
@rkooo567 (Contributor) commented:

cc @ruisearch42

@ruisearch42 (Contributor) left a comment:


Doing a round of review since I was tagged.

Overall looks good. Do you plan to add a test?

Let me know when this is ready to review.

```python
    from ray.experimental.channel.nccl_group import _NcclGroup
else:
    from ray.experimental.channel.hccl_group import _HcclGroup as _NcclGroup
```
Contributor:

Hmm, this looks like a hack. Do you plan to change to a cleaner approach?

Contributor (author):

OK, I just removed the hack and left a comment in the test. After we refactor the channel we can have a better solution.
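For illustration, a cleaner alternative could dispatch on torch_npu availability instead of rebinding the NCCL class name. This is a hypothetical sketch, not the code in this PR:

```python
# Hypothetical sketch (not the PR's code): choose the communicator class
# by probing for torch_npu instead of aliasing _HcclGroup as _NcclGroup.
def _get_communicator_cls():
    try:
        import torch_npu  # noqa: F401  # only present on Ascend installs

        from ray.experimental.channel.hccl_group import _HcclGroup

        return _HcclGroup
    except ImportError:
        from ray.experimental.channel.nccl_group import _NcclGroup

        return _NcclGroup
```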

```diff
@@ -16,7 +16,7 @@
 @DeveloperAPI
 class GPUCommunicator(ABC):
     """
-    Communicator for a group of aDAG actors on Nvidia GPU.
+    Communicator for a group of aDAG actors on Nvidia GPU or other XPUs.
```
Contributor:

We should probably change the class name to a more general one if this is to support other XPUs. This is not yet used externally so backward compatibility is not an issue.

Contributor (author):

I agree. As a next step I'd prefer to change it to AcceleratorCommunicator, or just Communicator. Currently GPUCommunicator is also called from some top-level code, so I'm keeping it for now.
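A device-agnostic base class along the lines discussed might look like the sketch below; the names are illustrative, and the PR itself keeps the GPUCommunicator name:

```python
from abc import ABC, abstractmethod
from typing import Tuple

import torch


class Communicator(ABC):
    """Communicator for a group of aDAG actors on any accelerator."""

    @abstractmethod
    def send(self, tensor: torch.Tensor, peer_rank: int) -> None:
        """Send a tensor to the actor at peer_rank."""
        raise NotImplementedError

    @abstractmethod
    def recv(
        self, shape: Tuple[int, ...], dtype: torch.dtype, peer_rank: int
    ) -> torch.Tensor:
        """Receive a tensor of the given shape and dtype from peer_rank."""
        raise NotImplementedError
```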

```python
self._device_id = device_id

if rank is not None:
    assert ray.get_gpu_ids(), "HCCL actor has no NPUs assigned"
```
Contributor:

ray.get_gpu_ids() seems to only get GPU IDs?

Contributor (author):

True, I just changed it. Also, I think there should be an API to get all accelerator IDs.
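Recent Ray versions expose a per-accelerator variant via the runtime context, which would fit better than ray.get_gpu_ids() here. A sketch, assuming Ascend devices are registered under an "NPU" resource key:

```python
import ray

# get_accelerator_ids() returns a dict keyed by resource name, e.g.
# {"GPU": ["0"], "NPU": ["0", "1"]}. The "NPU" key is an assumption
# about how the Ascend devices are registered in this cluster.
ids = ray.get_runtime_context().get_accelerator_ids()
assert ids.get("NPU"), "HCCL actor has no NPUs assigned"
```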

@anyscalesam added labels on Sep 16, 2024: P1 (Issue that should be fixed within a few weeks), core (Issues that should be addressed in Ray Core), compiled-graphs
```python
if self._closed:
    raise RayChannelError("HCCL group has been destroyed.")

self._comm.send(tensor=value, dst=peer_rank)
```

Contributor:

One question I have: how is this different from nccl_collective_group's send/recv? It seems nccl_collective_group just abstracts it at a higher level as _point2point, but is otherwise identical to nccl_group.

If it's meant to be channel-only, then we can merge this hccl_group and later open another PR for hccl_collective_group.

Contributor (author):

I think collective is a more general module that can be used by all other Ray modules, while here we need a module specific to the aDAG channel. We can have another PR for hccl_collective_group so it can serve as a util that makes the NPU easier to use. In collective we can try to solve the double import and other problems we've hit.
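For context, the contrast under discussion reduces to a thin dispatch layer: a collective-group-style _point2point wrapper versus the direct dist.send/dist.recv calls the channel-level hccl_group makes. A minimal sketch with illustrative names, not the actual Ray internals:

```python
import torch
import torch.distributed as dist


# Collective-group style: one entry point that dispatches on rank.
# The channel-level hccl_group instead calls dist.send/dist.recv directly.
def _point2point(
    tensor: torch.Tensor, src_rank: int, dst_rank: int, my_rank: int
) -> None:
    if my_rank == src_rank:
        dist.send(tensor, dst=dst_rank)
    elif my_rank == dst_rank:
        dist.recv(tensor, src=src_rank)
```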

Contributor:

So in yesterday's aDAG meeting someone mentioned that nccl_collective_group is actually old code, and nccl_group's send/recv is what's currently used. We can discuss more to see how to extend it to support collectives as part of the refactor proposal.

@arcyleung commented Sep 27, 2024:

There is another PR to support collective fns as a node type: #47621

I see they implemented collective/allreduce.py, which calls the allreduce of the GPUCommunicator in nccl_group.py.
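As a rough illustration of what a communicator-level allreduce bottoms out to, here is a hypothetical sketch (not the actual #47621 implementation):

```python
import torch
import torch.distributed as dist


# Hypothetical sketch: in-place sum-allreduce across all ranks in the
# default process group, as a communicator method might wrap it.
def allreduce(tensor: torch.Tensor) -> torch.Tensor:
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    return tensor
```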

Bye-legumes and others added 5 commits September 25, 2024 13:43
Signed-off-by: zhilong <[email protected]>
Signed-off-by: zhilong <[email protected]>
Signed-off-by: zhilong <[email protected]>
Signed-off-by: zhilong <[email protected]>
@Bye-legumes Bye-legumes changed the title [WIP][ADAG]Enable NPU (hccl) communication for aDAG [ADAG]Enable NPU (hccl) communication for aDAG Sep 25, 2024
Signed-off-by: zhilong <[email protected]>
Signed-off-by: zhilong <[email protected]>
@Bye-legumes (Contributor, author) commented:

Hi @ruisearch42, thanks for your suggestions! I just rewrote some of them and added a test here. The test is runnable on NPU but cannot run on GPU yet, so it serves as an example of how to run it. Our future plan is to refactor the aDAG channel first so it can support other devices more easily.

```python
)

torch_npu.npu.set_device(rank)  # Set the NPU device according to the rank
self.ctx = dist.init_process_group(
```
Contributor:

Should we call this process_group?

Contributor (author):

Aha, this is different from process_group. The Ascend torch_npu handles distributed initialization a little differently, while the other parts are the same: https://github.com/Ascend/pytorch/blob/868b6f8e00eb0fb179fe719a81e13d8ec1860873/test/distributed/test_send_recv.py#L25
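Based on the linked Ascend test, initialization goes through the standard torch.distributed entry point using the hccl backend that importing torch_npu registers. A sketch with assumed rank, world size, and rendezvous address:

```python
import torch.distributed as dist
import torch_npu  # importing this registers the "hccl" backend

rank, world_size = 0, 2  # example values; the group supplies these

torch_npu.npu.set_device(rank)  # bind this process to its NPU
dist.init_process_group(
    backend="hccl",
    init_method="tcp://127.0.0.1:29500",  # assumed rendezvous address
    world_size=world_size,
    rank=rank,
)
```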

Signed-off-by: zhilong <[email protected]>
@rkooo567 (Contributor) commented:

Sorry, I will review it tomorrow! I have been off for some time.

Signed-off-by: zhilong <[email protected]>
Signed-off-by: zhilong <[email protected]>
Signed-off-by: zhilong <[email protected]>
@kevin85421 self-assigned this on Nov 22, 2024

```python
import torch
import torch.distributed as dist
import torch_npu  # torch_npu enables NPU (hccl) communication
```
Member:

Does Ray install this package?

Contributor (author):

No, but this will only be imported if the package is installed, guarded by the NPU_TORCH_PACKAGE_AVAILABLE check.
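The guard presumably follows the usual optional-import pattern; a sketch, where the flag name comes from the comment above and the rest is assumed:

```python
# Optional-import guard: torch_npu is only available on Ascend installs,
# so probe for it once at import time.
try:
    import torch_npu  # noqa: F401

    NPU_TORCH_PACKAGE_AVAILABLE = True
except ImportError:
    NPU_TORCH_PACKAGE_AVAILABLE = False
```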

```python
tensor = torch.zeros(*shape, dtype=dtype).to(f"npu:{self._rank}")
dist.recv(tensor, src=peer_rank)
# torch.npu.synchronize(self._rank)
if self._closed:
```
Member:

self._closed will not be updated between L175 and L178. Do we need to check it again?

Contributor (author):

Just fixed! Previously there were some issues when tearing down the aDAG, but those are fixed now, so I can remove this. Thanks for your suggestions!
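After removing the redundant check, the receive path would keep a single guard before the blocking call. A sketch assuming the names from the snippet above:

```python
import torch
import torch.distributed as dist

from ray.exceptions import RayChannelError


class _HcclGroup:
    # ...other methods elided...

    def recv(self, shape, dtype, peer_rank: int) -> torch.Tensor:
        # Single guard before the blocking call; the second check after
        # dist.recv was removed once aDAG teardown was fixed.
        if self._closed:
            raise RayChannelError("HCCL group has been destroyed.")
        tensor = torch.zeros(*shape, dtype=dtype).to(f"npu:{self._rank}")
        dist.recv(tensor, src=peer_rank)
        return tensor
```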

@Bye-legumes Bye-legumes changed the title [ADAG]Enable NPU (hccl) communication for aDAG [ADAG]Enable NPU (hccl) communication for CG Nov 22, 2024
Labels: compiled-graphs · core (Issues that should be addressed in Ray Core) · P1 (Issue that should be fixed within a few weeks)
Projects: None yet

6 participants