
[aDAG] Support all reduce collective in aDAG #47621

Merged
96 commits merged into ray-project:master on Oct 21, 2024

Conversation

@dengwxn (Contributor) commented on Sep 12, 2024

Why are these changes needed?

aDAG currently does not support collective APIs. We would like to add support for collective APIs, starting with allreduce.

This PR adds support for allreduce by introducing syntactic sugar, `ray.experimental.collective.allreduce.bind`. The `bind` call accepts the arguments `input_nodes`, `op`, and `transport`. It returns a list of `CollectiveOutputNode`s as the allreduce results, with the same size as `input_nodes`. The allreduce results are written to newly allocated tensors. In the `COMPUTE` operation of a `CollectiveOutputNode`, the corresponding NCCL collective API is called. No changes are required to the input and output channels of `CollectiveOutputNode`.

Proposed new API:

```python
import ray.experimental.collective as collective

with InputNode() as inp:
    dag = [worker.return_tensor.bind(inp) for worker in workers]
    dag = collective.allreduce.bind(dag, ReduceOp.SUM)
    dag = MultiOutputNode(dag)
```
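For context, a minimal sketch of how such a DAG might then be compiled and executed with the usual aDAG flow, assuming `workers` are actor handles whose `return_tensor` method returns a CUDA tensor (the snippet continues the example above):

```python
# Sketch only: compile the graph once, then execute it.
compiled_dag = dag.experimental_compile()

# The result is a list with one entry per input node, each holding that
# worker's allreduced tensor.
ref = compiled_dag.execute(1)
results = ray.get(ref)
```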

API Requirements:

  1. Input nodes are unique.
  2. Actor handles are unique.
  3. Actor handles match the custom NCCL group if specified.
  4. All tensors have the same shape.

Requirements 1-3 are checked in the `_CollectiveOperation` constructor. Requirement 4 is checked at runtime via a timeout.
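As a rough illustration (not the actual `_CollectiveOperation` code), the constructor-time checks for requirements 1-3 could look like the sketch below; the parameter and helper names, such as `get_actor_handles`, are assumptions for illustration:

```python
def _check_collective_inputs(input_nodes, actor_handles, custom_nccl_group=None):
    # Requirement 1: input nodes must be unique.
    if len(set(input_nodes)) != len(input_nodes):
        raise ValueError("Expected unique input nodes for allreduce")

    # Requirement 2: each input node must come from a distinct actor.
    if len(set(actor_handles)) != len(actor_handles):
        raise ValueError("Expected unique actor handles for allreduce")

    # Requirement 3: a custom NCCL group, if given, must cover exactly these actors.
    if custom_nccl_group is not None:
        if set(actor_handles) != set(custom_nccl_group.get_actor_handles()):
            raise ValueError("Expected actor handles to match the custom NCCL group")
```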

Operation scheduling is also updated to account for NCCL collective operations. When an NCCL collective node is selected, all the other collective nodes in its collective group must be selected as well.
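A simplified sketch of that scheduling rule (illustrative only, not the actual code in `dag_node_operation.py`):

```python
def select_ready_operations(ready_ops, collective_group_of, priority_of):
    """Toy model: when the chosen operation is part of an NCCL collective,
    schedule its whole collective group together so every participant
    reaches the collective call at the same step and no rank is left
    waiting, which would deadlock the NCCL call."""
    op = min(ready_ops, key=priority_of)
    group = collective_group_of(op)
    if group is None:
        # Not a collective operation; schedule it on its own.
        return [op]
    # Schedule every ready operation that belongs to the same collective group.
    return [o for o in ready_ops if collective_group_of(o) == group]
```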

Related issue number

Meta-issue: #47983

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

The input index is used when a task returns multiple values; the index selects the corresponding value of the returned tuple.

Signed-off-by: Weixin Deng <[email protected]>
The downstream tasks of a TaskReturnNode should be readers of an output channel. The task of a TaskReturnNode should have a copy of the output channel from the task of its upstream ClassMethodNode.

Signed-off-by: Weixin Deng <[email protected]>
AndyUB and others added 9 commits October 17, 2024 16:26
@stephanie-wang (Contributor) left a comment


Nice work! Left a few remaining comments for cleanups and to see if we can further cut down on the tests that need GPUs to run. The only functionality we really need to test for GPUs is:

  • does it provide the expected allreduce result?
  • does passing a custom communicator actually result in calling the custom communicator during execution?

Otherwise, all logic should go in unit tests that don't need GPUs to run.
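Purely for illustration, a GPU-free check along the lines suggested above might use a fake communicator that records calls instead of doing real NCCL communication. The `FakeCommunicator` interface below is an assumption for the sketch, not the communicator API this PR actually wires in:

```python
class FakeCommunicator:
    """Records allreduce calls so a unit test can assert the collective
    path was exercised, without real NCCL or GPUs."""

    def __init__(self):
        self.allreduce_calls = []

    def allreduce(self, send_buf, recv_buf, op):
        # Record the call instead of launching an NCCL kernel.
        self.allreduce_calls.append(op)
        recv_buf[:] = send_buf  # trivial single-participant "reduction"


def test_allreduce_uses_custom_communicator():
    comm = FakeCommunicator()
    send, recv = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
    comm.allreduce(send, recv, op="sum")
    assert comm.allreduce_calls == ["sum"]
    assert recv == send
```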

Review comments (since resolved) on:

  • python/ray/dag/dag_node_operation.py
  • python/ray/dag/tests/experimental/test_collective_dag.py
  • python/ray/experimental/collective/allreduce.py
  • python/ray/dag/tests/experimental/test_torch_tensor_dag.py
AndyUB and others added 12 commits October 18, 2024 16:40
Signed-off-by: Yuhan Ruan <[email protected]>
Signed-off-by: Yuhan Ruan <[email protected]>
Signed-off-by: Weixin Deng <[email protected]>
Signed-off-by: Yuhan Ruan <[email protected]>
Signed-off-by: Yuhan Ruan <[email protected]>
Signed-off-by: Yuhan Ruan <[email protected]>
Move Deduplicate P2P & Collective
Signed-off-by: Weixin Deng <[email protected]>
Signed-off-by: Weixin Deng <[email protected]>
Signed-off-by: Weixin Deng <[email protected]>
Signed-off-by: Weixin Deng <[email protected]>
@stephanie-wang merged commit 2c68a4b into ray-project:master on Oct 21, 2024
5 checks passed
@dengwxn deleted the ccar-0905 branch on October 25, 2024
Jay-ju pushed a commit to Jay-ju/ray that referenced this pull request Nov 5, 2024
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request Nov 15, 2024
Labels
  • @author-action-required: The PR author is responsible for the next step. Remove tag to send back to the reviewer.
  • compiled-graphs
  • core: Issues that should be addressed in Ray Core
  • go: add ONLY when ready to merge, run all tests
  • P1: Issue that should be fixed within a few weeks
7 participants