NCCL WARN Cannot use cuda/gdr transports as part of specified UCX_TLS #222

liuxingbo12138 · 2024-06-07T06:12:52Z

When I run nccl-tests with SHARP, I get this error. What causes it?
I tested with the NGC 24.05 image, so it shouldn't be an environment issue. If I remove cuda_copy from NCCL_UCX_TLS=rc_x,cuda_copy, the test runs normally and a SHARP-related job is created in UFM, but the speed is the same as without SHARP.
mpirun -mca plm_rsh_args "-p 12138" --allow-run-as-root --bind-to socket -x LD_LIBRARY_PATH -x NCCL_UCX_RNDV_THRESH=0 -x UCX_MEMTYPE_CACHE=n -x NCCL_COLLNET_ENABLE=1 -x NCCL_PLUGIN_P2P=ucx -x NCCL_DEBUG_SUBSYS=NET -x NCCL_DEBUG=INFO -x NCCL_IB_HCA=mlx5_10,mlx5_11,mlx5_12,mlx5_13,mlx5_14,mlx5_15,mlx5_16,mlx5_17 -x NCCL_SOCKET_IFNAME=eth5 -x NCCL_COLLNET_ENABLE=1 --host 10.101.42.2:8,10.101.42.3:8,10.101.42.4:8,10.101.42.5:8 ./build/all_reduce_perf -b 4G -e 4G -f 0 -i 0 -g 1


AddyLaddy · 2024-06-07T06:17:53Z

Do you need to use -x NCCL_PLUGIN_P2P=ucx? I don't believe that is the default for the SHARP plugin.
You don't need UCX in order to use SHARP in the external plugin.
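
As a rough illustration of that suggestion, the original command could be trimmed as sketched below. This is an untested variant, not a verified fix: it assumes that dropping NCCL_PLUGIN_P2P=ucx and the two UCX-specific variables (NCCL_UCX_RNDV_THRESH, UCX_MEMTYPE_CACHE) leaves the plugin on its default point-to-point transport, while NCCL_COLLNET_ENABLE=1 still requests SHARP. Hosts, port, HCA list, and interface name are copied from the command in the first post.

```shell
#!/bin/sh
# Sketch only: the original mpirun line with the UCX-related settings removed.
# Assumption: SHARP (CollNet) does not depend on the UCX P2P transport.
set -- mpirun -mca plm_rsh_args "-p 12138" --allow-run-as-root --bind-to socket \
  -x LD_LIBRARY_PATH \
  -x NCCL_COLLNET_ENABLE=1 \
  -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=NET \
  -x NCCL_IB_HCA=mlx5_10,mlx5_11,mlx5_12,mlx5_13,mlx5_14,mlx5_15,mlx5_16,mlx5_17 \
  -x NCCL_SOCKET_IFNAME=eth5 \
  --host 10.101.42.2:8,10.101.42.3:8,10.101.42.4:8,10.101.42.5:8 \
  ./build/all_reduce_perf -b 4G -e 4G -f 0 -i 0 -g 1

# Print the assembled command for review instead of launching it.
printf '%s\n' "$*"

# Uncomment to actually launch the job:
# "$@"
```

Comparing the reported bus bandwidth from this run against a run with NCCL_COLLNET_ENABLE=0 would be one way to check whether SHARP is having any effect.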

liuxingbo12138 · 2024-06-07T08:17:49Z

> Do you need to use -x NCCL_PLUGIN_P2P=ucx? I don't believe that is the default for the SHARP plugin. You don't need UCX in order to use SHARP in the external plugin.

When I run Megatron-LM without specifying NCCL_PLUGIN_P2P=ucx and NCCL_UCX_TLS=rc_x,cuda_copy, it reports the following error

liuxingbo12138 · 2024-06-07T08:19:32Z

I want to know how much difference there is between using SHARP and not using SHARP in large model training.

AddyLaddy · 2024-06-07T16:55:54Z

Is SHARP enabled and configured on your system? I think you need to contact the system vendor or an Nvidia support representative/SA in order to be able to diagnose this complex issue with using SHARP and LLM training.

liuxingbo12138 · 2024-06-17T06:40:28Z

> Is SHARP enabled and configured on your system? I think you need to contact the system vendor or an Nvidia support representative/SA in order to be able to diagnose this complex issue with using SHARP and LLM training.

OK, thanks
