NCCL WARN Cannot use cuda/gdr transports as part of specified UCX_TLS #222

Open
liuxingbo12138 opened this issue Jun 7, 2024 · 5 comments

Comments

@liuxingbo12138

image
When I run nccl-tests with SHARP, I get the error shown above. What causes it?
I tested with the NGC 24.05 image, so it shouldn't be an environment issue. If I remove cuda_copy from NCCL_UCX_TLS=rc_x,cuda_copy, the test runs normally and a SHARP-related job is created in UFM, but the speed is the same as without SHARP.
mpirun -mca plm_rsh_args "-p 12138" --allow-run-as-root --bind-to socket -x LD_LIBRARY_PATH -x NCCL_UCX_RNDV_THRESH=0 -x UCX_MEMTYPE_CACHE=n -x NCCL_COLLNET_ENABLE=1 -x NCCL_PLUGIN_P2P=ucx -x NCCL_DEBUG_SUBSYS=NET -x NCCL_DEBUG=INFO -x NCCL_IB_HCA=mlx5_10,mlx5_11,mlx5_12,mlx5_13,mlx5_14,mlx5_15,mlx5_16,mlx5_17 -x NCCL_SOCKET_IFNAME=eth5 -x NCCL_COLLNET_ENABLE=1 --host 10.101.42.2:8,10.101.42.3:8,10.101.42.4:8,10.101.42.5:8 ./build/all_reduce_perf -b 4G -e 4G -f 0 -i 0 -g 1
image
image
image

@AddyLaddy
Collaborator

Do you need to use -x NCCL_PLUGIN_P2P=ucx? I don't believe that is the default for the SHARP plugin.
You don't need UCX in order to use SHARP in the external plugin.
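
Following that suggestion, here is a dry-run sketch of the original command with the UCX plugin settings left out (`NCCL_PLUGIN_P2P`, `NCCL_UCX_RNDV_THRESH`, `UCX_MEMTYPE_CACHE`). It only prints the `mpirun` line and does not execute it. `build_cmd` is a hypothetical helper name; every flag, host, and HCA name is copied from the original post. Whether this avoids the warning on this particular cluster is not confirmed in the thread.

```shell
#!/bin/sh
# Dry run: print the mpirun line from the original post without the UCX
# plugin variables. Nothing is executed on the cluster.
# build_cmd is a hypothetical helper; hosts, HCAs, port and paths are copied
# from the post and will differ on other systems.

build_cmd() {
  # Optional first argument: value for NCCL_COLLNET_ENABLE (defaults to 1).
  collnet="${1:-1}"
  hosts="10.101.42.2:8,10.101.42.3:8,10.101.42.4:8,10.101.42.5:8"
  hcas="mlx5_10,mlx5_11,mlx5_12,mlx5_13,mlx5_14,mlx5_15,mlx5_16,mlx5_17"

  # printf repeats the '%s ' format once per argument.
  printf '%s ' \
    mpirun -mca plm_rsh_args '"-p 12138"' --allow-run-as-root --bind-to socket \
    -x LD_LIBRARY_PATH \
    -x "NCCL_COLLNET_ENABLE=${collnet}" \
    -x NCCL_DEBUG=INFO \
    -x NCCL_DEBUG_SUBSYS=NET \
    -x "NCCL_IB_HCA=${hcas}" \
    -x NCCL_SOCKET_IFNAME=eth5 \
    --host "${hosts}" \
    ./build/all_reduce_perf -b 4G -e 4G -f 0 -i 0 -g 1
  printf '\n'
}

build_cmd
```

Calling `build_cmd 0` prints the same line with `NCCL_COLLNET_ENABLE=0`, which can serve as a baseline run without CollNet.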

@liuxingbo12138
Author

> Do you need to use -x NCCL_PLUGIN_P2P=ucx? I don't believe that is the default for the SHARP plugin. You don't need UCX in order to use SHARP in the external plugin.

When I run Megatron-LM without setting NCCL_PLUGIN_P2P=ucx and NCCL_UCX_TLS=rc_x,cuda_copy, it reports the following error:
image
image

@liuxingbo12138
Author

I want to know how much difference there is between using SHARP and not using SHARP in large model training.
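
One way to put a number on that difference is to run the same benchmark twice, once with `NCCL_COLLNET_ENABLE=0` and once with `NCCL_COLLNET_ENABLE=1`, and compare the bus bandwidth that `all_reduce_perf` reports. The sketch below only does the arithmetic. `pct_gain` is a hypothetical helper name, and its two inputs are whatever values you measured; the thread reports no measurements.

```shell
#!/bin/sh
# pct_gain BASELINE CANDIDATE
# Prints the percentage change of CANDIDATE relative to BASELINE, e.g. bus
# bandwidth without SHARP (baseline) vs. with SHARP (candidate).
# Hypothetical helper; the inputs are your own measurements.
pct_gain() {
  awk -v base="$1" -v cand="$2" 'BEGIN {
    if (base + 0 == 0) {
      print "baseline must be non-zero" > "/dev/stderr"
      exit 1
    }
    printf "%.1f\n", (cand - base) / base * 100
  }'
}

# Usage (fill in your own busbw numbers, both in the same unit):
#   pct_gain "$BUSBW_WITHOUT_SHARP" "$BUSBW_WITH_SHARP"
```

A result near zero matches the observation in the original post that the speed was unchanged, which suggests the collective was not actually offloaded in that run.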

@AddyLaddy
Collaborator

Is SHARP enabled and configured on your system? I think you need to contact the system vendor or an Nvidia support representative/SA in order to be able to diagnose this complex issue with using SHARP and LLM training.

@liuxingbo12138
Author

> Is SHARP enabled and configured on your system? I think you need to contact the system vendor or an Nvidia support representative/SA in order to be able to diagnose this complex issue with using SHARP and LLM training.

OK, thanks.
