NCCL WARN Cannot use cuda/gdr transports as part of specified UCX_TLS #222
Comments
Do you need to use
I want to know how much difference SHARP makes in large-model training compared with not using it.
Is SHARP enabled and configured on your system? This is a complex issue, so I think you need to contact the system vendor or an NVIDIA support representative/SA to diagnose SHARP with LLM training.
OK, thanks.
When I run nccl-tests with SHARP, I hit the error in the title. What causes it?
I tested with the NGC 24.05 image, so it shouldn't be an environment issue. If I remove the cuda_copy parameter and leave NCCL_UCX_TLS=rc_x, the test runs normally and UFM shows SHARP-related job creation, but the speed is the same as without SHARP enabled.
mpirun -mca plm_rsh_args "-p 12138" --allow-run-as-root --bind-to socket -x LD_LIBRARY_PATH -x NCCL_UCX_RNDV_THRESH=0 -x UCX_MEMTYPE_CACHE=n -x NCCL_COLLNET_ENABLE=1 -x NCCL_PLUGIN_P2P=ucx -x NCCL_DEBUG_SUBSYS=NET -x NCCL_DEBUG=INFO -x NCCL_IB_HCA=mlx5_10,mlx5_11,mlx5_12,mlx5_13,mlx5_14,mlx5_15,mlx5_16,mlx5_17 -x NCCL_SOCKET_IFNAME=eth5 -x NCCL_COLLNET_ENABLE=1 --host 10.101.42.2:8,10.101.42.3:8,10.101.42.4:8,10.101.42.5:8 ./build/all_reduce_perf -b 4G -e 4G -f 0 -i 0 -g 1
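When comparing runs like this, it can help to list exactly which environment variables the launch line exports, since a long `mpirun` command is easy to misread. The following is only a minimal sketch: the helper names `exported_vars` and `duplicates` are hypothetical, and the command string is shortened from the one in this report. It parses the `-x` options and reports any variable that is exported more than once.

```python
import shlex
from collections import Counter


def exported_vars(cmd):
    """Return (name, value) pairs for every `-x` option in an mpirun command.

    A bare `-x NAME` (forwarding the caller's value) yields value None.
    """
    tokens = shlex.split(cmd)
    pairs = []
    for i, tok in enumerate(tokens[:-1]):
        if tok == "-x":
            name, _, value = tokens[i + 1].partition("=")
            pairs.append((name, value or None))
    return pairs


def duplicates(pairs):
    """Return the sorted names of variables that are exported more than once."""
    counts = Counter(name for name, _ in pairs)
    return sorted(name for name, n in counts.items() if n > 1)


# Shortened version of the command from this report (hosts and HCA list trimmed).
cmd = (
    'mpirun -mca plm_rsh_args "-p 12138" --allow-run-as-root --bind-to socket '
    "-x LD_LIBRARY_PATH -x NCCL_UCX_RNDV_THRESH=0 -x UCX_MEMTYPE_CACHE=n "
    "-x NCCL_COLLNET_ENABLE=1 -x NCCL_PLUGIN_P2P=ucx -x NCCL_DEBUG_SUBSYS=NET "
    "-x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth5 -x NCCL_COLLNET_ENABLE=1 "
    "./build/all_reduce_perf -b 4G -e 4G -f 0 -i 0 -g 1"
)

pairs = exported_vars(cmd)
settings = dict(pairs)

print("exported:", sorted(settings))
print("exported more than once:", duplicates(pairs))
print("UCX_TLS set on the command line:", "UCX_TLS" in settings)
```

For the command in this report, the sketch would flag NCCL_COLLNET_ENABLE as exported twice (harmless, since both values are 1) and show that no TLS selection is passed on the launch line itself, so that setting must come from somewhere else in the environment.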