
It has an RDMA net device in the container, but "Init RDMA failed!Create rdma server failed!" Why? #2004

Open · hsh258 opened this issue Oct 24, 2024 · 54 comments

@hsh258 commented Oct 24, 2024

rpc_server.cc:112] Init RDMA failed!Create rdma server failed!

Describe your problem

A clear and concise description of what your problem is. It might be a bug, a feature request, or just a problem that needs support from the vineyard team.


If it is a bug report, to help us reproduce this bug, please provide the information below:

  1. Your operating system version (uname -a):
  2. The version of vineyard you use (vineyard.__version__):
  3. Versions of crucial packages, such as gcc, numpy, pandas, etc.:
  4. Full stack of the error (if there is a crash):
  5. Minimized code to reproduce the error:

If it is a feature request, please provide a clear and concise description of what you want to happen:

What is the problem:

The behaviour that you expect to work:

Additional context

Add any other context about the problem here.

@dashanji (Member)

Hi @hsh258, could you please use something like ib_write_bw or libfabric to check whether the rdma dev can work?
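For example, with the perftest tools (a sketch; the device name mlx5_0 and the address are placeholders): run ib_write_bw -d mlx5_0 on the server side, then ib_write_bw -d mlx5_0 <server_address> on the client side, and check that a bandwidth report comes back.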

@hsh258 (Author) commented Oct 28, 2024

Hi @hsh258, could you please use something like ib_write_bw or libfabric to check whether the rdma dev can work?

Hi, here are some details (scenario: inside a container):
fi_getinfo returns -FI_ENODATA

find / -name 'librdmacm*' 2>/dev/null
/var/lib/dpkg/info/librdmacm1:amd64.shlibs
/var/lib/dpkg/info/librdmacm1:amd64.triggers
/var/lib/dpkg/info/librdmacm1:amd64.symbols
/var/lib/dpkg/info/librdmacm1:amd64.md5sums
/var/lib/dpkg/info/librdmacm1:amd64.list
/var/cache/apt/archives/librdmacm1_28.0-1ubuntu1_amd64.deb
/usr/lib/x86_64-linux-gnu/librdmacm.so
/usr/lib/x86_64-linux-gnu/librdmacm.so.1.2.28.0
/usr/lib/x86_64-linux-gnu/librdmacm.so.1
/usr/share/doc/librdmacm1

find / -name 'libibverbs*' 2>/dev/null
/etc/libibverbs.d
/var/lib/dpkg/info/libibverbs1:amd64.md5sums
/var/lib/dpkg/info/libibverbs1:amd64.shlibs
/var/lib/dpkg/info/libibverbs-dev:amd64.list
/var/lib/dpkg/info/libibverbs1:amd64.list
/var/lib/dpkg/info/libibverbs-dev:amd64.md5sums
/var/lib/dpkg/info/libibverbs1:amd64.postinst
/var/lib/dpkg/info/libibverbs1:amd64.symbols
/var/lib/dpkg/info/libibverbs1:amd64.triggers
/var/cache/apt/archives/libibverbs1_28.0-1ubuntu1_amd64.deb
/var/cache/apt/archives/libibverbs-dev_28.0-1ubuntu1_amd64.deb
/usr/lib/x86_64-linux-gnu/pkgconfig/libibverbs.pc
/usr/lib/x86_64-linux-gnu/libibverbs.so
/usr/lib/x86_64-linux-gnu/libibverbs.a
/usr/lib/x86_64-linux-gnu/libibverbs
/usr/lib/x86_64-linux-gnu/libibverbs.so.1.8.28.0
/usr/lib/x86_64-linux-gnu/libibverbs.so.1
/usr/share/doc/libibverbs1
/usr/share/doc/libibverbs-dev

apt-cache search libfabric
libfabric-bin - Diagnosis programs for the libfabric communication library
libfabric-dev - Development files for libfabric1
libfabric1 - libfabric communication library

dpkg -l | grep libfabric
ii libfabric1 1.6.2-3ubuntu0.1 amd64 libfabric communication library

However, the container has no fi_info tool, so I can't check with "fi_info -p verbs".

About "whether the rdma dev can work": the rdma dev definitely works.

fi_info
/usr/local/bin/.libs/fi_info: /lib/x86_64-linux-gnu/libfabric.so.1: version `FABRIC_1.4' not found (required by /usr/local/bin/.libs/fi_info)
/usr/local/bin/.libs/fi_info: /lib/x86_64-linux-gnu/libfabric.so.1: version `FABRIC_1.7' not found (required by /usr/local/bin/.libs/fi_info)
root@d0cf4f0fd8bb:/usr/local/bin# find / -name 'libfabric.so' 2>/dev/null
/usr/lib/x86_64-linux-gnu/libfabric.so.1
/usr/lib/x86_64-linux-gnu/libfabric.so.1.9.15
root@d0cf4f0fd8bb:/usr/local/bin# ls -l /usr/lib/x86_64-linux-gnu/libfabric.so.1
lrwxrwxrwx 1 root root 19 Nov 30 2022 /usr/lib/x86_64-linux-gnu/libfabric.so.1 -> libfabric.so.1.9.15

@vegetableysm (Collaborator) commented Oct 28, 2024

apt-cache search libfabric

Hi! Could you give me more details? For example, specific error messages like this:
[screenshot of error messages]

And could you give me your command to run vineyardd? Thanks.

By the way, you can install fabtests to run fi_info. Please make sure the fabtests version is compatible with libfabric.

@hsh258 (Author) commented Oct 28, 2024

apt-cache search libfabric

Hi! Could you give me more details? For example, specific error messages like this: [screenshot of error messages]

And could you give me your command to run vineyardd? Thanks.

By the way, you can install fabtests to run fi_info. Please make sure the fabtests version is compatible with libfabric.

Hi,
Here is the command:
./vineyardd --rdma_endpoint fd00:80:2200:3205::1207:b02

fi_info -p verbs
fi_getinfo: -61 (No data available)
lrwxrwxrwx. 1 root root 19 Oct 26 09:17 libfabric.so -> libfabric.so.1.24.0
lrwxrwxrwx. 1 root root 19 Oct 26 09:17 libfabric.so.1 -> libfabric.so.1.24.0
-rwxr-xr-x. 1 root root 1187520 Oct 26 09:17 libfabric.so.1.24.0

@vegetableysm (Collaborator)

apt-cache search libfabric

Hi! Could you give me more details? For example, specific error messages like this: [screenshot of error messages]
And could you give me your command to run vineyardd? Thanks.
By the way, you can install fabtests to run fi_info. Please make sure the fabtests version is compatible with libfabric.

Hi, here is the command: ./vineyardd --rdma_endpoint fd00:80:2200:3205::1207:b02

fi_info -p verbs fi_getinfo: -61 (No data available) lrwxrwxrwx. 1 root root 19 Oct 26 09:17 libfabric.so -> libfabric.so.1.24.0 lrwxrwxrwx. 1 root root 19 Oct 26 09:17 libfabric.so.1 -> libfabric.so.1.24.0 -rwxr-xr-x. 1 root root 1187520 Oct 26 09:17 libfabric.so.1.24.0

Does "fd00:80:2200:3205::1207:b02" is an ipv6 address? Currently vineyard does not support ipv6 address resolution, please try it again with ipv4 address. Additionally, rdma devices requires root privileges. Are you doing this as root?

By the way, the RDMA module of vineyard is based on libfabric, so if the libfabric tool "fi_info" can't see the RDMA device information, vineyard can't get it either.
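(For more detail, assuming the standard libfabric environment variables: running FI_LOG_LEVEL=debug fi_info -p verbs prints the provider-discovery log, similar to the libfabric debug output quoted later in this thread, which can show why the verbs provider is being rejected.)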

@vegetableysm (Collaborator)

In addition, the "--rdma_endpoint" parameter needs to include port information with the address, such as:
./vineyardd --rdma_endpoint=ipv4_addr:port

@hsh258 (Author) commented Oct 28, 2024

apt-cache search libfabric

Hi! Could you give me more details? For example, specific error messages like this: [screenshot of error messages]
And could you give me your command to run vineyardd? Thanks.
By the way, you can install fabtests to run fi_info. Please make sure the fabtests version is compatible with libfabric.

Hi, here is the command: ./vineyardd --rdma_endpoint fd00:80:2200:3205::1207:b02
fi_info -p verbs fi_getinfo: -61 (No data available) lrwxrwxrwx. 1 root root 19 Oct 26 09:17 libfabric.so -> libfabric.so.1.24.0 lrwxrwxrwx. 1 root root 19 Oct 26 09:17 libfabric.so.1 -> libfabric.so.1.24.0 -rwxr-xr-x. 1 root root 1187520 Oct 26 09:17 libfabric.so.1.24.0

Does "fd00:80:2200:3205::1207:b02" is an ipv6 address? Currently vineyard does not support ipv6 address resolution, please try it again with ipv4 address. Additionally, rdma devices requires root privileges. Are you doing this as root?

By the way, the RDMA module of vineyard is based on libfabric, so if the libfabric tool "fi_info" can't see the RDMA device information, vineyard can't get it either.

Hi, it is an IPv6 address, and I am logged in as root.
Detailed error info:
libfabric:2795:1730110628::core:core:fi_param_get_():372 variable perf_cntr=
libfabric:2795:1730110628::core:core:fi_param_get_():372 variable hook=
libfabric:2795:1730110628::core:core:fi_param_get_():372 variable hmem=
libfabric:2795:1730110628::core:core:ofi_hmem_init():658 Hmem iface FI_HMEM_CUDA not supported
libfabric:2795:1730110628::core:core:ofi_hmem_init():658 Hmem iface FI_HMEM_ROCR not supported
libfabric:2795:1730110628::core:core:ofi_hmem_init():658 Hmem iface FI_HMEM_ZE not supported
libfabric:2795:1730110628::core:core:ofi_hmem_init():658 Hmem iface FI_HMEM_NEURON not supported
libfabric:2795:1730110628::core:core:ofi_hmem_init():658 Hmem iface FI_HMEM_SYNAPSEAI not supported
libfabric:2795:1730110628::core:core:fi_param_get_():372 variable hmem_disable_p2p=
libfabric:2795:1730110628::core:mr:ofi_monitors_init():222 Initializing memory monitor uffd
libfabric:2795:1730110628::core:mr:ofi_monitors_init():222 Initializing memory monitor memhooks
libfabric:2795:1730110628::core:mr:ofi_monitors_init():222 Initializing memory monitor cuda
libfabric:2795:1730110628::core:mr:ofi_monitors_init():222 Initializing memory monitor cuda_ipc
libfabric:2795:1730110628::core:mr:ofi_monitors_init():222 Initializing memory monitor rocr
libfabric:2795:1730110628::core:mr:ofi_monitors_init():222 Initializing memory monitor rocr_ipc
libfabric:2795:1730110628::core:mr:ofi_monitors_init():222 Initializing memory monitor xpmem
libfabric:2795:1730110628::core:mr:ofi_monitors_init():222 Initializing memory monitor ze
libfabric:2795:1730110628::core:mr:ofi_monitors_init():222 Initializing memory monitor import
libfabric:2795:1730110628::core:core:fi_param_get_():372 variable mr_cache_max_size=
libfabric:2795:1730110628::core:core:fi_param_get_():372 variable mr_cache_max_count=
libfabric:2795:1730110628::core:core:fi_param_get_():372 variable mr_cache_monitor=
libfabric:2795:1730110628::core:core:fi_param_get_():372 variable mr_cuda_cache_monitor_enabled=
libfabric:2795:1730110628::core:core:fi_param_get_():372 variable mr_rocr_cache_monitor_enabled=
libfabric:2795:1730110628::core:core:fi_param_get_():372 variable mr_ze_cache_monitor_enabled=
libfabric:2795:1730110628::core:mr:ofi_default_cache_size():83 default cache size=5633248768
libfabric:2795:1730110628::core:mr:ofi_monitors_init():306 Default memory monitor is: memhooks
libfabric:2795:1730110628::core:core:fi_param_get_():372 variable provider=
libfabric:2795:1730110628::core:core:fi_param_get_():372 variable universe_size=
libfabric:2795:1730110628::core:core:fi_param_get_():372 variable av_remove_cleanup=
libfabric:2795:1730110628::core:core:fi_param_get_():372 variable offload_coll_provider=
libfabric:2795:1730110628::core:core:fi_param_get_():372 variable provider_path=
libfabric:2795:1730110628::core:core:ofi_register_provider():518 registering provider: udp (121.0)
libfabric:2795:1730110628::core:core:ofi_register_provider():518 registering provider: sockets (121.0)
libfabric:2795:1730110628::tcp:core:fi_param_get_():372 variable prov_name=
libfabric:2795:1730110628::tcp:core:fi_param_get_():372 variable port_high_range=
libfabric:2795:1730110628::tcp:core:fi_param_get_():372 variable port_low_range=
libfabric:2795:1730110628::tcp:core:fi_param_get_():372 variable tx_size=
libfabric:2795:1730110628::tcp:core:fi_param_get_():372 variable rx_size=
libfabric:2795:1730110628::tcp:core:fi_param_get_():372 variable max_inject=
libfabric:2795:1730110628::tcp:core:fi_param_get_():372 variable max_saved=
libfabric:2795:1730110628::tcp:core:fi_param_get_():372 variable max_saved_size=
libfabric:2795:1730110628::tcp:core:fi_param_get_():372 variable max_rx_size=
libfabric:2795:1730110628::tcp:core:fi_param_get_():372 variable nodelay=
libfabric:2795:1730110628::tcp:core:fi_param_get_():372 variable staging_sbuf_size=
libfabric:2795:1730110628::tcp:core:fi_param_get_():372 variable prefetch_rbuf_size=
libfabric:2795:1730110628::tcp:core:fi_param_get_():372 variable zerocopy_size=
libfabric:2795:1730110628::tcp:core:fi_param_get_():372 variable trace_msg=
libfabric:2795:1730110628::tcp:core:fi_param_get_():372 variable disable_auto_progress=
libfabric:2795:1730110628::tcp:core:fi_param_get_():372 variable io_uring=
libfabric:2795:1730110628::core:core:ofi_register_provider():518 registering provider: tcp (121.0)
libfabric:2795:1730110628::core:core:ofi_register_provider():518 registering provider: ofi_hook_noop (121.0)
libfabric:2795:1730110628::core:core:ofi_register_provider():518 registering provider: off_coll (121.0)

By the way, when will IPv6 be supported?

@vegetableysm (Collaborator) commented Oct 28, 2024

I think one of the reasons vineyard was unable to create an RDMA server is the IPv6 address. But I don't know why fi_info can't get device information. If fi_info does not get device information, vineyard theoretically cannot get device information either, even if it is using IPv4.

And ipv6 support is not in our short-term plans at the moment. You can open a new issue about the ipv6 support and we may support ipv6 in the future. Thanks.

@hsh258 (Author) commented Oct 28, 2024

I think one of the reasons vineyard was unable to create an RDMA server is the IPv6 address. But I don't know why fi_info can't get device information. If fi_info does not get device information, vineyard theoretically cannot get device information either, even if it is using IPv4.

And ipv6 support is not in our short-term plans at the moment. You can open a new issue about the ipv6 support and we may support ipv6 in the future. Thanks.

Hi,
Do we need to install any other packages besides librdmacm.so and libibverbs.so in the container scenario? For example OFED, and so on.
About fi_info: I use it by copying libfabric/util/fi_info and libfabric/util/.libs/ into the container. Is this method okay?
With IPv4 the behavior is the same as with IPv6: fi_getinfo returns -FI_ENODATA too.
In the container (IPv4 or IPv6), the "rdma link" command can list the RDMA devices, but "fi_info -p verbs" shows nothing:
rdma link
link mlx5_2/1 state DOWN physical_state DISABLED
link mlx5_3/1 state DOWN physical_state DISABLED
link mlx5_4/1 state ACTIVE physical_state LINK_UP
link mlx5_5/1 state ACTIVE physical_state LINK_UP
link mlx5_6/1 state DOWN physical_state DISABLED
link mlx5_7/1 state DOWN physical_state DISABLED
link mlx5_8/1 state ACTIVE physical_state LINK_UP
link mlx5_9/1 state ACTIVE physical_state LINK_UP
link mlx5_10/1 state DOWN physical_state DISABLED
link mlx5_11/1 state DOWN physical_state DISABLED
link mlx5_12/1 state ACTIVE physical_state LINK_UP
link mlx5_13/1 state ACTIVE physical_state LINK_UP
link mlx5_14/1 state DOWN physical_state DISABLED
link mlx5_15/1 state DOWN physical_state DISABLED
link mlx5_16/1 state ACTIVE physical_state LINK_UP
link mlx5_17/1 state ACTIVE physical_state LINK_UP
link mlx5_bond_1/1 state ACTIVE physical_state LINK_UP
link mlx5_0/1 state ACTIVE physical_state LINK_UP
link mlx5_1/1 state ACTIVE physical_state LINK_UP
link mlx5_18/1 state DOWN physical_state DISABLED
link mlx5_19/1 state DOWN physical_state DISABLED
link mlx5_20/1 state DOWN physical_state DISABLED
link mlx5_21/1 state DOWN physical_state DISABLED
link mlx5_22/1 state DOWN physical_state DISABLED
link mlx5_23/1 state DOWN physical_state DISABLED
link mlx5_24/1 state DOWN physical_state DISABLED
link mlx5_25/1 state DOWN physical_state DISABLED
link mlx5_26/1 state DOWN physical_state DISABLED
link mlx5_27/1 state DOWN physical_state DISABLED
link mlx5_28/1 state DOWN physical_state DISABLED
link mlx5_29/1 state DOWN physical_state DISABLED
link mlx5_30/1 state DOWN physical_state DISABLED
link mlx5_31/1 state DOWN physical_state DISABLED
link mlx5_32/1 state DOWN physical_state DISABLED
link mlx5_33/1 state DOWN physical_state DISABLED
link mlx5_34/1 state DOWN physical_state DISABLED
link mlx5_35/1 state DOWN physical_state DISABLED
link mlx5_36/1 state DOWN physical_state DISABLED
link mlx5_37/1 state DOWN physical_state DISABLED
link mlx5_38/1 state DOWN physical_state DISABLED
link mlx5_39/1 state DOWN physical_state DISABLED
link mlx5_40/1 state DOWN physical_state DISABLED
link mlx5_41/1 state DOWN physical_state DISABLED
link mlx5_42/1 state DOWN physical_state DISABLED
link mlx5_43/1 state DOWN physical_state DISABLED
link mlx5_44/1 state DOWN physical_state DISABLED
link mlx5_45/1 state DOWN physical_state DISABLED
link mlx5_46/1 state DOWN physical_state DISABLED
link mlx5_47/1 state DOWN physical_state DISABLED
link mlx5_48/1 state ACTIVE physical_state LINK_UP netdev net_101
link mlx5_49/1 state DOWN physical_state DISABLED

@vegetableysm (Collaborator)

libfabric depends on libibverbs, so libibverbs is necessary.

I suggest that you install libfabric and fabtests to use fi_info. Refer to the script below:

For libfabric dependencies (CentOS):

yum -y install rdma-core libibverbs libibverbs-devel

Install libfabric and fabtests:

cd /tmp
wget https://github.com/ofiwg/libfabric/releases/download/v1.22.0/libfabric-1.22.0.tar.bz2
tar xf ./libfabric-1.22.0.tar.bz2
cd libfabric-1.22.0/
./configure --disable-usnic \
            --disable-psm3 \
            --disable-opx \
            --disable-dmabuf_peer_mem \
            --disable-hook_hmem \
            --disable-hook_debug \
            --disable-trace \
            --disable-rxm \
            --disable-psm2 \
            --disable-xpmem \
            --disable-shm \
            --disable-rxd \
            --disable-perf \
            --disable-efa \
            --disable-mrail \
            --enable-verbs \
            --with-cuda=no
make -j
make install

cd /tmp
wget https://github.com/ofiwg/libfabric/releases/download/v1.22.0/fabtests-1.22.0.tar.bz2
tar xf ./fabtests-1.22.0.tar.bz2
cd fabtests-1.22.0
./configure
make -j
make install

Again, vineyard compiles libfabric itself as a submodule, so in theory you only need the ibverbs library to use vineyard's RDMA support (provided libfabric can also work on its own). You can install the fabtests according to the script above and see if they work (such as fi_rma_bw / fi_info).
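(As a rough example of the usual fabtests convention, with a placeholder address: start fi_rma_bw -p verbs on the server side, then run fi_rma_bw -p verbs <server_address> on the client side; the client passes the server's address.)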

@hsh258 (Author) commented Oct 29, 2024

./vineyardd --rdma_endpoint=ipv4_addr:port

Hi, for ./vineyardd --rdma_endpoint=ipv4_addr:port:
should ipv4_addr include a mask? Is the format 1.2.3.4/24:1234 or 1.2.3.4:1234?

@vegetableysm (Collaborator)

./vineyardd --rdma_endpoint=ipv4_addr:port

Hi, for ./vineyardd --rdma_endpoint=ipv4_addr:port: should ipv4_addr include a mask? Is the format 1.2.3.4/24:1234 or 1.2.3.4:1234?

Without a mask. It should be the IPv4 address of the RDMA device.

@vegetableysm (Collaborator) commented Oct 30, 2024

./vineyardd --rdma_endpoint=ipv4_addr:port

Hi, for ./vineyardd --rdma_endpoint=ipv4_addr:port: should ipv4_addr include a mask? Is the format 1.2.3.4/24:1234 or 1.2.3.4:1234?

The format of "ipv4:port" only affects the parsing of the port. The reason that it cannot use IPv6 is the same, as it parses the content after the first ":" as the port. The currently specified RDMA IPv4 address will not take effect; instead, it will automatically look for the first suitable RDMA device.
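A minimal sketch of that failure mode (illustrative only, not vineyard's actual parser):

# Illustrative only: splitting an endpoint on the first ':' works for
# IPv4 but truncates an IPv6 address at its first group.
def parse_endpoint(endpoint):
    host, _, port = endpoint.partition(":")
    return host, port

print(parse_endpoint("1.2.3.4:9600"))       # ('1.2.3.4', '9600') -- as intended
print(parse_endpoint("fd00:80:2200::b02"))  # ('fd00', '80:2200::b02') -- host is mangled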

Refer to
#2005
#2006

This WIP PR supports specifying a particular RDMA device by indicating its IPv4 address. However, it cannot be merged into the main branch for now because the CI failed. Refer to:
#2008

@vegetableysm (Collaborator) commented Oct 30, 2024

Therefore, I suggest that it is a priority to ensure that fi_info can get the RDMA device information. If fi_info can retrieve the RDMA device information, then vineyard should initialize successfully. If fi_info cannot get the RDMA device information, then vineyard will not be able to initialize either.

@hsh258 (Author) commented Oct 30, 2024

./vineyardd --rdma_endpoint=ipv4_addr:port

Hi, for ./vineyardd --rdma_endpoint=ipv4_addr:port: should ipv4_addr include a mask? Is the format 1.2.3.4/24:1234 or 1.2.3.4:1234?

The format of "ipv4:port" only affects the parsing of the port. The reason that it cannot use IPv6 is the same, as it parses the content after the first ":" as the port. The currently specified RDMA IPv4 address will not take effect; instead, it will automatically look for the first suitable RDMA device.

Refer to #2005 #2006

This WIP PR supports specifying a particular RDMA device by indicating its IPv4 address. However, it cannot be merged into the main branch for now because the CI failed. Refer to: #2008

Hi,
As I only have an IPv6 environment, I will try IPv6. Is it OK if vineyard parses "ipv6:port" just to get the port? Thanks.
By the way, on the RDMA client side read_env("VINEYARD_RDMA_ENDPOINT") is called; who writes VINEYARD_RDMA_ENDPOINT?
If the RPCServer runs a TCP server and an RDMAServer on different ports at the same time, will they conflict? Thanks.

@vegetableysm (Collaborator) commented Oct 31, 2024

Is it OK if vineyard parses "ipv6:port" just to get the port?

No, but you can give a fake IPv4 address, because vineyard will automatically look for the first suitable RDMA device. As I said above, specifying the NIC by address will be supported in the next PR.

By the way, on the RDMA client side read_env("VINEYARD_RDMA_ENDPOINT") is called; who writes VINEYARD_RDMA_ENDPOINT?

If you provide rdma_endpoint when trying to connect to vineyardd, this environment variable will not be read. If you don't provide it, the client will try to read the environment variable. Environment variables are set by the user.

If the RPCServer runs a TCP server and an RDMAServer on different ports at the same time, will they conflict?

They won't conflict.
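For example (a sketch based on the client snippets later in this thread; the address and ports are placeholders):

import os
# The server's RDMA endpoint; the client reads this variable when no
# rdma_endpoint is passed explicitly.
os.environ["VINEYARD_RDMA_ENDPOINT"] = "10.13.228.2:9601"

import vineyard
rpc_client = vineyard.connect('vineyardd-sample-rpc.admin', 9600)  # the RPC port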

@hsh258 (Author) commented Oct 31, 2024

Is it OK if vineyard parses "ipv6:port" just to get the port?

No, but you can give a fake IPv4 address, because vineyard will automatically look for the first suitable RDMA device. As I said above, specifying the NIC by address will be supported in the next PR.

By the way, on the RDMA client side read_env("VINEYARD_RDMA_ENDPOINT") is called; who writes VINEYARD_RDMA_ENDPOINT?

If you provide rdma_endpoint when trying to connect to vineyardd, this environment variable will not be read. If you don't provide it, the client will try to read the environment variable. Environment variables are set by the user.

If the RPCServer runs a TCP server and an RDMAServer on different ports at the same time, will they conflict?

They won't conflict.

Hi,
As the client, when the user sets VINEYARD_RDMA_ENDPOINT, for example 1.2.3.4:1234, is it the client's own address or the server's address? Thanks.

@vegetableysm (Collaborator) commented Oct 31, 2024

As the client, when the user sets VINEYARD_RDMA_ENDPOINT, for example 1.2.3.4:1234, is it the client's own address or the server's address? Thanks.

The client should use the exact IPv4 address of the vineyard server. Suppose the server's NIC is at the address 1.2.3.4 and the vineyard RDMA server uses port 1234; you should then use 1.2.3.4:1234 as the client's VINEYARD_RDMA_ENDPOINT. It is also currently not possible for the client to specify the NIC used to send the data, so this field means the address of the server.

As I said above, specifying the NIC by address will be supported in the next PR. This covers both the NIC used by the server to receive data and the NIC used by the client to send data.

To summarize: there is currently no way for the server to specify the NIC, and the server will automatically select an appropriate NIC to listen for RDMA messages. The rdma endpoint on the client side is the address of the server. It is also currently not possible for the client to specify the NIC used for sending data. The feature to specify the NIC will be supported in the PR mentioned above, but it cannot currently be merged into the main branch.

@hsh258 (Author) commented Oct 31, 2024

As the client, when the user sets VINEYARD_RDMA_ENDPOINT, for example 1.2.3.4:1234, is it the client's own address or the server's address? Thanks.

The client should use the exact IPv4 address of the vineyard server. Suppose the server's NIC is at the address 1.2.3.4 and the vineyard RDMA server uses port 1234; you should then use 1.2.3.4:1234 as the client's VINEYARD_RDMA_ENDPOINT. It is also currently not possible for the client to specify the NIC used to send the data, so this field means the address of the server.

As I said above, specifying the NIC by address will be supported in the next PR. This covers both the NIC used by the server to receive data and the NIC used by the client to send data.

To summarize: there is currently no way for the server to specify the NIC, and the server will automatically select an appropriate NIC to listen for RDMA messages. The rdma endpoint on the client side is the address of the server. It is also currently not possible for the client to specify the NIC used for sending data. The feature to specify the NIC will be supported in the PR mentioned above, but it cannot currently be merged into the main branch.

Hi,
The server has started; the server is configured with its own container address, --rdma_endpoint 60.30.10.50:9600:
I20241031 15:04:05.911379 7 rpc_server.cc:105] Vineyard will listen on 0.0.0.0:9600 for RPC
I20241031 15:04:08.535791 7 rpc_server.cc:109] Vineyard will listen on 60.30.10.2:9600 for RDMA
I20241031 15:04:08.537207 7 meta_service.cc:1195] Instance join: 0
Above, 60.30.10.50 is the RDMA net device address.

However, the service details show the RPC service as TCP, with no RDMA:
kubectl get services -n admin | grep vineyard
vineyard-controller-manager-metrics-service ClusterIP 8443/TCP 21m
vineyard-webhook-service ClusterIP 443/TCP 21m
vineyardd-sample-etcd-service ClusterIP 2379/TCP 8m18s
vineyardd-sample-redis-0 ClusterIP 6379/TCP 8m18s
vineyardd-sample-redis-service ClusterIP 6379/TCP 8m18s
vineyardd-sample-rpc ClusterIP 102.11.82.204 9600/TCP 8m17s

At the same time:
As the client, if VINEYARD_RDMA_ENDPOINT is set to the server address 60.30.10.50:9600, it shows:
Connect RDMA server failed! Fall back to RPC mode. Error:fi_getinfo failed.
Why? Can the client not find the server?
As the client, if VINEYARD_RDMA_ENDPOINT is set to its own address 60.30.10.51:9600, then
rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)
shows:
Connect rdma server failed! retry: 1 times.
This shows the client can find the driver.
The server and client above are on the same node.

@vegetableysm (Collaborator) commented Nov 1, 2024

I20241031 15:04:05.911379 7 rpc_server.cc:105] Vineyard will listen on 0.0.0.0:9600 for RPC
I20241031 15:04:08.535791 7 rpc_server.cc:109] Vineyard will listen on 60.30.10.2:9600 for RDMA

Do not make RDMA and RPC work on the same port if they use the same NIC.
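For example (a sketch; the address is illustrative, and the RPC port flag is inferred from the FLAGS_rpc_socket_port variable quoted later in this thread):
./vineyardd --rpc_socket_port=9600 --rdma_endpoint=60.30.10.50:9601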

@vegetableysm (Collaborator)

As the client, if VINEYARD_RDMA_ENDPOINT is set to the server address 60.30.10.50:9600, it shows:
Connect RDMA server failed! Fall back to RPC mode. Error:fi_getinfo failed.
Why? Can the client not find the server?
As the client, if VINEYARD_RDMA_ENDPOINT is set to its own address 60.30.10.51:9600, then:
rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)

Are the client and server in the same container? If not, can the client's container connect to the server?

You can start the server and client in the same container to test if the vineyard RDMA works.

@hsh258 (Author) commented Nov 2, 2024

As the client, if VINEYARD_RDMA_ENDPOINT is set to the server address 60.30.10.50:9600, it shows:
Connect RDMA server failed! Fall back to RPC mode. Error:fi_getinfo failed.
Why? Can the client not find the server?
As the client, if VINEYARD_RDMA_ENDPOINT is set to its own address 60.30.10.51:9600, then:
rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)

Are the client and server in the same container? If not, can the client's container connect to the server?

You can start the server and client in the same container to test if the vineyard RDMA works.

Hi,
Now, when the client puts data, it hits the following issue. Why?
As the client:

import numpy as np
import vineyard
rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)
Connected to RPC server: vineyardd-sample-rpc.admin:9600, RDMA server: 10.11.228.2:9600
objid = rpc_client.put(np.zeros(8))
mlx5: vineyard-python-client-847f8b8b86-s9t55: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00008813 08000241 0001fdd2
Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python3.8/dist-packages/vineyard/core/client.py", line 823, in put
return put(self, value, builder, persist, name, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/vineyard/core/builder.py", line 197, in put
meta = get_current_builders().run(client, value, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/vineyard/core/builder.py", line 100, in run
return self.factory[ty](client, value, **kw)
File "/usr/local/lib/python3.8/dist-packages/vineyard/data/tensor.py", line 90, in numpy_ndarray_builder
meta.add_member('buffer_', build_numpy_buffer(client, value))
File "/usr/local/lib/python3.8/dist-packages/vineyard/data/utils.py", line 178, in build_numpy_buffer
return build_buffer(client, address, array.nbytes)
File "/usr/local/lib/python3.8/dist-packages/vineyard/data/utils.py", line 157, in build_buffer
return client.create_remote_blob(buffer)
File "/usr/local/lib/python3.8/dist-packages/vineyard/core/client.py", line 575, in create_remote_blob
return self.rpc_client.create_remote_blob(blob_builder)
vineyard._C.InvalidException: Invalid: GetTXCompletion failed:-5

As server:
E20241103 02:35:23.640290 175 rpc_server.cc:203] Receive vineyard request mem!
E20241103 02:35:23.640353 175 rpc_server.cc:208] Receive remote request address: 0x7f04ebbfe040 size: 64
E20241103 02:35:23.640825 175 rpc_server.cc:238] Failed to register mem.
E20241103 02:35:23.645750 273 rpc_server.cc:389] Connection error!Client crashed.

E20241103 02:53:37.765411 178 rpc_server.cc:203] Receive vineyard request mem!
E20241103 02:53:37.765699 178 rpc_server.cc:208] Receive remote request address: 0x7f04ebffe100 size: 4194304
E20241103 02:53:37.765759 178 rpc_server.cc:238] Failed to register mem.
E20241103 02:53:37.771457 273 rpc_server.cc:389] Connection error!Client crashed.

E20241103 03:03:53.214725 181 rpc_server.cc:203] Receive vineyard request mem!
E20241103 03:03:53.214792 181 rpc_server.cc:208] Receive remote request address: 0x7f04ec3fe140 size: 8192
E20241103 03:03:53.214845 181 rpc_server.cc:238] Failed to register mem.
E20241103 02:53:37.771457 273 rpc_server.cc:389] Connection error!Client crashed.

With the same client, not changing anything, the put sometimes succeeds but sometimes fails.
So far I can only narrow it down to the function fi_mr_regattr. But why does it sometimes succeed and sometimes fail in the same environment?

@vegetableysm (Collaborator)

Hi. Could you please show me the complete instructions to start vineyardd and the code that puts an object on the client side? Let me test it locally.

@hsh258 (Author) commented Nov 4, 2024

Hi. Could you please show me the complete instructions to start vineyardd and the code that puts an object on the client side? Let me test it locally.

Hi, the instructions are:
Using the python client, log in:
export VINEYARD_RDMA_ENDPOINT=10.13.228.2:9600  # 10.13.228.2 is the rdma server address
python3

import numpy as np
import vineyard
rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)  # it takes about 20s to succeed
objid = rpc_client.put(np.zeros(8))  # the put sometimes succeeds, but sometimes fails

@vegetableysm (Collaborator)

Hi. Could you please show me the complete instructions to start vineyardd and the code that puts an object on the client side? Let me test it locally.

Hi, the instructions are: using the python client, log in:
export VINEYARD_RDMA_ENDPOINT=10.13.228.2:9600  # 10.13.228.2 is the rdma server address
python3

import numpy as np
import vineyard
rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)  # it takes about 20s to succeed
objid = rpc_client.put(np.zeros(8))  # the put sometimes succeeds, but sometimes fails

And the command for starting vineyardd?

@hsh258 (Author) commented Nov 4, 2024

Hi. Could you please show me the complete instructions to start vineyardd and the code that puts an object on the client side? Let me test it locally.

Hi, the instructions are: using the python client, log in:
export VINEYARD_RDMA_ENDPOINT=10.13.228.2:9600  # 10.13.228.2 is the rdma server address
python3

import numpy as np
import vineyard
rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)  # it takes about 20s to succeed
objid = rpc_client.put(np.zeros(8))  # the put sometimes succeeds, but sometimes fails

And the command for starting vineyardd?
As the server, I set:

json RpcSpecResolver::resolve() const {
  json spec;
  spec["rpc"] = FLAGS_rpc;
  spec["port"] = FLAGS_rpc_socket_port;
  spec["rdma_endpoint"] = "10.13.228.2:9600";
  return spec;
}

then deploy vineyard using the helm chart.

@vegetableysm (Collaborator)

And if registering memory fails, try increasing vineyard's available memory.

@hsh258 (Author) commented Nov 4, 2024

And if registering memory fails, try increasing vineyard's available memory.

Hi,
With 2Gi memory it is the same: the put sometimes succeeds, but sometimes fails.
cat <<EOF | kubectl apply -f -
apiVersion: k8s.v6d.io/v1alpha1
kind: Vineyardd
metadata:
name: vineyardd-sample
namespace: admin
spec:
replicas: 1
metaServiceReplicas: 1
service:
type: ClusterIP
port: 9600
vineyard:
image: test:test/admin/vineyardd:v1
imagePullPolicy: IfNotPresent
cpu: "2"
memory: "2Gi"
securityContext:
privileged: true
EOF

@dashanji (Member) commented Nov 4, 2024

Hi @hsh258. 9600 is the default port for RPC; you should define another, unique port for the RDMA endpoint, such as "10.13.228.2:9601".

@hsh258 (Author) commented Nov 4, 2024

Hi @hsh258. 9600 is the default port for RPC; you should define another, unique port for the RDMA endpoint, such as "10.13.228.2:9601".

Hi,
If I set another port, the connection fails, showing:

export VINEYARD_RDMA_ENDPOINT=10.13.228.2:19601

python3

Python 3.8.10 (default, Sep 11 2024, 16:02:53)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

import numpy as np
import vineyard
rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',19601)
Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python3.8/dist-packages/vineyard/init.py", line 418, in connect
return Client(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/vineyard/core/client.py", line 296, in init
raise ConnectionError(
ConnectionError: Failed to connect to vineyard via both IPC and RPC connection. Arguments, environment variables VINEYARD_IPC_SOCKET and VINEYARD_RPC_ENDPOINT, as well as the configuration file, are all unavailable.
rpc_client = vineyard.connect('10.13.228.2',19601)
Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python3.8/dist-packages/vineyard/init.py", line 418, in connect
return Client(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/vineyard/core/client.py", line 296, in init
raise ConnectionError(
ConnectionError: Failed to connect to vineyard via both IPC and RPC connection. Arguments, environment variables VINEYARD_IPC_SOCKET and VINEYARD_RPC_ENDPOINT, as well as the configuration file, are all unavailable.

@dashanji (Member) commented Nov 4, 2024

The RPC must be connected first when using RDMA; you can try the following code.

export VINEYARD_RDMA_ENDPOINT=10.13.228.2:19601
import numpy as np
import vineyard
rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)

@hsh258 (Author) commented Nov 5, 2024

The RPC must be connected first when using RDMA; you can try the following code.

export VINEYARD_RDMA_ENDPOINT=10.13.228.2:19601
import numpy as np
import vineyard
rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)

Hi, following the steps above,
the issue is the same, especially when the server and client are on different nodes.
In server:
rpc_server.cc:105] Vineyard will listen on 0.0.0.0:9600 for RPC
rpc_server.cc:109] Vineyard will listen on 13.13.229.3:19800 for RDMA
rpc_server.cc:203] Receive vineyard request mem!
rpc_server.cc:208] Receive remote request address: 0x7fe39bfff040 size: 31457280
rpc_server.cc:241] Failed to register mem. size 31457280
rpc_server.cc:392] Connection error!Client crashed.
rpc_server.cc:334] Receive close msg!
rpc_server.cc:400] Get RX completion failed! Error:Client crashed.

In client:
mlx5: vineyard-python-client-65dfffb656-ghc72: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00008813 080004bc 000106d2
Traceback (most recent call last):
File "", line 1, in
File "", line 8, in process_blocks
File "/usr/local/lib/python3.8/dist-packages/vineyard/core/client.py", line 823, in put
return put(self, value, builder, persist, name, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/vineyard/core/builder.py", line 197, in put
meta = get_current_builders().run(client, value, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/vineyard/core/builder.py", line 100, in run
return self.factory[ty](client, value, **kw)
File "/usr/local/lib/python3.8/dist-packages/vineyard/data/tensor.py", line 90, in numpy_ndarray_builder
meta.add_member('buffer_', build_numpy_buffer(client, value))
File "/usr/local/lib/python3.8/dist-packages/vineyard/data/utils.py", line 178, in build_numpy_buffer
return build_buffer(client, address, array.nbytes)
File "/usr/local/lib/python3.8/dist-packages/vineyard/data/utils.py", line 157, in build_buffer
return client.create_remote_blob(buffer)
File "/usr/local/lib/python3.8/dist-packages/vineyard/core/client.py", line 575, in create_remote_blob
return self.rpc_client.create_remote_blob(blob_builder)
vineyard._C.InvalidException: Invalid: GetTXCompletion failed:-5

@dashanji (Member) commented Nov 5, 2024

Could you please add an option (size:1024Mi) to the vineyard yaml as follows and try again?

Hi,
With 2Gi memory it is the same: the put sometimes succeeds, but sometimes fails.

cat <<EOF | kubectl apply -f -
apiVersion: k8s.v6d.io/v1alpha1
kind: Vineyardd
metadata:
name: vineyardd-sample
namespace: admin
spec:
replicas: 1
metaServiceReplicas: 1
service:
type: ClusterIP
port: 9600
vineyard:
size:1024Mi
image: test:test/admin/vineyardd:v1
imagePullPolicy: IfNotPresent
cpu: "2"
memory: "2Gi"
securityContext:
privileged: true
EOF

@hsh258 (Author) commented Nov 5, 2024

size:1024Mi

Hi,
The deployment has an issue with "size":
size:1024Mi
error: error parsing STDIN: error converting YAML to JSON: yaml: line 16: mapping values are not allowed in this context
size:"1024Mi"
error: error parsing STDIN: error converting YAML to JSON: yaml: line 16: mapping values are not allowed in this context

@dashanji (Member) commented Nov 5, 2024

Sorry for the misleading indentation. You could try the following command.

cat <<EOF | kubectl apply -f -
apiVersion: k8s.v6d.io/v1alpha1
kind: Vineyardd
metadata:
  name: vineyardd-sample
  namespace: admin
spec:
  replicas: 1
  service:
    type: ClusterIP
    port: 9600
  vineyard:
    size: 1024Mi
    image: test:test/admin/vineyardd:v1
    imagePullPolicy: IfNotPresent
    cpu: "2"
    memory: "2Gi"
  securityContext:
    privileged: true
EOF

@hsh258 (Author) commented Nov 5, 2024

Sorry for the misleading indentation. You could try the following command.

cat <<EOF | kubectl apply -f -
apiVersion: k8s.v6d.io/v1alpha1
kind: Vineyardd
metadata:
  name: vineyardd-sample
  namespace: admin
spec:
  replicas: 1
  service:
    type: ClusterIP
    port: 9600
  vineyard:
    size: 1024Mi
    image: test:test/admin/vineyardd:v1
    imagePullPolicy: IfNotPresent
    cpu: "2"
    memory: "2Gi"
  securityContext:
    privileged: true
EOF

Hi, the issue above persists:
error: error parsing STDIN: error converting YAML to JSON: yaml: line 16: mapping values are not allowed in this context
As deployed:
cat <<EOF | kubectl apply -f -
apiVersion: k8s.v6d.io/v1alpha1
kind: Vineyardd
metadata:
name: vineyardd-sample
namespace: admin
spec:
replicas: 1
service:
type: ClusterIP
port: 9600
vineyard:
size: 1024Mi
image: test:test/admin/vineyardd:v1
imagePullPolicy: IfNotPresent
cpu: "2"
memory: "2Gi"
securityContext:
privileged: true
EOF

@hsh258 (Author) commented Nov 5, 2024

The rpc must be connected at first while using the rdma, you can try the following code.

export VINEYARD_RDMA_ENDPOINT=10.13.228.2:19601
import numpy as np
import vineyard
rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)

Hi,
Executing "rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)" takes a long time, from a few seconds to over 20 seconds.
Adding debug info, it centers on this call:
CHECK_ERROR(!fi_getinfo(VINEYARD_FIVERSION, server_address.c_str(),
                        std::to_string(port).c_str(), 0, hints,
                        reinterpret_cast<fi_info**>(&(info.fi))));
Could you tell me how to shorten the time? Thanks.
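(A guess, not something confirmed in this thread: since a service name like 'vineyardd-sample-rpc.admin' is passed as the node argument, part of the delay may be hostname resolution inside fi_getinfo; libfabric does provide an FI_NUMERICHOST flag to skip resolution when a numeric address is supplied.)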

@dashanji (Member) commented Nov 5, 2024

Sorry for the misleading indentation. You could try the following command.

cat <<EOF | kubectl apply -f -
apiVersion: k8s.v6d.io/v1alpha1
kind: Vineyardd
metadata:
  name: vineyardd-sample
  namespace: admin
spec:
  replicas: 1
  service:
    type: ClusterIP
    port: 9600
  vineyard:
    size: 1024Mi
    image: test:test/admin/vineyardd:v1
    imagePullPolicy: IfNotPresent
    cpu: "2"
    memory: "2Gi"
  securityContext:
    privileged: true
EOF

Hi, the issue above persists: error: error parsing STDIN: error converting YAML to JSON: yaml: line 16: mapping values are not allowed in this context As deployed: cat <<EOF | kubectl apply -f - apiVersion: k8s.v6d.io/v1alpha1 kind: Vineyardd metadata: name: vineyardd-sample namespace: admin spec: replicas: 1 service: type: ClusterIP port: 9600 vineyard: size: 1024Mi image: test:test/admin/vineyardd:v1 imagePullPolicy: IfNotPresent cpu: "2" memory: "2Gi" securityContext: privileged: true EOF

Does it work now? I think it should be caused by the indentation problem.

@dashanji (Member) commented Nov 5, 2024

The rpc must be connected at first while using the rdma, you can try the following code.

export VINEYARD_RDMA_ENDPOINT=10.13.228.2:19601
import numpy as np
import vineyard
rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)

Hi, executing "rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)" takes a long time, from a few seconds to over 20 seconds. Adding debug info, it centers on the call CHECK_ERROR(!fi_getinfo(VINEYARD_FIVERSION, server_address.c_str(), std::to_string(port).c_str(), 0, hints, reinterpret_cast<fi_info**>(&(info.fi)))). Could you tell me how to shorten the time? Thanks.

It shouldn't be very slow. What's your k8s environment (ack/aws/...) and machine environment?

@hsh258 (Author) commented Nov 5, 2024

Sorry for the misleading indentation. You could try the following command.

cat <<EOF | kubectl apply -f -
apiVersion: k8s.v6d.io/v1alpha1
kind: Vineyardd
metadata:
  name: vineyardd-sample
  namespace: admin
spec:
  replicas: 1
  service:
    type: ClusterIP
    port: 9600
  vineyard:
    size: 1024Mi
    image: test:test/admin/vineyardd:v1
    imagePullPolicy: IfNotPresent
    cpu: "2"
    memory: "2Gi"
  securityContext:
    privileged: true
EOF

Hi, the issue above persists: error: error parsing STDIN: error converting YAML to JSON: yaml: line 16: mapping values are not allowed in this context As deployed: cat <<EOF | kubectl apply -f - apiVersion: k8s.v6d.io/v1alpha1 kind: Vineyardd metadata: name: vineyardd-sample namespace: admin spec: replicas: 1 service: type: ClusterIP port: 9600 vineyard: size: 1024Mi image: test:test/admin/vineyardd:v1 imagePullPolicy: IfNotPresent cpu: "2" memory: "2Gi" securityContext: privileged: true EOF

Does it work now? I think it should be caused by the indentation problem.

Hi,
It still has the issue, so for now I delete "size: 1024Mi" when deploying.

@hsh258 (Author) commented Nov 5, 2024

The rpc must be connected at first while using the rdma, you can try the following code.

export VINEYARD_RDMA_ENDPOINT=10.13.228.2:19601
import numpy as np
import vineyard
rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)

Hi, executing "rpc_client = vineyard.connect('vineyardd-sample-rpc.admin',9600)" takes a long time, from a few seconds to over 20 seconds. Adding debug info, it centers on the call CHECK_ERROR(!fi_getinfo(VINEYARD_FIVERSION, server_address.c_str(), std::to_string(port).c_str(), 0, hints, reinterpret_cast<fi_info**>(&(info.fi)))). Could you tell me how to shorten the time? Thanks.

It shouldn't be very slow. What's your k8s environment (ack/aws/...) and machine environment?

Hi,
kubectl Version: v1.28.3

@dashanji (Member) commented Nov 5, 2024

cat <<EOF | kubectl apply -f -
apiVersion: k8s.v6d.io/v1alpha1
kind: Vineyardd
metadata:
name: vineyardd-sample
namespace: admin
spec:
replicas: 1
service:
type: ClusterIP
port: 9600
vineyard:
size: 1024Mi
image: test:test/admin/vineyardd:v1
imagePullPolicy: IfNotPresent
cpu: "2"
memory: "2Gi"
securityContext:
privileged: true
EOF

How did you install the vineyard operator? Besides, can you copy the code to the shell and try again? It's better to show a screenshot of the failure so that we can check what is wrong.

@hsh258 (Author) commented Nov 5, 2024

cat <<EOF | kubectl apply -f -
apiVersion: k8s.v6d.io/v1alpha1
kind: Vineyardd
metadata:
name: vineyardd-sample
namespace: admin
spec:
replicas: 1
service:
type: ClusterIP
port: 9600
vineyard:
size: 1024Mi
image: test:test/admin/vineyardd:v1
imagePullPolicy: IfNotPresent
cpu: "2"
memory: "2Gi"
securityContext:
privileged: true
EOF

How did you install the vineyard operator? Besides, can you copy the code to the shell and try again? It's better to show a screenshot of the failure so that we can check what is wrong.

Hi,
I used the helm chart to install the vineyard operator.

@dashanji (Member) commented Nov 5, 2024

error: error parsing STDIN: error converting YAML to JSON: yaml: line 16: mapping values are not allowed in this context
This error shouldn't happen. In my test environment, it can work fine as follows.

$ helm repo update
$ kubectl create namespace vineyard-system
$ helm install vineyard-operator vineyard/vineyard-operator -n vineyard-system
$ cat <<EOF | kubectl apply -f -
apiVersion: k8s.v6d.io/v1alpha1
kind: Vineyardd
metadata:
  name: vineyardd-sample
  namespace: admin
spec:
  replicas: 1
  service:
    type: ClusterIP
    port: 9600
  vineyard:
    size: 1024Mi
    image: test:test/admin/vineyardd:v1
    imagePullPolicy: IfNotPresent
    cpu: "2"
    memory: "2Gi"
  securityContext:
    privileged: true
EOF

[screenshot of the successful deployment]

@hsh258 (Author) commented Nov 5, 2024

error: error parsing STDIN: error converting YAML to JSON: yaml: line 16: mapping values are not allowed in this context
This error shouldn't happen. In my test environment, it can work fine as follows.

$ helm repo update
$ kubectl create namespace vineyard-system
$ helm install vineyard-operator vineyard/vineyard-operator -n vineyard-system
$ cat <<EOF | kubectl apply -f -
apiVersion: k8s.v6d.io/v1alpha1
kind: Vineyardd
metadata:
  name: vineyardd-sample
  namespace: admin
spec:
  replicas: 1
  service:
    type: ClusterIP
    port: 9600
  vineyard:
    size: 1024Mi
    image: test:test/admin/vineyardd:v1
    imagePullPolicy: IfNotPresent
    cpu: "2"
    memory: "2Gi"
  securityContext:
    privileged: true
EOF

Hi,
I tried it and it deploys now, thanks.
By the way, connecting still takes a long time.

@dashanji (Member) commented Nov 5, 2024

I think it is likely caused by your environment. Can RDMA work now?

@hsh258 (Author) commented Nov 5, 2024

I think it is likely caused by your environment. Can RDMA work now?

Hi,
Right now RDMA doesn't work normally.
It connects very slowly, and putting and getting data is very slow, too.

@hsh258 (Author) commented Nov 6, 2024

I think it is likely caused by your environment. Can RDMA work now?

Hi,
I want to try to set the server IP via fi_getinfo. Is this method feasible? If so, how do I set it? Thanks.

@vegetableysm (Collaborator)

I think it is likely caused by your environment. Can RDMA work now?

Hi, I want to try to set the server IP via fi_getinfo. Is this method feasible? If so, how do I set it? Thanks.

Refer to src/common/rdma/rdma_client.cc, src/common/rdma/rdma_server.cc and https://ofiwg.github.io/libfabric/

The vineyard client gets the server's RDMA device info by calling fi_getinfo with the server IP address as a parameter.

@hsh258 (Author) commented Nov 7, 2024

I think it is likely caused by your environment. Can RDMA work now?

Hi, right now RDMA doesn't work normally. It connects very slowly, and putting and getting data is very slow, too.

Hi,
About the slow RDMA write and read speed: I find the transmission gets stuck for over 100 milliseconds every few hundred or 1000 packets.
Two "RC send only qp=0x000274" events are separated by over 100 milliseconds.
I want to try changing tx_ctx_cnt and rx_ctx_cnt of verbs_domain_attr. Is this method feasible? Or is there another way? Thanks.

@vegetableysm (Collaborator) commented Nov 7, 2024

Since there is no way to replicate your current environment, and vineyard has no IPv6 support, it may be difficult to give appropriate advice.

I want to try changing tx_ctx_cnt and rx_ctx_cnt of verbs_domain_attr. Is this method feasible? Or is there another way? Thanks.

We haven't tried changing this field, so I suggest you ask in the libfabric community.

@hsh258 (Author) commented Nov 8, 2024

Since there is no way to replicate your current environment, and vineyard has no IPv6 support, it may be difficult to give appropriate advice.

I want to try changing tx_ctx_cnt and rx_ctx_cnt of verbs_domain_attr. Is this method feasible? Or is there another way? Thanks.

We haven't tried changing this field, so I suggest you ask in the libfabric community.

Hi,
the environment is IPv4; the speed of RDMA write and read is slow.

@vegetableysm (Collaborator) commented Nov 11, 2024

Hi, the environment is IPv4; the speed of RDMA write and read is slow.

Hi, what is the size of the blob you are testing now? At present the vineyard RDMA module still needs tuning for some workloads, so a blob only has an advantage over TCP when it is larger than 4M.
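A quick way to see where the crossover sits (a sketch built from the client calls already shown in this thread; the service name and port are placeholders):

import time

import numpy as np
import vineyard

rpc_client = vineyard.connect('vineyardd-sample-rpc.admin', 9600)
for mb in (1, 4, 16, 32):
    data = np.zeros(mb * 1024 * 1024 // 8)  # float64, 8 bytes per element
    start = time.time()
    obj_id = rpc_client.put(data)
    print(mb, "MiB put in", round(time.time() - start, 3), "s")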

@hsh258 (Author) commented Nov 12, 2024

Hi, the environment is IPv4; the speed of RDMA write and read is slow.

Hi, what is the size of the blob you are testing now? At present the vineyard RDMA module still needs tuning for some workloads, so a blob only has an advantage over TCP when it is larger than 4M.

Hi,
The blob is 30M, and I find the main reason is that recv_message in io.cc costs about 60ms between 30M requests.
That is to say, it takes about 60ms for the client to send the next memory request after the server deregisters memory.
How can this time be shortened? Thanks.
