rxe failed connectivity test #51

Open
mcfatealan opened this issue Sep 7, 2016 · 30 comments
@mcfatealan

Hi @monis410, I'm an RDMA beginner. I've run into a problem very similar to the previous issue (#49).

mcfatealan@mcfatealan-desktop:~/librxe-dev$ ibv_devices
libibverbs: Warning: couldn't load driver 'rxe': librxe-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
    device                 node GUID
    ------              ----------------

I worked around it by moving /usr/lib64/* to /usr/lib/, but after that I had problems with the connectivity tests.
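(A less invasive alternative might have been to symlink the provider library instead of moving everything; the source path is an assumption based on the warning above:)

# expose the rxe provider where libibverbs looks for it, leaving /usr/lib64 intact
sudo ln -s /usr/lib64/librxe-rdmav2.so /usr/lib/librxe-rdmav2.so
sudo ldconfig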

My OS is Ubuntu 16.04 LTS (4.7.0-rc3+).

Some of my test results:

mcfatealan@mcfatealan-desktop:~/librxe-dev$ sudo rxe_cfg start 
sh: echo: I/O error
sh: echo: I/O error
sh: echo: I/O error
sh: echo: I/O error
  Name    Link  Driver  Speed  NMTU  IPv4_addr      RDEV  RMTU          
  enp5s0  yes   r8169          1500  192.168.10.19  rxe0  1024  (3)  

mcfatealan@mcfatealan-desktop:~/librxe-dev$ lsmod | grep rxe
rdma_rxe              102400  0
ip6_udp_tunnel         16384  1 rdma_rxe
udp_tunnel             16384  1 rdma_rxe
ib_core               208896  6 rdma_cm,ib_cm,iw_cm,ib_uverbs,rdma_rxe,rdma_ucm

mcfatealan@mcfatealan-desktop:~/librxe-dev$ lsmod | grep ib_uverbs
ib_uverbs              61440  1 rdma_ucm
ib_core               208896  6 rdma_cm,ib_cm,iw_cm,ib_uverbs,rdma_rxe,rdma_ucm

mcfatealan@mcfatealan-desktop:~/librxe-dev$ ibv_devices
    device                 node GUID
    ------              ----------------
    rxe0                be5ff4fffe3acd36

mcfatealan@mcfatealan-desktop:~/librxe-dev$ ibv_devinfo -d rxe0
hca_id: rxe0
    transport:          InfiniBand (0)
    fw_ver:             0.0.0
    node_guid:          be5f:f4ff:fe3a:cd36
    sys_image_guid:         0000:0000:0000:0000
    vendor_id:          0x0000
    vendor_part_id:         0
    hw_ver:             0x0
    phys_port_cnt:          1
        port:   1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     1024 (3)
            sm_lid:         0
            port_lid:       0
            port_lmc:       0x00
            link_layer:     Ethernet

Then I tested connectivity both on a single machine (self-to-self) and between a physical machine and a virtual machine. The machines can ping each other, so basic network connectivity is fine. The test result is exactly the same in both setups.

server:
mcfatealan@mcfatealan-desktop:~/librxe-dev$ ibv_rc_pingpong -g 0 -d rxe0 -i 1
  local address:  LID 0x0000, QPN 0x000011, PSN 0x2b7bf6, GID fe80::be5f:f4ff:fe3a:cd36
  remote address: LID 0x0000, QPN 0x000012, PSN 0x4255a9, GID fe80::be5f:f4ff:fe3a:cd36
//hanging...

client:
mcfatealan@mcfatealan-desktop:~$ ibv_rc_pingpong -g 0 -d rxe0 -i 1 192.168.10.19
  local address:  LID 0x0000, QPN 0x000012, PSN 0x4255a9, GID fe80::be5f:f4ff:fe3a:cd36
  remote address: LID 0x0000, QPN 0x000011, PSN 0x2b7bf6, GID fe80::be5f:f4ff:fe3a:cd36
//hanging...




server:
mcfatealan@mcfatealan-desktop:~/librxe-dev$ rping -s -a 192.168.10.19 -v -C 10
//hanging...

client:
mcfatealan@mcfatealan-desktop:~/librxe-dev$ rping -c -a 192.168.10.19 -v -C 10
//hanging...

Could you help me take a look at it? Thank you so much!

BTW, does Soft-RoCE work with python-rdma (https://github.com/jgunthorpe/python-rdma)? I tested that too and it failed; I'm not sure whether the two problems share the same root cause.

@yonatanco

Hello,
I know this issue; it's a race condition that we recently fixed. I sent a fix for 4.8-rc5.
You can work with upstream instead of GitHub to stay up to date.

BTW: are you working with a Mellanox HCA, or some Ethernet NIC like Intel or Broadcom?

@mcfatealan
Author

Hi @yonatanco, thanks so much for responding! I'm trying 4.8-rc5 now; I'll report back with my results.

BTW, here's my hardware info:

mcfatealan@mcfatealan-desktop:~$ lspci | grep 'Ethernet'
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)

@mcfatealan
Author

Oops, still the same problem:

mcfatealan@mcfatealan-desktop:~$ uname -a
Linux mcfatealan-desktop 4.8.0-rc5 #1 SMP Mon Sep 12 14:12:15 CST 2016 x86_64 x86_64 x86_64 GNU/Linux

mcfatealan@mcfatealan-desktop:~/librxe-dev$ ibv_rc_pingpong -g 0 -d rxe0 -i 1
  local address:  LID 0x0000, QPN 0x000011, PSN 0xeefb20, GID fe80::be5f:f4ff:fe3a:cd36
  remote address: LID 0x0000, QPN 0x000012, PSN 0x365328, GID fe80::be5f:f4ff:fe3a:cd36
//Hanging..

mcfatealan@mcfatealan-desktop:~$ ibv_rc_pingpong -g 0 -d rxe0 -i 1 192.168.10.19
  local address:  LID 0x0000, QPN 0x000012, PSN 0x365328, GID fe80::be5f:f4ff:fe3a:cd36
  remote address: LID 0x0000, QPN 0x000011, PSN 0xeefb20, GID fe80::be5f:f4ff:fe3a:cd36
//Hanging..


@yonatanco

You are using GID 0; try GID 1:

ibv_rc_pingpong -g 1 -d rxe0 -i 1
ibv_rc_pingpong -g 1 -d rxe0 -i 1 192.168.10.19
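To check which GID index maps to which address, the verbose device query lists the GID table (assuming a reasonably recent libibverbs):

ibv_devinfo -v -d rxe0 | grep -i gid
# GID[0] is usually the link-local fe80:: entry;
# GID[1] the IPv4-mapped ::ffff:a.b.c.d entry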

@mcfatealan
Author

mcfatealan commented Sep 12, 2016

Thanks for the reminder, @yonatanco. The result stays the same:

mcfatealan@mcfatealan-desktop:~/librxe-dev$ ibv_rc_pingpong -g 1 -d rxe0 -i 1
  local address:  LID 0x0000, QPN 0x000012, PSN 0x5e5383, GID ::ffff:192.168.10.19
  remote address: LID 0x0000, QPN 0x000013, PSN 0x4c0dd8, GID ::ffff:192.168.10.19

mcfatealan@mcfatealan-desktop:~$ ibv_rc_pingpong -g 1 -d rxe0 -i 1 192.168.10.19 
  local address:  LID 0x0000, QPN 0x000013, PSN 0x4c0dd8, GID ::ffff:192.168.10.19
  remote address: LID 0x0000, QPN 0x000012, PSN 0x5e5383, GID ::ffff:192.168.10.19


@yonatanco


Are you trying to ping using the same host? Loopback?

@mcfatealan
Author

@yonatanco, sorry for the late reply; I didn't receive a notification on the main page.
I'm wondering: is there any issue with rping-ing myself? I can do that on some of my other machines that have an RNIC. I'm new to RDMA, so maybe I'm being silly here...

@anthonyliubin

Hello @yonatanco @mcfatealan,
I am testing Soft-RoCE right now and have hit the same issue you described.
First I used rxe-dev master with kernel 4.0.0, but running rping gave RDMA_CM_EVENT_ADDR_ERROR.
Then I switched to rxe-dev-rxe_submission_v18 with kernel 4.7.0-rc3. rping runs but hangs: the server never receives the RDMA_CM_EVENT_CONNECT_REQUEST event, so the rping server side blocks in sem_wait(). I did this on a single VM, as a loopback test.

Since you mentioned the loopback issue, I also tested between two PCs: one Linux machine and one VM (NAT connection). We ran the rping server on the PC, but when the client ran on the VM the server crashed, with no response to any action.

You said you tried 4.8-rc5. How did you get there: via the rxe-dev branch, or just by upgrading the kernel? I'd like to continue testing, thanks!

Best Regards
Anthony

@mcfatealan
Author

Hi @anthonyliubin, I'm sorry to hear you've hit the same issue. Unfortunately, I never got the test to pass in the end. My goal was to find a temporary way to test my RDMA code while our server was being fixed, and the time spent on this project exceeded my limit, so I had to give up. Still, I'd like to thank @yonatanco for all of his help!

About 4.8-rc5: I just upgraded my kernel.

It's a bit embarrassing that my answer may not help much, but that's all I know. Hope for the best!

@anthonyliubin

Hi @mcfatealan,
Thanks for your response.
One question: if we do not use the rxe-dev branch and just upgrade the kernel, how do we keep the rxe driver in the new kernel? My understanding is that a freshly compiled kernel does not include rxe. Do we need to port rxe? That would be a lot of work.
If you could give a simple explanation of how you upgraded, it would help us a lot! Thanks.

Best Regards
Anthony

@mcfatealan
Author

I'm not 100% sure since it's been a while, but according to @yonatanco's description it seems that rxe is already included in 4.8.0? I suggest you give it a try :)
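One quick way to confirm on a given kernel (CONFIG_RDMA_RXE is the upstream Kconfig symbol; the config file path assumes a standard distro layout):

grep RDMA_RXE /boot/config-$(uname -r)
# CONFIG_RDMA_RXE=m means the soft-RoCE driver is available as a module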

@anthonyliubin

Hi @mcfatealan,
Thanks for your help. rxe is indeed included in 4.8-rc5.
We have tested this case on 4.7 and 4.8-rc5, and both now work (ibv_rc_pingpong and rping).
In our testing we needed two PCs, a bridged connection (no NAT when using VMs), and iptables rules cleared first.
ibv_rc_pingpong needs GID 1.
Loopback testing is not supported.
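Putting those points together, the test on each host looked roughly like this (interface and server address are placeholders):

# on both hosts, clear firewall rules first
sudo iptables -F; sudo iptables -t mangle -F

# server side
ibv_rc_pingpong -g 1 -d rxe0 -i 1
# client side, pointing at the server's IP
ibv_rc_pingpong -g 1 -d rxe0 -i 1 <server-ip>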

Best Regards
Anthony

@mcfatealan
Author

@anthonyliubin congrats, so glad to hear that :) The points you mention are very helpful; maybe I'll test again next time following your experience.

@oTTer-Chief

Hi,
I'm trying to get rxe running on Debian 8.7 with kernel 4.8.15 (rdma_rxe version 0.2) and am facing exactly the same issues: neither rping nor ibv_rc_pingpong sends data when both ends run on the same machine.

In case of rping I get:

hutter@cbm01:~$ tail -n1 /var/log/messages
Jan 25 13:28:05 cbm01 kernel: [54644.488129] detected loopback device

Should this work in loopback/on the same machine, or is that unsupported?

@anthonyliubin

hi, @oTTer-Chief

In my testing it did not work in loopback/on the same machine.
You could try it with two PCs.

Best Regards
Anthony

@oTTer-Chief

Hi @anthonyliubin ,

I tried testing between two VMs and that worked.
Nevertheless, I wonder whether loopback is intended to work and my setup has an error, or whether loopback is explicitly unsupported.
With real RDMA hardware such as InfiniBand I can send to the same machine, so I would assume the software implementation can do the same.

@Hunter21007

Hi all,

Communication with the same machine is also required by the GlusterFS RDMA transport... (which I was not able to get working on Linux 4.9).

@Peng-git-hub

You may try this:
First, make sure messages can pass through the firewall:
iptables -F; iptables -t mangle -F
Then add the IP addresses of both server and client to the “trusted” zone:
firewall-cmd --zone=trusted --add-source=1.1.1.1 --permanent
firewall-cmd --zone=trusted --add-source=1.1.1.2 --permanent
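Note that rules added with --permanent only take effect after a reload; to apply and verify:

firewall-cmd --reload
firewall-cmd --zone=trusted --list-sources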

@Hunter21007

Is this necessary even if the firewall is disabled?

@Peng-git-hub

The default firewall rule rejects unknown connections, so the direct test will be rejected by the remote firewall.

@byronyi

byronyi commented Jul 9, 2017

Any updates? It seems the loopback interface is not functioning for RDMA CM, which is crucial for testing and local development.

@monis410
Contributor

Maintenance of the RXE project on GitHub has stopped. You should move to the upstream Linux kernel for the kernel module and to rdma-core (https://github.com/linux-rdma/rdma-core) for the userspace library to get the latest features and bug fixes.
Note that some of the bugs you hit may have fixes in drivers/infiniband/core (which means they are common to all InfiniBand providers).
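For the userspace side, a minimal sketch of that path (the steps follow rdma-core's README; treat the details as assumptions and check the README for your version):

git clone https://github.com/linux-rdma/rdma-core.git
cd rdma-core
bash build.sh                # development build into ./build
./build/bin/ibv_devices      # the usual test utilities are built there too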

@byronyi

byronyi commented Jul 10, 2017

Thanks for your comment!

@githubfoam

@Hunter21007 I also tried the GlusterFS RDMA transport with kernel 4.9.0. What do you mean by "same machine"? I have two VMs, each with two NICs: one NAT and one host-only. Did you get the GlusterFS RDMA transport running with Soft-RoCE?

@byronyi

byronyi commented May 22, 2018

@githubfoam According to my inquiry on the linux-rdma mailing list, several RXE bugs were fixed in 4.9/4.10/4.11, and the suggestion is to upgrade to 4.14/4.15 (e.g. Ubuntu 18.04 or Debian unstable). If the problem persists, let us know.
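On such a kernel the driver ships in-tree, so (assuming the rxe userspace tools are installed and the NIC is called eth0) bringing rxe up reduces to:

sudo modprobe rdma_rxe
sudo rxe_cfg start
sudo rxe_cfg add eth0
ibv_devices    # rxe0 should now be listed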

@Hunter21007

@githubfoam "Host only" means the GlusterFS server and client on the same machine via 127.0.0.1. No, I did not manage to make it work, and it is now out of scope anyway: GlusterFS RDMA support was dropped, so this is no longer relevant.

@githubfoam

githubfoam commented May 23, 2018

@Hunter21007 could you provide a link showing that glusterfs-rdma support was dropped? The page here points to two links, but both lead nowhere:
https://docs.gluster.org/en/v3/Administrator%20Guide/RDMA%20Transport/
Normally I build two servers NAT-ed on the same network. The GlusterFS server/client works over TCP, but the RDMA transport does not.

@githubfoam

@byronyi I tried what is suggested on this wiki. My nodes run "Ubuntu 16.04.4 LTS, Linux 4.7.0-rc3+" after installing the kernel and userspace, and I can't play ping-pong: the rxe test fails. I don't see how to upgrade to the 4.14/4.15 kernels; with these sources the kernel only goes from "4.4.0-116-generic" to "4.7.0-rc3+".
https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home
One of the contributors says this GitHub repo is no longer maintained and suggests the "upstream kernel + rdma-core" method, i.e. the link below.
https://community.mellanox.com/docs/DOC-2184
So I started trying that. My nodes are "Ubuntu 16.04.4 LTS, kernel Linux 4.17.0-rc6" after the kernel/rdma-core installations. The problem is that the guide has missing steps, like "sudo make install"; at the bottom of the page someone tried it and their steps are different.
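For what it's worth, the gap in that guide is probably the standard rdma-core install sequence; the flags below are assumptions and may differ between versions:

cd rdma-core
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr .
make -j$(nproc)
sudo make install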

@lalith-b

I am able to do ping-pong with rxe, but rdma_cm fails when it comes to GlusterFS RDMA support: port 24008 is never opened because rdma_cm fails with [No Device Found].
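A couple of generic checks when rdma_cm reports "No Device Found" (not Gluster-specific; module and sysfs names as in the outputs earlier in this thread):

sudo modprobe rdma_ucm       # rdma_cm's userspace module must be loaded
ls /sys/class/infiniband     # rxe0 should be listed here
# rdma_cm binds by IP address, so the address being listened on must
# belong to the NIC that rxe is attached to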

@githubfoam

@lalith-b if you read the whole thread: GlusterFS RDMA support was dropped at that time. If you have information that says otherwise, could you please share it? The point where I left off was that TCP worked but RDMA did not.
