Infinite loop in select() sleep with VMA_SPEC=latency #1098

Open
kc-eos opened this issue Dec 16, 2024 · 5 comments

@kc-eos

kc-eos commented Dec 16, 2024

Hi, I encountered an issue when integrating libvma with the Reuters library.
I am running with the latency profile (VMA_SPEC=latency), and the user thread appears to be busy-spinning, stuck inside the following call stack:

#0  0x00007fab9c3eb81f in select () from /lib64/libc.so.6
#1  0x00007faba1ea0ceb in select_call::wait_os(bool) () from /lib64/libvma.so
#2  0x00007faba1e9db17 in io_mux_call::polling_loops() () from /lib64/libvma.so
#3  0x00007faba1e9f565 in io_mux_call::call() () from /lib64/libvma.so
#4  0x00007faba1f08048 in select_helper(int, fd_set*, fd_set*, fd_set*, timeval*, __sigset_t const*) () from /lib64/libvma.so
...

Steps to reproduce

After some investigation, I wrote a simple program that narrows down the usage and reproduces the issue:

#include <iostream>
#include <sys/socket.h>
#include <unistd.h>
#include <sys/select.h>
#include <ctime>

void printTime() {
    std::time_t now = std::time(nullptr);
    std::cout << std::ctime(&now);  // Print current time
}

int main() {
    // Create a stream socket
    int sockfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sockfd < 0) {
        std::cerr << "Error opening socket" << std::endl;
        return 1;
    }

    struct timeval tv;
    tv.tv_sec = 1;  // 1 second
    tv.tv_usec = 0;

    printTime();
    select(0, nullptr, nullptr, nullptr, &tv); // sleep by calling select(0..);
    printTime();

    // Close the socket
    close(sockfd);
    return 0;
}

Environment

  1. RedHat Enterprise Linux 8.7
  2. g++ 8.5.0-15
  3. libvma 9.8.60

Test result

$ g++ select_sleep.cpp -o test
$ ./test ## OK
$ LD_PRELOAD=libvma.so.9.8.60 ./test ## OK
$ LD_PRELOAD=libvma.so.9.8.60 VMA_SPEC=latency ./test ## Failed, the thread gets stuck
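
To make the hang easier to observe, the sleep can be timed and guarded by a watchdog. The following is only a minimal sketch and not part of the original test: `timed_select_sleep`, `on_watchdog`, and the `watchdog_sec` margin are hypothetical names, and it assumes SIGALRM is still delivered normally when libvma is preloaded.

```cpp
#include <sys/select.h>
#include <unistd.h>
#include <csignal>
#include <chrono>

// Watchdog handler: only async-signal-safe calls (write, _exit) are used here.
extern "C" void on_watchdog(int) {
    static const char msg[] = "select() did not return before the watchdog fired\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;
    _exit(2);
}

// Sleep via select(0, ...) for `timeout` and return the measured wall time.
// `watchdog_sec` is a hypothetical safety margin: if select() is still
// running after that many seconds, the process exits with status 2.
std::chrono::milliseconds timed_select_sleep(std::chrono::milliseconds timeout,
                                             unsigned watchdog_sec) {
    std::signal(SIGALRM, on_watchdog);
    alarm(watchdog_sec);

    timeval tv;
    tv.tv_sec = timeout.count() / 1000;
    tv.tv_usec = (timeout.count() % 1000) * 1000;

    const auto start = std::chrono::steady_clock::now();
    select(0, nullptr, nullptr, nullptr, &tv);  // same sleep idiom as the test program
    const auto stop = std::chrono::steady_clock::now();

    alarm(0);  // cancel the watchdog
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
}
```

Calling `timed_select_sleep(std::chrono::milliseconds(1000), 5)` after the `socket()` call in the test program should report roughly 1000 ms in the working configurations. In the failing configuration the process would be expected to exit with status 2 once the watchdog fires, instead of hanging.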

Workaround

After some more trial and error, it seems that this issue can be worked around by disabling VMA_SELECT_POLL_OS_FORCE, i.e.

LD_PRELOAD=libvma.so.9.8.60 VMA_SPEC=latency VMA_SELECT_POLL_OS_FORCE=0 ./test ## OK

However, doing this also resets the other two parameters, VMA_SELECT_SKIP_OS and VMA_SELECT_POLL_OS_RATIO, to their default values. Hence my questions:

  1. Is this behavior by design in libvma, or is it simply a bug?
  2. To work around the issue for the time being, what are the recommended settings for VMA_SELECT_SKIP_OS and VMA_SELECT_POLL_OS_RATIO?
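
For code where the sleep call itself can be changed, another possible workaround is to avoid select() as a sleep primitive. The sketch below is illustrative only: `sleep_without_select` is a hypothetical helper, and it rests on the assumption that nanosleep() is not among the calls libvma intercepts. It does not help when the select() call lives inside a third-party library.

```cpp
#include <time.h>
#include <cerrno>
#include <chrono>

// Sleep for `duration` using nanosleep(), restarting if a signal interrupts it.
// Assumption: nanosleep() is not intercepted by libvma, so this sleep goes
// straight to the kernel regardless of the VMA_SPEC profile in use.
void sleep_without_select(std::chrono::nanoseconds duration) {
    timespec req;
    req.tv_sec = static_cast<time_t>(duration.count() / 1000000000LL);
    req.tv_nsec = static_cast<long>(duration.count() % 1000000000LL);

    timespec rem;
    while (nanosleep(&req, &rem) == -1 && errno == EINTR) {
        req = rem;  // continue with the time that was left
    }
}
```

In the test program this would replace the `select(0, nullptr, nullptr, nullptr, &tv)` line with `sleep_without_select(std::chrono::seconds(1))`.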
@igor-ivanov
Collaborator

Hello @kc-eos, I think this behavior is described at https://github.com/Mellanox/libvma/blob/master/README#L629-L641

@kc-eos
Author

kc-eos commented Dec 19, 2024

Hi @igor-ivanov , thanks for the reply.

I also noticed that description, but I still can't believe that libvma is supposed to block the user thread forever under any circumstances.

Moreover, other libvma documentation states the following:

  1. At: https://github.com/Mellanox/libvma/wiki/Architecture

If the data is routed to/from an supported network adapter, the VMA library intercepts the call and does the bypass work. If the data is passing to/from an unsupported network adapter, the VMA library passes the call to the usual kernel libraries responsible for handling network traffic.

  2. At: https://github.com/Mellanox/libvma/blob/master/README#L616-L627

The duration in micro-seconds (usec) in which to poll the hardware on Rx path before
going to sleep (pending an interrupt blocking on OS select(), poll() or epoll_wait().
The max polling duration will be limited by the timeout the user is using when
calling select(), poll() or epoll_wait().

Based on the above descriptions, I think one could expect the call select(0, nullptr, nullptr, nullptr, &tv); to return after 1 second, as defined by struct timeval tv.
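
The expectation can be illustrated with a small model of "poll for a while, but never longer than the caller's timeout". This is only a sketch of how the README text reads and is not libvma code: `wait_with_bounded_poll`, `poll_budget`, and `user_timeout` are hypothetical names, and the blocking phase is modelled with a plain sleep.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <thread>

using Clock = std::chrono::steady_clock;
using Micros = std::chrono::microseconds;

// Illustrative model only; this is NOT how libvma is implemented.
// Busy-polls `ready` for at most min(poll_budget, user_timeout), then sleeps
// for whatever is left of user_timeout. Returns true if `ready` fired.
bool wait_with_bounded_poll(const std::function<bool()>& ready,
                            Micros poll_budget, Micros user_timeout) {
    const auto start = Clock::now();
    // The polling phase is capped by the caller's timeout.
    const auto poll_end = start + std::min(poll_budget, user_timeout);
    const auto deadline = start + user_timeout;

    while (Clock::now() < poll_end) {
        if (ready()) return true;
    }
    // Polling phase over: block (modelled with sleep_until) until the deadline.
    std::this_thread::sleep_until(deadline);
    return ready();
}
```

Under this model, even an unlimited polling budget (`Micros::max()`) returns once `user_timeout` has elapsed, which matches the behavior expected from the 1-second select() call above.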

Could you please discuss this with the team again to check whether this is really the expected behavior, and consider fixing it?

One more thing: FYI, I tested the same program with an older version of libvma (8.1.4), and it works fine without blocking the user thread!

@igor-ivanov
Collaborator

@galnoam probably some degradation is reported.

@kc-eos
Author

kc-eos commented Dec 24, 2024

Hi @igor-ivanov & @galnoam , Merry Christmas!!

May I know if there is any update on this thread?

@galnoam
Collaborator

galnoam commented Dec 24, 2024

@AlexanderGrissik, check the reported issue?
Thanks.
