
High latency / Lost messages: Pub/Sub 10B at high pub frequency. #198

Open
mauropasse opened this issue Jun 7, 2024 · 11 comments
Assignees
Labels
backlog Long-term improvement or addition

Comments

@mauropasse

Hi! I've been running some ROS2 benchmarks on rmw_zenoh (and other rmw's) using the iRobot ros2-performance framework, this time on a Raspberry Pi 4B with ROS2 Iron.
The system tested consists of two processes, one for the publisher and one for the subscription.

- Pub/Sub - 10B - 2KHz
- Pub/Sub - 100KB - 250Hz
- Pub/Sub - 1MB - 50Hz
- Pub/Sub - 4MB - 10Hz
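For context, a quick back-of-the-envelope look (my arithmetic, not from the benchmark output) at the nominal load of each scenario. The 10B case moves almost no data but has by far the tightest publish period, so it stresses per-message latency rather than bandwidth:

```python
# Nominal throughput and publish period of each pub/sub scenario
# (decimal units assumed for message sizes).
scenarios = {
    "10B @ 2kHz":    (10, 2000),
    "100KB @ 250Hz": (100_000, 250),
    "1MB @ 50Hz":    (1_000_000, 50),
    "4MB @ 10Hz":    (4_000_000, 10),
}

for name, (size_bytes, freq_hz) in scenarios.items():
    throughput_mbs = size_bytes * freq_hz / 1e6  # MB/s on the wire
    period_us = 1e6 / freq_hz                    # time between publishes
    print(f"{name}: {throughput_mbs:.2f} MB/s, period {period_us:.0f} us")
```

The 10B scenario is only 0.02 MB/s but leaves just 500 us between messages, while the 1MB scenario pushes 50 MB/s with a comfortable 20 ms period.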

I noticed a spike in latency for the pub/sub - 10B - 2KHz scenario.
[image: latency plot]
I've been able to reproduce these results at lower pub frequencies as well, and the issue is also noticeable on x86.

From x86 benchmark output logs I got:
[image: benchmark output logs]
On the RPi4B, inspecting events.txt I see every message arriving late, as if the subscription had lost sync and a message is only read when the next one arrives:

Time_ms     Caller                   Code  Description         
1474        test->sub_node           1     msg 0 late. 1699us > 400us
1492        test->sub_node           1     msg 9 late. 1691us > 400us
1492        test->sub_node           1     msg 10 late. 491us > 400us
1495        test->sub_node           1     msg 11 late. 519us > 400us
1496        test->sub_node           1     msg 12 late. 480us > 400us
1499        test->sub_node           1     msg 13 late. 528us > 400us
1500        test->sub_node           1     msg 14 late. 421us > 400us
1502        test->sub_node           1     msg 15 late. 440us > 400us
1504        test->sub_node           1     msg 16 late. 451us > 400us
1506        test->sub_node           1     msg 17 late. 458us > 400us

So it looks like something starts failing once the pub frequency gets high enough.

Something strange: I also ran a single-process test with rclcpp intra-process disabled (pub/sub in the same process) and the issue goes away, so it seems the issue is specific to multi-process applications.

@mjcarroll
Member

Something strange: I also ran a single-process test with rclcpp intra-process disabled (pub/sub in the same process) and the issue goes away, so it seems the issue is specific to multi-process applications.

So just to be clear:

- pub/sub in same process, intra-process on is fine (expected)
- pub/sub in same process, intra-process off is fine (reasonably expected)
- pub/sub in different processes is not fine (unexpected)

@kydos

kydos commented Jun 7, 2024

How are you measuring the latency? If you are doing one-way measurements, then for small data packets the behaviour you are seeing is probably due to batching (at the Zenoh and TCP/IP level). If you want to skip batching, please leverage the ultra-low-latency transport (see https://zenoh.io/blog/2023-10-03-zenoh-dragonite/#support-for-ultra-low-latency).

Don't hesitate to reach out to us on Zenoh's Discord server if you have any questions about configuration.

We'll try to reproduce on our side to see if we get a similar behaviour. On your side, it would be good to check whether you see a similar behaviour running the latency test shipped with Zenoh (see https://github.com/eclipse-zenoh/zenoh/tree/main/examples/examples).
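To illustrate the batching effect mentioned above, here is a toy model (an illustration, not Zenoh's actual batcher) of what happens when a transport flushes small messages only once a batch fills up: the earlier a message lands in the batch, the longer it waits.

```python
# Toy model of batching-induced queueing latency for small messages.
# Assumption: the transport flushes only when `batch_size` messages
# have accumulated (real batchers also flush on timers/size limits).

def batched_latencies(n_msgs, send_period_us, batch_size):
    """Messages are published every `send_period_us` microseconds.
    Returns the queueing delay (us) each message sees before its
    batch is flushed."""
    latencies = []
    for i in range(n_msgs):
        publish_time = i * send_period_us
        # The batch holding message i is flushed when its last member
        # (index rounded up to the next batch boundary) is published.
        last_in_batch = (i // batch_size + 1) * batch_size - 1
        flush_time = last_in_batch * send_period_us
        latencies.append(flush_time - publish_time)
    return latencies

# At 2 kHz (500 us period) with a batch of 8, the first message of each
# batch waits 7 * 500 = 3500 us before it even leaves the process --
# well past a 400 us deadline.
lat = batched_latencies(16, 500, 8)
print(max(lat), min(lat))  # 3500 0
```

This is consistent with every message in the events.txt excerpt showing up hundreds of microseconds late even though the wire is nearly idle.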

-- kydos

@mauropasse
Author

Hi, thanks for the suggestions.
Before running my ROS2 app, I start zenoh with ./rmw_zenohd.
Does this executable take config files as input, like ./rmw_zenohd -c lowlatency.json5, or is there another way?
I'm asking because I don't get any output from ./rmw_zenohd --help.
I'll try to get data from the latency test you mention.

How are you measuring the latency?

The iRobot ros2-performance benchmark prints to the console and generates files with information about resources, latency, publish duration, CPU, memory, etc. We write a timestamp into the message, so when the subscription receives it, it can compute the latency.
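The measurement scheme described above can be sketched as follows (a minimal stand-in, not the ros2-performance code; the message layout and helper names are hypothetical, and one-way timing like this only works when publisher and subscriber share a clock, i.e. run on the same host):

```python
import time

def make_stamped_msg(payload: bytes) -> dict:
    # Hypothetical layout: the publisher stamps its send time into the message.
    return {"stamp_ns": time.monotonic_ns(), "data": payload}

def latency_us(msg: dict) -> float:
    # On receipt, the subscription subtracts the stamp from its own clock.
    return (time.monotonic_ns() - msg["stamp_ns"]) / 1_000.0

msg = make_stamped_msg(b"\x00" * 10)  # a 10 B payload
time.sleep(0.001)                     # stand-in for transport delay (~1 ms)
print(latency_us(msg) > 400)          # True: past the 400 us "late" threshold
```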

@Yadunund
Member

Yadunund commented Jun 7, 2024

Does this executable take config files as input, like ./rmw_zenohd -c lowlatency.json5, or is there another way?

You can find information in our README here on how to configure the session/router with non-default config files.

@kydos

kydos commented Jun 10, 2024

Hello @mauropasse, the lowlatency.json5 configuration should be passed to the ROS applications. It is not essential for the router since, in the default configuration, the router is only used for off-robot communication. This can be done by means of a session configuration, as explained in the "Session and Router Configs" section of the Zenoh RMW README. Let me know if this does the trick.

@imstevenpmwork
Contributor

Hello @mauropasse and @mjcarroll,
Would you mind explaining the configuration of the test you ran so we can replicate it on our side? Ideally the topology configuration file :D

@mauropasse
Author

mauropasse commented Jun 11, 2024

I have tried using the configuration file to enable low latency mode, and it seems to have fixed the issue for the 10B message.
[image: latency plot with low latency mode enabled]

I have used the DEFAULT_RMW_ZENOH_SESSION_CONFIG.json5 but setting lowlatency:true and qos:disabled.
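Those two settings, sketched as they would sit in the session config (the nesting below is an assumption based on Zenoh's config schema; check the schema shipped with your Zenoh version):

```json5
// Sketch of the relevant part of DEFAULT_RMW_ZENOH_SESSION_CONFIG.json5.
{
  transport: {
    unicast: {
      lowlatency: true,  // bypass batching; only valid for small payloads
      qos: {
        enabled: false,  // lowlatency requires QoS to be disabled
      },
    },
  },
}
```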

But for some reason it doesn't work for bigger messages (100KB, 1MB, etc.): the subscription doesn't get any messages, as if the nodes were not discovering each other. Besides this problem, the publisher process seems to hang indefinitely.

I still see the router messages from both processes, [rmw_zenoh_cpp]: Successfully connected to a Zenoh router with id.

I've experienced this behavior both on x86 and on the RPi4.

@imstevenpmwork the test topology is just a subscription to a 100KB message in one process, and a publisher at 250Hz in the other process. Let me know if you need more details.

@imstevenpmwork
Contributor

Hello @mauropasse ,
The lowlatency feature is specifically designed to work within the 64K size limit: it provides a latency optimisation for handling small messages, but without fragmentation support. Regarding the topology, would it be possible to send us the .json file you are loading for the test, so we can make sure we're running exactly the same setup? 😄
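A minimal sketch (an illustration, not Zenoh's actual code; the 64 KiB limit value is assumed) of the fragmentation step that the lowlatency path omits, which is why payloads over the limit stop flowing in that mode:

```python
LIMIT = 64 * 1024  # assumed 64 KiB per-frame transport limit

def fragment(payload: bytes, limit: int = LIMIT) -> list:
    """Split a payload into frames that each fit within the limit.
    The default transport path does this; the lowlatency path has no
    equivalent, so over-limit payloads cannot be sent at all."""
    if not payload:
        return [b""]
    return [payload[i:i + limit] for i in range(0, len(payload), limit)]

def reassemble(frames: list) -> bytes:
    # The receiver concatenates the frames back into the payload.
    return b"".join(frames)

big = bytes(100 * 1024)   # the 100KB test message from the topology
frames = fragment(big)
print(len(frames))        # 2 frames: 64 KiB + 36 KiB
assert reassemble(frames) == big
```

A 10 B message fits in one frame either way, which matches the observation that only the larger scenarios break under lowlatency mode.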

@mauropasse
Author

Hi @imstevenpmwork, yes, I was just testing and found that 64K is the limit above which things stop working.
So what happens on a system that has both small messages at high frequency and big messages? Do we have to miss either the small messages or the big ones?

... would it be possible to send us the .json ...

# Publisher json
{"nodes": [{"node_name": "pub_node","publishers": [{"topic_name": "test","msg_type": "stamped_vector","msg_size": 64000,"freq_hz": 250}]}]}
# Subscriber json
{"nodes": [{"node_name": "sub_node","subscribers":[{"topic_name":"test", "msg_type":"stamped_vector"}]}]}

@imstevenpmwork
Contributor

Hi @mauropasse

I believe the configuration name is a bit confusing and may need to be changed in the future. Essentially, it offers an option to slightly optimize for latency, provided you're dealing with small messages only. If this option is disabled, as it should be when handling large messages, the optimization won't occur. However, this doesn't mean that Zenoh (and Zenoh RMW) in its default configuration is slow for small messages.

Since Zenoh RMW is still under development, we're constantly identifying and fixing issues. That's why reports like yours are incredibly valuable to us 😄

I'll test the topology configuration on our end and let you know if we get similar results. If we do, we'll investigate further to identify the cause and develop a fix to ensure the Zenoh RMW is as performant as Zenoh itself, regardless of the message size used.

@fujitatomoya

/assign @clalancette

@sloretz sloretz added the backlog Long-term improvement or addition label Jul 12, 2024
@clalancette clalancette self-assigned this Aug 21, 2024
8 participants