Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(serial-reopening): bug that prevented the serial link to be reused after closing a session #1624

Merged
merged 14 commits into from
Dec 11, 2024

Conversation

gabrik
Copy link
Contributor

@gabrik gabrik commented Dec 2, 2024

Fixing a bug that prevented the serial link to be reused after closing a session.

Serial link remains 1-to-1 and exclusive, meaning that if a session is opened on a serial device no other sessions can use it.
once the first session closes, then another session can connect to the serial listener.

…opened over a serial link

Signed-off-by: Gabriele Baldoni <[email protected]>
@gabrik gabrik added the bug Something isn't working label Dec 2, 2024
@gabrik gabrik self-assigned this Dec 2, 2024
@gabrik gabrik requested a review from Mallets December 2, 2024 15:01
Copy link

@ur8us ur8us left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gabriele, thank you so much for fixing this bug! I have tested the updated version with zenoh-pico/RP2350/Arduino and confirm there is no manual re-start of the zenoh router required for the microprocessor to start a new session! Now we can continue porting the zenoh-pico library to the STM32/ThreadX platform.

However, according to the practical tests, the current fix still does not make the serial connection as efficient as it should be.

Let me please explain.

The robot is connected to the zenoh router normally, aiming to access ROS2's turtle topics:

2024-12-02T18:00:36.865912Z DEBUG acc-0 ThreadId(12) zenoh_transport::unicast::manager: New transport opened between cedbdf0bcb5dfeffc67b643ca0f80f07 and d9fcf12b3fffa768d878b4042946cf2d - whatami: client, sn resolution: U32, initial sn: 161578089, qos: false, multilink: false, lowlatency: false
2024-12-02T18:00:36.865928Z DEBUG acc-0 ThreadId(12) zenoh_transport::unicast::establishment::accept: New transport link accepted from d9fcf12b3fffa768d878b4042946cf2d to cedbdf0bcb5dfeffc67b643ca0f80f07: TransportLinkUnicast { link: Link { src: serial//dev/ttyACM0, dst: serial/f33dcabb-919a-47df-b7a2-d1b0033a981e, mtu: 1500, is_reliable: false, is_streamed: false }, config: TransportLinkUnicastConfig { direction: Inbound, batch: BatchConfig { mtu: 1500, is_streamed: false, is_compression: false }, priorities: None, reliability: None } }
2024-12-02T18:00:36.873634Z DEBUG rx-1 ThreadId(15) zenoh::net::routing::dispatcher::resource: Register resource turtle1/cmd_vel
2024-12-02T18:00:36.873740Z DEBUG rx-1 ThreadId(15) zenoh::net::routing::dispatcher::pubsub: Face{5, d9fcf12b3fffa768d878b4042946cf2d} Declare subscriber 1 (turtle1/cmd_vel)

Everything goes in a normal way and the robot is able to receive data from the teleop program.

Then, once the microcontroller is restarted, it immediately tries connecting to the zenoh router, over and over again, so to get connected as soon as possible. Is not it a normal behavior? I think, yes. For many reasons, including safety, we should let the robot establish connection to the router as soon as possible. However, currently we get an infinite loop:

2024-12-02T18:00:42.229102Z DEBUG rx-1 ThreadId(15) zenoh_transport::unicast::universal::rx: Transport: d9fcf12b3fffa768d878b4042946cf2d. Message handling not implemented: TransportMessage { body: InitSyn(InitSyn { version: 9, whatami: Client, zid: d9fcf12b3fffa768d878b4042946cf2d, resolution: Resolution(10), batch_size: 2048, ext_qos: None, ext_qos_link: None, ext_auth: None, ext_mlink: None, ext_lowlatency: None, ext_compression: None }) }
2024-12-02T18:00:45.466414Z DEBUG rx-1 ThreadId(15) zenoh_transport::unicast::universal::rx: Transport: d9fcf12b3fffa768d878b4042946cf2d. Message handling not implemented: TransportMessage { body: InitSyn(InitSyn { version: 9, whatami: Client, zid: d9fcf12b3fffa768d878b4042946cf2d, resolution: Resolution(10), batch_size: 2048, ext_qos: None, ext_qos_link: None, ext_auth: None, ext_mlink: None, ext_lowlatency: None, ext_compression: None }) }
2024-12-02T18:00:48.550930Z DEBUG rx-1 ThreadId(15) zenoh_transport::unicast::universal::rx: Transport: d9fcf12b3fffa768d878b4042946cf2d. Message handling not implemented: TransportMessage { body: InitSyn(InitSyn { version: 9, whatami: Client, zid: d9fcf12b3fffa768d878b4042946cf2d, resolution: Resolution(10), batch_size: 2048, ext_qos: None, ext_qos_link: None, ext_auth: None, ext_mlink: None, ext_lowlatency: None, ext_compression: None }) }
2024-12-02T18:00:50.840401Z DEBUG rx-1 ThreadId(15) zenoh_transport::unicast::universal::rx: Transport: d9fcf12b3fffa768d878b4042946cf2d. Message handling not implemented: TransportMessage { body: InitSyn(InitSyn { version: 9, whatami: Client, zid: d9fcf12b3fffa768d878b4042946cf2d, resolution: Resolution(10), batch_size: 2048, ext_qos: None, ext_qos_link: None, ext_auth: None, ext_mlink: None, ext_lowlatency: None, ext_compression: None }) }

And so on, without any luck for the robot to reconnect to the zenoh router, other than to wait 20 seconds (or more, if we are trying to connect with increasing intervals).

This is very impractical.

Question: is it possible to start a new connection, say, immediately when the InitSyn message is received?

Thanks again!

@gabrik
Copy link
Contributor Author

gabrik commented Dec 3, 2024

And so on, without any luck for the robot to reconnect to the zenoh router, other than to wait 20 seconds (or more, if we are trying to connect with increasing intervals).

This is very impractical.

Question: is it possible to start a new connection, say, immediately when the InitSyn message is received?

Thanks again!

Hi @ur8us, what you are experiencing is the session timeout, the microcontroller closes its serial without closing the sesssion.
Thus, on the other side the session its still alive, until the lease timeout its reached. This timeout by defaut is set to 4 times the keepalive interval, but it can be configured: see Zenoh default config.

This has nothing to do with the serial link itself, its a general Zenoh configuration that applies to all links.

@Mallets
Copy link
Member

Mallets commented Dec 3, 2024

Said that, we may still want to support some kind of serial-level (i.e. below zenoh) connection reset to support the use case.

@davecrawley
Copy link

One other question - I note that it can only maintain one Serial connection. However its not clear is that one serial connection per

a) Serial port available on the board (e.g. one computer has 3 full duplex serial ports like the raspberry pi you can maintain 3 connections)
b) Instance of Zenoh router operating (e.g. you could run multiple instances of zenoh router and operate on multiple serial ports)
c) Computer (e.g. even if a computer has many serial ports - like most do - you can only use one of them no matter how many instances of zenoh router are running)

For sure we are going to need to operate with all the serial ports on our Raspberry Pi setup and on other future set ups that we are working on. Its very common to have many low speed peripherals and one of our planned system needs 6 serial ports.

@Mallets
Copy link
Member

Mallets commented Dec 3, 2024 via email

@ur8us
Copy link

ur8us commented Dec 3, 2024

Hi! Let me please struggle for microcontroller's rights :-)

As my experiment has discovered, the current fix does not reset the session by timeout, as desired. Instead, a deadlock occurs:

  1. The microcontroller tries to connect to the zenoh router after being reset;
  2. The zenoh router treats these connect attempts as keep-alives (?) and does not reset the previous session by timeout.
  3. GOTO 1.

Workarounds: force the microcontroller send a "close session" message before resetting -or- wait 20 seconds or more between trying to re-connect to the zenoh router. The first one is impossible, the second one is impractical.

The right way, as I see it: by receiving InitSyn, the zenoh router should immediately drop the old session to be prepared for the new one.

@Mallets
Copy link
Member

Mallets commented Dec 3, 2024

Hi! Let me please struggle for microcontroller's rights :-)

As my experiment has discovered, the current fix does not reset the session by timeout, as desired. Instead, a deadlock occurs:

  1. The microcontroller tries to connect to the zenoh router after being reset;
  2. The zenoh router treats these connect attempts as keep-alives (?) and does not reset the previous session by timeout.
  3. GOTO 1.

Workarounds: force the microcontroller send a "close session" message before resetting -or- wait 20 seconds or more between trying to re-connect to the zenoh router. The first one is impossible, the second one is impractical.

The right way, as I see it: by receiving InitSyn, the zenoh router should immediately drop the old session to be prepared for the new one.

Agree. But as I said in my previous comment, we would need to add some kind of serial-level (i.e. below zenoh) connection reset to support the use case. So, the right way is handling it is at serial level because it's a serial related issue.

Signed-off-by: Gabriele Baldoni <[email protected]>
@gabrik
Copy link
Contributor Author

gabrik commented Dec 4, 2024

Fixed the issue with the timeout
this depends on: ZettaScaleLabs/z-serial#9

Changes from the PR on z-serial needs to be ported to zenoh pico for interoperability purposes: @sashacmc can you do this?

Signed-off-by: Gabriele Baldoni <[email protected]>
Signed-off-by: Gabriele Baldoni <[email protected]>
Signed-off-by: Gabriele Baldoni <[email protected]>
Signed-off-by: Gabriele Baldoni <[email protected]>
Signed-off-by: Gabriele Baldoni <[email protected]>
Signed-off-by: Gabriele Baldoni <[email protected]>
Signed-off-by: Gabriele Baldoni <[email protected]>
Signed-off-by: Gabriele Baldoni <[email protected]>
Copy link
Member

@sashacmc sashacmc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested with Pico in:
eclipse-zenoh/zenoh-pico#823

Signed-off-by: Gabriele Baldoni <[email protected]>
@gabrik
Copy link
Contributor Author

gabrik commented Dec 10, 2024

The CI failures are because z-serial is not yet released.

Signed-off-by: Gabriele Baldoni <[email protected]>
Signed-off-by: Gabriele Baldoni <[email protected]>
@gabrik gabrik enabled auto-merge (squash) December 11, 2024 10:37
@gabrik gabrik merged commit c50717f into main Dec 11, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants