
[Access] Implement keepalive routine with ping-ponging to ws connection in ws controller #6757

Open · wants to merge 22 commits into master

Conversation

@UlyanaAndrukhiv (Contributor) commented Nov 25, 2024

Closes: #6639

Note: #6750 should be merged first

Context

This pull request implements a keepalive routine that monitors WebSocket network connectivity using ping-pong messages.

Key Changes

  • Integrated keepalive logic into the WebSocket controller to detect and handle connectivity issues; the keepalive routine runs independently, managing ping-pong messages and monitoring connection health (see the sketch after this list).
  • Added mechanisms to disable the WebSocket connection and gracefully shut down affected components when issues are detected.
  • Added tests to validate connectivity issue detection and proper shutdown handling, including edge cases like missed pong responses and delayed ping intervals.
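
For illustration, below is a minimal, self-contained sketch of such a ping-pong keepalive loop built directly on gorilla/websocket. The constants mirror PingPeriod and PongWait from this PR, but the function itself and its wiring are assumptions; the actual controller routes pings through sendPing and its own connection abstraction.

```go
package wsexample

import (
	"context"
	"time"

	"github.com/gorilla/websocket"
)

const (
	pongWait   = 10 * time.Second    // max time to wait for a pong after a ping
	pingPeriod = (pongWait * 9) / 10 // must be shorter than pongWait
)

// keepalive pings the peer on every tick and treats each pong as proof of
// liveness by extending the read deadline. If a ping cannot be written, the
// connection is assumed dead and the error is returned to the caller.
// Note: pong frames are processed by the connection's read loop (ReadMessage),
// which must be running elsewhere for the pong handler to fire.
func keepalive(ctx context.Context, conn *websocket.Conn) error {
	conn.SetPongHandler(func(string) error {
		return conn.SetReadDeadline(time.Now().Add(pongWait))
	})

	ticker := time.NewTicker(pingPeriod)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			deadline := time.Now().Add(time.Second)
			if err := conn.WriteControl(websocket.PingMessage, nil, deadline); err != nil {
				return err
			}
		}
	}
}
```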

@codecov-commenter commented Nov 25, 2024

Codecov Report

Attention: Patch coverage is 62.34818% with 93 lines in your changes missing coverage. Please review.

Project coverage is 41.24%. Comparing base (8a3055c) to head (1f5728d).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| engine/access/rest/websockets/controller.go | 69.64% | 27 Missing and 7 partials ⚠️ |
| ...ccess/rest/websockets/mock/websocket_connection.go | 70.00% | 11 Missing and 13 partials ⚠️ |
| engine/access/rest/websockets/connections.go | 0.00% | 18 Missing ⚠️ |
| ...ockets/data_provider/mock/data_provider_factory.go | 73.68% | 2 Missing and 3 partials ⚠️ |
| engine/access/rest/websockets/handler.go | 0.00% | 5 Missing ⚠️ |
| ...ne/access/rest/websockets/data_provider/factory.go | 40.00% | 3 Missing ⚠️ |
| engine/access/rest/router/router.go | 0.00% | 2 Missing ⚠️ |
| ...access/rest/websockets/legacy/websocket_handler.go | 66.66% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6757      +/-   ##
==========================================
- Coverage   41.26%   41.24%   -0.02%     
==========================================
  Files        2061     2064       +3     
  Lines      182702   182900     +198     
==========================================
+ Hits        75384    75438      +54     
- Misses     101010   101132     +122     
- Partials     6308     6330      +22     
| Flag | Coverage Δ |
|---|---|
| unittests | 41.24% <62.34%> (-0.02%) ⬇️ |


@Guitarheroua (Contributor) left a comment:

After the first round of review - looks cool!

@UlyanaAndrukhiv marked this pull request as ready for review on November 28, 2024 at 13:48.
// sendPing sends a periodic ping message to the WebSocket client to keep the connection alive.
//
// No errors are expected during normal operation.
func (c *Controller) sendPing() error {
Contributor:

What does this abstraction do? Can't we get rid of it and use this code directly in the keep-alive routine?

@@ -0,0 +1,57 @@
package websockets
Contributor:

Why has this file name been changed to connections?

if err := c.sendPing(); err != nil {
// Log error and exit the loop on failure
c.logger.Error().Err(err).Msg("failed to send ping")
return err
@illia-malachyn (Contributor) commented Nov 28, 2024:

We should stop keep-alive only if a CloseErr was sent to the connection. However, I guess I will handle it in #6642, as it will be clearer by then.

// This value must be less than pongWait.
PingPeriod = (PongWait * 9) / 10

// PongWait specifies the maximum time to wait for a pong message from the peer.
Contributor:

is this accurate?

Suggested change
// PongWait specifies the maximum time to wait for a pong message from the peer.
// PongWait specifies the maximum time to wait for a pong response message from the peer
// after sending a ping

"github.com/onflow/flow-go/utils/concurrentmap"
)

const (
// PingPeriod defines the interval at which ping messages are sent to the client.
// This value must be less than pongWait.
Contributor:

Why less? Intuitively, I would have thought it would need to be larger. Can you elaborate more in this comment?

@illia-malachyn (Contributor) commented Nov 29, 2024:

I guess Ulyana took it from here https://github.com/gorilla/websocket/blob/v1.5.3/examples/chat/client.go#L23.
I believe it's because ping and pong share the same timer.

Let’s consider a case where pongWait is smaller than pingPeriod, and we’ll see why this configuration is problematic.

Parameters:
pongWait = 30s
pingPeriod = 40s

At t=0:
The server sends a ping message to the client.

At t=30s:
The pongWait expires because the server hasn't received a pong (or any message) from the client.
The server assumes the connection is dead and closes it.

At t=40s:
The server sends its second ping, but the connection is already closed due to the timeout at t=30s.

Contributor:

yea, but in that case, the server should have cleaned up the ping service when the connection was closed, so the second ping would never happen

// PongWait specifies the maximum time to wait for a pong message from the peer.
PongWait = 10 * time.Second

// WriteWait specifies the maximum duration allowed to write a message to the peer.
Contributor:

There is a good explanation of what this means in the code. Can you elaborate some more here? Mostly, readers will look to the definitions to understand how to set/modify the values.

select {
case err, ok := <-c.errorChannel:
if !ok {
c.logger.Error().Msg("error channel closed")
Contributor:

is this possible? It looks like the channel is only closed in a defer

c.writeMessagesToClient(ctx)

// for track all goroutines and error handling
var wg sync.WaitGroup
Contributor:

What do you think about using errgroup here instead? It handles both the goroutine lifecycle and error passing, and has the added benefit that if an error is returned from any goroutine, the shared context is canceled.

It would look something like:

g, ctx := errgroup.WithContext(ctx)

g.Go(func() error {
	return c.readMessagesFromClient(ctx)
})

...

err := g.Wait()
if err != nil {
	c.shutdownConnection()
}

Contributor:

I'd leave this decision to #6642, as the error handling and the way the routines are started might change.
The goal of this PR is to introduce keep-alive.

Contributor:

why wait? Is this code being introduced in another PR?

defer func(conn *websocket.Conn) {
if err := c.conn.Close(); err != nil {
c.logger.Error().Err(err).Msg("error closing connection")
c.shutdownOnce.Do(func() {
Contributor:

sync.Once has the added functionality that it will block all other callers until the first completes. Is that desired here? If not, you can just use an atomic bool with compare-and-swap.
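
A minimal sketch of the compare-and-swap alternative mentioned above (the controller type and field names here are illustrative, not taken from this PR):

```go
package wsexample

import "sync/atomic"

type controller struct {
	shutdownStarted atomic.Bool
}

// shutdownConnection runs the shutdown logic at most once. Unlike sync.Once,
// a caller that loses the CompareAndSwap returns immediately instead of
// blocking until the first shutdown call finishes.
func (c *controller) shutdownConnection() {
	if !c.shutdownStarted.CompareAndSwap(false, true) {
		return
	}
	// ... close the connection, stop data providers, etc.
}
```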

if err := c.conn.Close(); err != nil {
c.logger.Error().Err(err).Msg("error closing connection")
}
close(c.communicationChannel)
Contributor:

It's not safe to close this here, because the data providers could continue to write new messages, causing a panic.

Going back to the original design, we had said that data providers would get a callback function they could call to put new data on this queue. That would allow the controller to decide when to stop accepting new messages and to close this channel safely.

Contributor:

There's code below that stops every provider. Since this function call is deferred, all data providers will be stopped and "unregistered" by the time this code executes.

Contributor:

The code calls Close(), which simply calls the cancel() function; there's no guarantee the providers have actually stopped.
https://github.com/onflow/flow-go/pull/6636/files#diff-5cbca3503bb00261318db4a8b8b1714447f7348a4d87e11aaf8b37b43b36e2bbR49-R52
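
A rough sketch of the callback-based design discussed above, with hypothetical names: providers never touch the channel directly, they receive a callback that refuses messages once shutdown starts, and the controller closes the channel only after every provider has returned.

```go
package wsexample

import (
	"context"
	"sync"
)

type controller struct {
	communicationChannel chan any
	providers            sync.WaitGroup
}

// sendCallback is handed to each data provider instead of the raw channel.
func (c *controller) sendCallback(ctx context.Context) func(msg any) bool {
	return func(msg any) bool {
		select {
		case <-ctx.Done():
			return false // shutdown in progress, message dropped
		case c.communicationChannel <- msg:
			return true
		}
	}
}

// shutdown stops accepting new messages, waits for every provider to return,
// and only then closes the channel, so no provider can write to a closed channel.
func (c *controller) shutdown(cancel context.CancelFunc) {
	cancel()
	c.providers.Wait()
	close(c.communicationChannel)
}
```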

for {
select {
case <-ctx.Done():
return
case msg := <-c.communicationChannel:
return ctx.Err()
Contributor:

When the controller is updated so that c.communicationChannel can be closed when shutdown is triggered, you can remove this check and instead rely on the channel closing to signal shutdown.

As it is now, if this returns here, the data providers may hang waiting to push onto the queue.
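
A sketch of that pattern with hypothetical names (writeMessage stands in for however the controller serializes a message to the connection): once the channel is closed after all providers have stopped, the loop drains the remaining messages and exits on its own, with no separate ctx.Done() check.

```go
func (c *controller) writeMessagesToClient() error {
	for msg := range c.communicationChannel {
		if err := c.writeMessage(msg); err != nil {
			return err // write to the client failed; caller triggers shutdown
		}
	}
	// channel closed: shutdown was requested and all providers have finished
	return nil
}
```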

case <-pingTicker.C:
if err := c.sendPing(); err != nil {
// Log error and exit the loop on failure
c.logger.Error().Err(err).Msg("failed to send ping")
Contributor:

This will get noisy in the logs at error level

Suggested change
c.logger.Error().Err(err).Msg("failed to send ping")
c.logger.Debug().Err(err).Msg("failed to send ping")

PingPeriod = (PongWait * 9) / 10

// PongWait specifies the maximum time to wait for a pong message from the peer.
PongWait = 10 * time.Second
Contributor:

Wouldn't it be better to place it in the websockets.Config type?

err := process(ctx)
if err != nil {
// Check if shutdown has already been called, to avoid multiple shutdowns
if c.shutdown {
@illia-malachyn (Contributor) commented Nov 29, 2024:

This is a data race, isn't it? I'm thinking of the following situation:

  • One of the processes crashes with some error without calling shutdown (e.g. the keepalive routine).
  • So we reach this code and read c.shutdown on this line.
  • Simultaneously, another process (e.g. the reader routine) calls shutdown and writes c.shutdown concurrently.

If we need this, we have to use an atomic variable here. However, I don't understand why we need it; can you elaborate?
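
For illustration, if shutdownConnection guards itself with an atomic flag (as in the compare-and-swap sketch earlier), the plain c.shutdown check here becomes unnecessary and the race disappears. This is a hypothetical variant, not code from the PR:

```go
if err := process(ctx); err != nil {
	// safe to call from any goroutine: the atomic CompareAndSwap inside
	// shutdownConnection ensures the shutdown logic runs at most once
	c.shutdownConnection()
}
```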

5 participants