BCI-2525: check all responses on transaction submission #11599

Conversation

Collaborator

@dhaidashenko dhaidashenko commented Dec 18, 2023

Address slow SendTransaction execution and unnecessary retries caused by a slow or unhealthy active node.

Contributor

I see that you haven't updated any README files. Would it make sense to do so?

@dhaidashenko dhaidashenko marked this pull request as ready for review December 19, 2023 16:35
jmank88 previously approved these changes Dec 19, 2023
Collaborator

@samsondav samsondav left a comment

This is not safe. Send-only nodes might return OK even if the transaction can never be broadcast; e.g., I believe our own internal broadcast node returns OK for everything, even impossible transactions.

We must verify a response from a real Ethereum node.

@samsondav
Collaborator

What you could try, maybe, is to check the responses of primary nodes and use those. It's unsafe to treat responses from send-only nodes as anything conclusive, though; they are "fire and forget".

Contributor

@prashantkumar1982 prashantkumar1982 left a comment

I also have a concern that some SendOnlyNodes may just forward the Tx to a real RPC and blindly return success, so their responses may not be reliable.
I can't be sure, but I suspect that.

Also, if two nodes return incompatible results, for example one returns Success and one returns Fatal, then we ought to log a critical error for that case.

Can we think of changing the overall logic to this (a rough sketch follows the list):

  • Wait until we have responses from at least 70% of nodes. (Not waiting for 100% of nodes lets us skip really slow nodes, which would likely time out anyway.)
  • Save the response category for each node.
  • If we have even one success, we return success.
  • Before returning, we do a validation check: if there's one success, then there shouldn't be any Fatal (or ExceedsMaxFee/FeeOutOfValidRange/Underpriced) results. If there are, we log a critical error and allow it to surface in health monitors; there's likely a deeper misconfiguration or bug somewhere.
  • If any error was of Unknown type, also log a critical error. We should never get this.
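For illustration, a minimal standalone Go sketch of that flow; the SendTxReturnCode values and every name here are assumptions made for the example, not the actual MultiNode API:

import (
	"fmt"
	"log"
	"math"
)

type SendTxReturnCode int

const (
	Successful SendTxReturnCode = iota
	Fatal
	ExceedsMaxFee
	FeeOutOfValidRange
	Underpriced
	Unknown
)

// collectResults waits for results from at least quorum (e.g. 0.7) of the
// total nodes, tallies them, then applies the validation rules listed above.
func collectResults(results <-chan SendTxReturnCode, total int, quorum float64) error {
	required := int(math.Ceil(float64(total) * quorum))
	counts := make(map[SendTxReturnCode]int)
	for i := 0; i < required; i++ {
		counts[<-results]++ // blocks until enough nodes have replied
	}
	severe := counts[Fatal] + counts[ExceedsMaxFee] + counts[FeeOutOfValidRange] + counts[Underpriced]
	if counts[Successful] > 0 && severe > 0 {
		// contradictory results point at a deeper misconfiguration or bug
		log.Printf("critical: contradictory send-tx results: %v", counts)
	}
	if counts[Unknown] > 0 {
		log.Printf("critical: unexpected Unknown send-tx result: %v", counts)
	}
	if counts[Successful] > 0 {
		return nil // even a single success means the tx can be included on-chain
	}
	return fmt.Errorf("transaction rejected by all %d nodes polled", required)
}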

}(n)
})
if !ok {
c.lggr.Debug("Cannot send transaction on sendonly node; MultiNode is stopped", "node", n.String())
c.lggr.Debugw("Cannot send transaction on node; MultiNode is stopped", "node", n.String())
return fmt.Errorf("MulltiNode is stopped: %w", context.Canceled)
Contributor

nice catch!

sendOnlyError := c.sendOnlyErrorParser(txErr)
if sendOnlyError != Successful {
c.lggr.Debugw("Node sent transaction", "name", n.String(), "tx", tx, "err", txErr)
isSuccess := c.sendOnlyErrorParser(txErr) == Successful
Contributor

nit: The name sendOnlyErrorParser is misleading.
Can you please rename it to sendTxErrorParser?

Collaborator

I think it's important to keep the sendOnly prefix, as it emphasizes the distinction between the standard ClassifySendError function and ClassifySendOnlyError.

c.lggr.Warnw("RPC returned error", "name", n.String(), "tx", tx, "err", txErr)
}

if isSuccess || n == main {
Contributor

I don't really like that we will still return an error as soon as the main node says so, without waiting for other nodes.
Could we wait until other nodes have had a chance to return success?
I don't think it's a huge concern if that takes a long time. Or maybe we could use a heuristic like: if the main node has errored, then wait until at least 50% of the other nodes have also errored before returning that error code (see the sketch below).
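A rough sketch of that heuristic; all names are hypothetical:

// holdMainError delays returning the main node's error until at least half of
// the other nodes have also errored, so a slower node that succeeds can still
// win. results carries one error (nil on success) per remaining node.
func holdMainError(mainErr error, results <-chan error, others int) error {
	errored := 0
	for errored*2 < others {
		if err := <-results; err == nil {
			return nil // another node accepted the tx; prefer that signal
		}
		errored++
	}
	return mainErr
}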

@dimriou
Collaborator

dimriou commented Dec 20, 2023

> I also have a concern that some SendOnlyNodes may just forward the Tx to a real RPC and blindly return success, so their responses may not be reliable. I can't be sure, but I suspect that.
>
> Also, if two nodes return incompatible results, for example one returns Success and one returns Fatal, then we ought to log a critical error for that case.
>
> Can we think of changing the overall logic to this:
>
>   • Wait until we have responses from at least 70% of nodes. (Not waiting for 100% of nodes lets us skip really slow nodes, which would likely time out anyway.)
>   • Save the response category for each node.
>   • If we have even one success, we return success.
>   • Before returning, we do a validation check: if there's one success, then there shouldn't be any Fatal (or ExceedsMaxFee/FeeOutOfValidRange/Underpriced) results. If there are, we log a critical error and allow it to surface in health monitors; there's likely a deeper misconfiguration or bug somewhere.
>   • If any error was of Unknown type, also log a critical error. We should never get this.

Given that we can't rely on send-only nodes at all, since their responses can be arbitrary, do we really need to optimize this logic? Do we have any indication that it is currently slow and needs to be optimized? I'd rather have "dumb" code that is 95% as fast as super complicated optimized code.

@dhaidashenko dhaidashenko marked this pull request as draft December 21, 2023 17:34
…of github.com:smartcontractkit/chainlink into feature/BCI-2525-send-transaction-check-all-responses
@dhaidashenko dhaidashenko marked this pull request as ready for review December 22, 2023 19:02
Contributor

@prashantkumar1982 prashantkumar1982 left a comment

Looks very solid!

c.lggr.Criticalw(errMsg, "tx", tx, "resultsByCode", resultsByCode)
err := fmt.Errorf(errMsg)
c.SvcErrBuffer.Append(err)
return severeErrors[0]
Contributor

I think we should return success here, since at least one node has accepted our broadcast Tx, and thus it can now be included on-chain.
We shouldn't throw away this signal.

Regarding manual intervention: you are already logging a critical log, which implies someone should be investigating it manually.
In the future, when we create a proper health dashboard, this error should flow into that dashboard as a critical issue, with all the details.

Collaborator

Consider supplementing this with a prom metric (or adding prom metrics here more generally), since that is more likely to get alerted on and actioned than a critical log.
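For example, a counter along these lines, assuming prometheus/client_golang; the metric and label names are illustrative, not what the PR actually registers:

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// counts invariant violations observed during SendTransaction so they can be
// alerted on, instead of relying on someone reading critical logs
var promSendTxInvariantViolations = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "multi_node_send_tx_invariant_violations",
	Help: "Counts invariant violations detected while broadcasting a transaction",
}, []string{"chainID", "invariant"})

// at the site of the critical log, something like:
// promSendTxInvariantViolations.WithLabelValues(chainID.String(), "contradictory_results").Inc()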

@@ -73,7 +73,10 @@ func NewChainClient(
chainID,
chainType,
"EVM",
ClassifySendOnlyError,
func(tx *types.Transaction, err error) commonclient.SendTxReturnCode {
return ClassifySendError(err, logger.Sugared(logger.Nop()), tx, common.Address{}, chainType.IsL2())
Collaborator

Why not pass the existing logger param instead of Nop?

Collaborator Author

Passing the proper logger would result in partial logs that duplicate the proper ones:
ClassifySendError is called inside SendTransaction, which is called by SendTransactionReturnCode, and the latter performs logging on its own.
Additionally, ClassifySendError called from SendTransaction does not have access to the address of the source account.

go func() {
wg.Wait()
c.wg.Done()
close(inTxResults)
Collaborator

Better to call close before c.wg.Done(), since Close() shouldn't proceed until this goroutine has fully returned.
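That is, swap the two calls (a sketch of the suggested ordering):

go func() {
	wg.Wait()
	close(inTxResults) // signal completion to readers first...
	c.wg.Done()        // ...then release Close(), which waits on c.wg
}()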

}

txResultsToReport := make(chan sendTxResult, len(c.nodes))
go fanOut(inTxResults, txResultsToReport, txResults)
Collaborator

What if broadcastTxAsync is already over at this point and someone calls Close()? Would we have a leak?

Collaborator Author

fanOut is not leaking, because if broadcastTxAsync is already over, inTxResults is closed. Both txResultsToReport and txResults are sufficiently buffered, so we won't get stuck even if no one is reading the results.
Also, Close cannot complete before fanOut is done, as reportSendTxAnomalies, which is protected by a wait group, waits for txResultsToReport to be closed.
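For reference, a generic fan-out along these lines; a sketch only, the PR's actual helper may differ:

// fanOut copies every value from in to each of the outs, then closes the outs.
// It returns once in is closed; with sufficiently buffered outs it never
// blocks, even if nobody reads the results.
func fanOut[T any](in <-chan T, outs ...chan<- T) {
	for value := range in {
		for _, out := range outs {
			out <- value
		}
	}
	for _, out := range outs {
		close(out)
	}
}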

Collaborator

I'm not suggesting that it will get stuck; I'm saying there is a rare case where the MultiNode can close before the fanOut method returns. The fanOut method will still close eventually, perhaps almost instantly after the MultiNode closes, but theoretically you can have a scenario where the parent goroutine exits before the child goroutine, which might get picked up by random test runs, so perhaps it would be better to do c.wg.Add(2) at the beginning.

Collaborator Author

@dhaidashenko dhaidashenko Jan 16, 2024

Yes, makes sense. It might be better to make it more explicit.
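For instance, a sketch reusing the names from the hunks above; the exact wiring is assumed:

c.wg.Add(2) // reserve both up front: the closer goroutine and fanOut
go func() {
	defer c.wg.Done()
	wg.Wait()
	close(inTxResults)
}()
go func() {
	defer c.wg.Done()
	fanOut(inTxResults, txResultsToReport, txResults)
}()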

Collaborator

@dimriou dimriou left a comment

@dhaidashenko another approach I drafted could look like this:

ok := c.IfNotStopped(func() {
	c.wg.Add(len(c.sendonlys))
	// fire-n-forget, as sendOnlyNodes can not be trusted with result reporting
	for _, n := range c.sendonlys {
		go func(n SendOnlyNode[CHAIN_ID, RPC_CLIENT]) {
			defer c.wg.Done()
			c.broadcastTx(ctx, n, tx)
		}(n)
	}

	// signal when all the primary nodes done broadcasting tx
	txResultsToReport := make(chan sendTxResult, len(c.nodes))
	c.wg.Add(1)
	var wg sync.WaitGroup
	wg.Add(len(c.nodes))
	for _, n := range c.nodes {
		go func(n Node[CHAIN_ID, RPC_CLIENT]) {
			defer wg.Done()
			resultCode, txErr := c.broadcastTx(ctx, n, tx)

			txResults <- sendTxResult{Err: txErr, ResultCode: resultCode}
			txResultsToReport <- sendTxResult{Err: txErr, ResultCode: resultCode}
		}(n)
	}
	go func() {
		wg.Wait()
		close(txResults)
		close(txResultsToReport)
		c.wg.Done()
	}()

	c.wg.Add(1)
	go c.reportSendTxAnomalies(tx, txResultsToReport)
})

and broadcastTx will simply return the results without the channels. This way we remove the complexity from fanOut. If what you've written seems more comprehensive, I'm fine either way.

txErr := n.RPC().SendTransaction(ctx, tx)
c.lggr.Debugw("Node sent transaction", "name", n.String(), "tx", tx, "err", txErr)
resultCode := c.classifySendTxError(tx, txErr)
if resultCode != Successful && resultCode != TransactionAlreadyKnown {
Contributor

Since you just added the new list sendTxSuccessfulCodes, this check should just change to checking whether resultCode is in that list (see the sketch below).
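For example, a sketch in the context of the hunk above; the slice's contents are assumed, and slices is the Go 1.21+ standard-library package:

import "slices"

// assumed contents of the new list referenced above
var sendTxSuccessfulCodes = []SendTxReturnCode{Successful, TransactionAlreadyKnown}

if !slices.Contains(sendTxSuccessfulCodes, resultCode) {
	c.lggr.Warnw("RPC returned error", "name", n.String(), "tx", tx, "err", txErr)
}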

})
t.Run("Fails when closed on sendonly broadcast", func(t *testing.T) {
t.Run("Returns success without waiting for the rest of the nodes", func(t *testing.T) {
Contributor

Could you ensure that we call mn.Close() at the end of all tests?
This will help verify that the MultiNode's waitgroups have all ended gracefully, especially now that more code depends on the wg field.

Collaborator Author

All of the subtests explicitly close the MultiNode at the end of the case, or we do it on cleanup:

newStartedMultiNode := func(t *testing.T, opts multiNodeOpts) testMultiNode {
		mn := newTestMultiNode(t, opts)
		err := mn.StartOnce("startedTestMultiNode", func() error { return nil })
		require.NoError(t, err)
		t.Cleanup(func() {
			require.NoError(t, mn.Close())
		})
		return mn
	}

return err, err
}

const sendTxQuorum = 0.7
Collaborator

comment on this const?
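One possible wording, inferring the intent from the discussion above:

// sendTxQuorum is the fraction of nodes whose responses we wait for before
// classifying the outcome of a SendTransaction call; not waiting for the
// slowest ~30% keeps one unhealthy node from stalling every broadcast.
const sendTxQuorum = 0.7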

@prashantkumar1982 prashantkumar1982 added this pull request to the merge queue Feb 9, 2024
Merged via the queue into develop with commit 556a4f3 Feb 9, 2024
93 checks passed
@prashantkumar1982 prashantkumar1982 deleted the feature/BCI-2525-send-transaction-check-all-responses branch February 9, 2024 23:18
asoliman92 pushed a commit that referenced this pull request Jul 31, 2024
* sendTx: signal success if one of the nodes accepted transaction

* fix logger

* fix merge

* fix race

* fixed multinode tests race

* improve test coverage

* WIP: wait for 70% of nodes to reply on send TX

* tests

* Report invariant violation via prom metrics

* fixed sendTx tests

* address comments

* polish PR

* Describe implementation details in the comment to SendTransaction

* nit fixes

* more fixes

* use softTimeOut default value

* nit fix

* ensure all goroutines are done before Close

* refactor broadcast

* use sendTxSuccessfulCodes slice to identify if result is successful

---------

Co-authored-by: Prashant Yadav <[email protected]>