BCI-2525: check all responses on transaction submission #11599
Conversation
I see that you haven't updated any README files. Would it make sense to do so?
This is not safe. Send-only nodes might return OK even if the transaction can never be broadcast; e.g. I believe our own internal broadcast node returns OK for everything, even impossible transactions. We must verify a response from a real Ethereum node.
What you could try, maybe, is to check the responses of the primary nodes and use those. It's unsafe to treat responses from send-only nodes as anything conclusive, though; they are "fire and forget".
I also have a concern that some SendOnlyNodes may just forward the tx to a real RPC and blindly return success, so their response may not be reliable. I can't be sure, but I suspect that.
Also, if two nodes return incompatible results, for example one returns Success and one returns Fatal, then we ought to log critical errors for that case.
Can we change the overall logic to something like this (a rough sketch follows the list):
- Wait until we have responses from at least 70% of nodes. (Not waiting for 100% of nodes lets us skip really slow nodes that would likely time out.)
- Save the response category for each node.
- If we have even one success, return success.
- Before returning, do a validation check: if there is a success, then there shouldn't be any Fatal (or ExceedsMaxFee/FeeOutOfValidRange/Underpriced). If there is, log critical errors and allow this error to surface in health monitors; there is likely a deeper misconfiguration or a bug somewhere.
- If any error was of Unknown type, also log a critical error. We should never get this.
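For illustration, a rough sketch of that aggregation might look like the following. This is not the PR's implementation: sendTxResult, SendTxReturnCode and the individual result codes are assumed to come from the surrounding multi_node.go package, and imports/generic parameters are omitted as in the drafts later in this thread.

// collectTxResults drains results until roughly 70% of the nodes have replied,
// then classifies the outcome as proposed above. Hypothetical sketch only.
func collectTxResults(results <-chan sendTxResult, totalNodes int) error {
	// ceil(totalNodes * 0.7): wait for roughly 70% of the nodes (the quorum idea above)
	required := (totalNodes*7 + 9) / 10

	resultsByCode := map[SendTxReturnCode][]error{}
	for i := 0; i < required; i++ {
		r, ok := <-results
		if !ok {
			break // every node has already replied
		}
		resultsByCode[r.ResultCode] = append(resultsByCode[r.ResultCode], r.Err)
	}

	successes := len(resultsByCode[Successful])
	severe := len(resultsByCode[Fatal]) + len(resultsByCode[ExceedsMaxFee]) +
		len(resultsByCode[FeeOutOfValidRange]) + len(resultsByCode[Underpriced])

	// Invariant checks: success and a severe error should never coexist, and
	// Unknown should never appear; both cases deserve a critical log / metric.
	if (successes > 0 && severe > 0) || len(resultsByCode[Unknown]) > 0 {
		// log a critical error here so health monitors surface the misconfiguration
	}

	if successes > 0 {
		return nil // at least one node accepted the transaction
	}
	// otherwise return one representative error
	for _, errs := range resultsByCode {
		if len(errs) > 0 && errs[0] != nil {
			return errs[0]
		}
	}
	return nil
}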
common/client/multi_node.go
Outdated
}(n)
})
if !ok {
c.lggr.Debug("Cannot send transaction on sendonly node; MultiNode is stopped", "node", n.String())
c.lggr.Debugw("Cannot send transaction on node; MultiNode is stopped", "node", n.String())
return fmt.Errorf("MulltiNode is stopped: %w", context.Canceled)
nice catch!
common/client/multi_node.go
Outdated
sendOnlyError := c.sendOnlyErrorParser(txErr)
if sendOnlyError != Successful {
c.lggr.Debugw("Node sent transaction", "name", n.String(), "tx", tx, "err", txErr)
isSuccess := c.sendOnlyErrorParser(txErr) == Successful
nit: the name sendOnlyErrorParser is misleading. Can you please rename it to sendTxErrorParser?
I think it's important to keep the sendOnly prefix, as it emphasizes the distinction between the standard ClassifySendError function and ClassifySendOnlyError.
common/client/multi_node.go
Outdated
c.lggr.Warnw("RPC returned error", "name", n.String(), "tx", tx, "err", txErr)
}

if isSuccess || n == main {
I don't really like that we will still return an error as soon as the main node says so, without waiting for the other nodes.
Could we wait until the other nodes have had a chance to return success?
I don't think it's a huge concern if that takes a long time. Or maybe we could use a heuristic, e.g. if the main node has errored, wait until at least 50% of nodes have also errored before returning that error code.
Given that we can't rely on send-only nodes at all, since their response can be arbitrary, do we really need to optimize this logic? Do we have any indication that this is currently slow and needs to be optimized? I'd rather have "dumb" code that is 95% as fast as super complicated optimized code.
…of github.com:smartcontractkit/chainlink into feature/BCI-2525-send-transaction-check-all-responses
Looks very solid!
common/client/multi_node.go
Outdated
c.lggr.Criticalw(errMsg, "tx", tx, "resultsByCode", resultsByCode)
err := fmt.Errorf(errMsg)
c.SvcErrBuffer.Append(err)
return severeErrors[0]
I think we should return success here, since at least one node has accepted our broadcast tx, and thus it can now be included on-chain. We shouldn't throw away this signal.
Regarding manual intervention, you are already logging a critical log, which implies someone should be investigating it manually.
In the future, when we create a proper health dashboard, this error should flow into that dashboard as a critical issue, with all details.
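A hedged sketch of that suggestion, reusing severeErrors, resultsByCode, c.lggr and c.SvcErrBuffer from the diff above; the successResults slice is an assumption, collected alongside the severe errors:

if len(successResults) > 0 {
	if len(severeErrors) > 0 {
		// contradiction between nodes: report it loudly, but do not discard the success
		err := fmt.Errorf("found contradictions in nodes replies on SendTransaction: got success and severe error")
		c.lggr.Criticalw(err.Error(), "tx", tx, "resultsByCode", resultsByCode)
		c.SvcErrBuffer.Append(err)
	}
	return nil // the tx was accepted by at least one node and may be included on-chain
}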
Consider supplementing this with a prom metric (or adding prom metrics to this area more generally), since that is more likely to be alerted on and actioned than a critical log.
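A minimal sketch of such a metric, assuming the standard Prometheus Go client; the metric name and labels here are illustrative, not the ones that ended up in the PR:

package client

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// PromMultiNodeInvariantViolations counts contradictory SendTransaction replies,
// e.g. one node reporting Successful while another reports Fatal.
var PromMultiNodeInvariantViolations = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "multi_node_invariant_violations",
	Help: "The number of invariant violations detected on transaction submission",
}, []string{"network", "chainId", "invariant"})

Incrementing it next to the Criticalw call, e.g. PromMultiNodeInvariantViolations.WithLabelValues(network, chainID, "SendTransaction").Inc(), would make the violation alertable rather than only visible in logs.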
@@ -73,7 +73,10 @@ func NewChainClient(
chainID,
chainType,
"EVM",
ClassifySendOnlyError,
func(tx *types.Transaction, err error) commonclient.SendTxReturnCode {
return ClassifySendError(err, logger.Sugared(logger.Nop()), tx, common.Address{}, chainType.IsL2())
Why not pass the existing logger param instead of Nop?
Passing the proper logger would result in partial logs that duplicate existing ones. ClassifySendError is called inside SendTransaction, which is called by SendTransactionReturnCode, and the latter performs logging on its own. Additionally, ClassifySendError called from SendTransaction does not have access to the address of the source account.
common/client/multi_node.go
Outdated
go func() {
	wg.Wait()
	c.wg.Done()
	close(inTxResults)
Better if you call close before c.wg.Done(), since Close() shouldn't proceed unless this routine has fully returned.
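In other words, a sketch of the suggested ordering:

go func() {
	wg.Wait()          // all broadcasting goroutines have finished
	close(inTxResults) // close first so downstream consumers can drain and exit
	c.wg.Done()        // only then unblock MultiNode's Close()
}()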
common/client/multi_node.go
Outdated
}

txResultsToReport := make(chan sendTxResult, len(c.nodes))
go fanOut(inTxResults, txResultsToReport, txResults)
What if broadcastTxAsync is already over at this point and someone calls Close()? Would we have a leak?
fanOut is not leaking, because if broadcastTxAsync is already over, inTxResults is closed. Both txResultsToReport and txResults are sufficiently buffered, so we won't get stuck even if no one is reading the results.
Also, Close cannot complete before fanOut is done, as reportSendTxAnomalies, which is protected by a wait group, waits for txResultsToReport to be closed.
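For reference, a fan-out helper of the shape being discussed could look like this (a sketch; it assumes the destination channels are buffered to len(c.nodes), as in the diff above, so the sends never block indefinitely):

// fanOut copies every value received from source into each destination and
// closes all destinations once source is closed.
func fanOut(source <-chan sendTxResult, destinations ...chan<- sendTxResult) {
	for result := range source {
		for _, destination := range destinations {
			destination <- result
		}
	}
	for _, destination := range destinations {
		close(destination)
	}
}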
I'm not suggesting that it will get stuck; I'm saying there is a rare case where the MultiNode can close before the fanOut method returns. The fanOut method will still finish eventually, perhaps almost instantly after the MultiNode close, but theoretically you can have a scenario where the parent routine exits before the child routine, which might occasionally get picked up by tests. So perhaps it would be better if we do c.wg.Add(2) at the beginning.
Yes, that makes sense. It might be better to make it more explicit.
@dhaidashenko another approach I drafted could look like this:
ok := c.IfNotStopped(func() {
	c.wg.Add(len(c.sendonlys))
	// fire-n-forget, as sendOnlyNodes can not be trusted with result reporting
	for _, n := range c.sendonlys {
		go func(n SendOnlyNode[CHAIN_ID, RPC_CLIENT]) {
			defer c.wg.Done()
			c.broadcastTx(ctx, n, tx)
		}(n)
	}

	// signal when all the primary nodes done broadcasting tx
	txResultsToReport := make(chan sendTxResult, len(c.nodes))
	c.wg.Add(1)

	var wg sync.WaitGroup
	wg.Add(len(c.nodes))
	for _, n := range c.nodes {
		go func(n SendOnlyNode[CHAIN_ID, RPC_CLIENT]) {
			defer wg.Done()
			resultCode, txErr := c.broadcastTx(ctx, n, tx)
			txResults <- sendTxResult{Err: txErr, ResultCode: resultCode}
			txResultsToReport <- sendTxResult{Err: txErr, ResultCode: resultCode}
		}(n)
	}

	go func() {
		wg.Wait()
		close(txResults)
		close(txResultsToReport)
		c.wg.Done()
	}()

	c.wg.Add(1)
	go c.reportSendTxAnomalies(tx, txResultsToReport)
})
and broadcastTx would simply return the results without the channels. This way we remove the complexity from fanOut. If what you've written seems more comprehensive, I'm fine either way.
common/client/multi_node.go
Outdated
txErr := n.RPC().SendTransaction(ctx, tx)
c.lggr.Debugw("Node sent transaction", "name", n.String(), "tx", tx, "err", txErr)
resultCode := c.classifySendTxError(tx, txErr)
if resultCode != Successful && resultCode != TransactionAlreadyKnown {
Since you just added the new sendTxSuccessfulCodes list, this check should simply verify whether resultCode is in that list.
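Something along these lines (a sketch; assumes Go 1.21's slices package or golang.org/x/exp/slices, and that sendTxSuccessfulCodes holds Successful, TransactionAlreadyKnown, etc.):

if !slices.Contains(sendTxSuccessfulCodes, resultCode) {
	c.lggr.Warnw("RPC returned error", "name", n.String(), "tx", tx, "err", txErr)
}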
})
t.Run("Fails when closed on sendonly broadcast", func(t *testing.T) {
t.Run("Returns success without waiting for the rest of the nodes", func(t *testing.T) {
Could you ensure that we call mn.Close() at the end of all tests? This will help verify that the MultiNode's wait groups have all finished gracefully, especially with more dependencies now on the wg field.
All of the subtests explicitly close the MultiNode at the end of the case, or we do it on cleanup:
newStartedMultiNode := func(t *testing.T, opts multiNodeOpts) testMultiNode {
	mn := newTestMultiNode(t, opts)
	err := mn.StartOnce("startedTestMultiNode", func() error { return nil })
	require.NoError(t, err)
	t.Cleanup(func() {
		require.NoError(t, mn.Close())
	})
	return mn
}
return err, err
}

const sendTxQuorum = 0.7
Could you add a comment explaining this const?
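One way the requested comment could read (the wording is a suggestion, not the merged text):

// sendTxQuorum is the fraction of primary-node responses SendTransaction waits
// for before giving up on the remaining (likely slow or unhealthy) nodes;
// 0.7 reflects the "wait for ~70% of nodes" heuristic discussed in this review.
const sendTxQuorum = 0.7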
* sendTx: signal success if one of the nodes accepted transaction
* fix logger
* fix merge
* fix race
* fixed multinode tests race
* improve test coverage
* WIP: wait for 70% of nodes to reply on send TX
* tests
* Report invariant violation via prom metrics
* fixed sendTx tests
* address comments
* polish PR
* Describe implementation details in the comment to SendTransaction
* nit fixes
* more fixes
* use softTimeOut default value
* nit fix
* ensure all goroutines are done before Close
* refactor broadcast
* use sendTxSuccessfulCodes slice to identify if result is successful

Co-authored-by: Prashant Yadav <[email protected]>
Address slow SendTransaction execution and unnecessary retries caused by a slow or unhealthy active node.