Reapply #8644 #9242
base: master
Conversation
force-pushed from 6065746 to 27144ba
Waiting to push fix commit until CI completes for the reapplication.
force-pushed from 7a40c4a to 1e7b192
Looks like there are still a couple of itests failing. Will keep working on this next week.
The error message
This looks relevant, re some of the errors I see in the latest CI run: https://stackoverflow.com/a/42303225
Perhaps part of the issue is with the
Based on the SO link above, we might also be lacking some needed indexes.
With closing the channel and a couple of other tests, I'm seeing logs similar to:
when I reproduce locally, as well as in the CI logs. I'm going to pull on that thread first... On the test config side, also seeing these:
I think the first issue above is a code issue, the second is a config issue, and together with the other config issue in my comment above, those are the three major failures still happening. I think the
This looks like a case where we
Yep, looking into why that isn't caught by the panic/recover mechanism.
It was actually a lack of error checking in
force-pushed from 1e67a84 to 899ae59
Looks better as far as the errors on closing channels. Will keep working tomorrow to eliminate the other errors.
Hmm, so we don't have great visibility into how much memory these CI machines have. Perhaps we need to modify the connection settings to reduce the number of active connections, and also tune other params. @djkazic has been working on a postgres+lnd tuning/perf guide that I think we can eventually check directly into lnd.
This is also very funky: lnd/kvdb/sqlbase/readwrite_bucket.go Lines 336 to 363 in e3cc4d7
We do two queries just to delete: a select to see if the key exists, then the delete, instead of just attempting the delete. Stepping back a minute: perhaps the issue is with this flawed KV abstraction we have. Perhaps we should just re-create a better hierarchical KV table from scratch. We use
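As a rough illustration of the "just try to delete" idea, a minimal sketch in Go (the kv table, parent_id/key columns, and deleteKey helper are made up for the example, not lnd's actual sqlbase schema):

```go
package kvsketch

import (
	"context"
	"database/sql"
)

// deleteKey removes a key in a single statement instead of doing a
// SELECT-then-DELETE round trip. Deleting a key that doesn't exist is
// simply a no-op, which matches bucket delete semantics.
func deleteKey(ctx context.Context, tx *sql.Tx, parentID int64, key []byte) error {
	_, err := tx.ExecContext(ctx,
		"DELETE FROM kv WHERE parent_id = $1 AND key = $2",
		parentID, key,
	)
	return err
}
```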
Here's another instance of duplicated work in lnd/kvdb/sqlbase/readwrite_bucket.go Lines 149 to 187 in e3cc4d7
We select to see if it exists, then potentially do the insert again. Instead, we can just do an
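For comparison, a hedged sketch of the single-statement upsert (again with an illustrative kv table; it assumes a unique constraint on (parent_id, key) for ON CONFLICT to target):

```go
package kvsketch

import (
	"context"
	"database/sql"
)

// putKey inserts the key or updates its value in one statement, avoiding
// the separate existence check.
func putKey(ctx context.Context, tx *sql.Tx, parentID int64, key, value []byte) error {
	_, err := tx.ExecContext(ctx, `
		INSERT INTO kv (parent_id, key, value)
		VALUES ($1, $2, $3)
		ON CONFLICT (parent_id, key) DO UPDATE SET value = EXCLUDED.value`,
		parentID, key, value,
	)
	return err
}
```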
I think the way the sequence is implemented may also be problematic: we have the sequence field directly in the table, which means table locks may need to be held. The sequence gets incremented a lot for stuff like payments or invoices. We may be able to instead split that out into another table that can be updated independently of the main table: lnd/kvdb/sqlbase/readwrite_bucket.go Lines 412 to 437 in e3cc4d7
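A sketch of what splitting the sequence out might look like, assuming a hypothetical kv_sequences table keyed by bucket (not the current lnd schema), so bumping the counter never touches rows in the main KV table:

```go
package kvsketch

import (
	"context"
	"database/sql"
)

// nextSequence bumps and returns the per-bucket sequence using a dedicated
// table. Assumes a unique constraint on bucket_id so ON CONFLICT applies.
func nextSequence(ctx context.Context, tx *sql.Tx, bucketID int64) (uint64, error) {
	var seq uint64
	err := tx.QueryRowContext(ctx, `
		INSERT INTO kv_sequences (bucket_id, seq)
		VALUES ($1, 1)
		ON CONFLICT (bucket_id) DO UPDATE SET seq = kv_sequences.seq + 1
		RETURNING seq`,
		bucketID,
	).Scan(&seq)
	return seq, err
}
```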
I've been able to reduce (but not fully eliminate) the serialization errors. I've also tried treating these errors and the OOM errors as serialization errors. In addition, I've found one more place where we get the same error. I pushed these changes above for discussion. My next step is to try to reduce the number of conflicts based on @Roasbeef's suggestions above. I'm going on vacation for the rest of the week until next Tuesday, so will keep working on this then.
I think treating the OOM errors as serialization errors ended up being a mistake. Going to take that out and push when this run is done. In addition, I'm trying doubling the
I'll clean this up shortly, but responding to a few comments.
Note that the failure in CI from the previous push is the same deadlock we've seen before in htlc_timeout_resolver_extract_preimage_(remote|local), but in a different test, so it didn't end up showing the goroutine dump. I think I have a way to track it down, but am still working on it; it's definitely in waddrmgr. Once I've pinned it down, I'll submit a PR to btcwallet to fix it.
batch/batch_postgres.go
Outdated
// Apply each request in the batch in its own transaction. Requests that
// fail will be retried by the caller.
for _, req := range b.reqs {
I might be able to take this commit out altogether, will check to see after I've fixed the last deadlock I'm working on now. Otherwise, I'll refactor to just skip the batch scheduler for postgres, should be simpler.
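For context, "each request in its own transaction" boils down to something like the following sketch; the request type and the executeIndividually helper are stand-ins, not the actual batch package API:

```go
package batchsketch

import (
	"context"
	"database/sql"
)

// request is a stand-in for a batched DB update.
type request struct {
	update func(tx *sql.Tx) error
}

// executeIndividually runs every request in its own transaction rather than
// coalescing them into one big transaction. A failed request is returned to
// the caller, which is expected to retry serialization failures.
func executeIndividually(ctx context.Context, db *sql.DB, reqs []*request) error {
	for _, req := range reqs {
		tx, err := db.BeginTx(ctx, nil)
		if err != nil {
			return err
		}
		if err := req.update(tx); err != nil {
			// Roll back and surface the error for the caller to retry.
			_ = tx.Rollback()
			return err
		}
		if err := tx.Commit(); err != nil {
			return err
		}
	}
	return nil
}
```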
@@ -21,6 +21,7 @@ var (
 	postgresErrMsgs = []string{
 		"could not serialize access",
 		"current transaction is aborted",
+		"not enough elements in RWConflictPool",
That's right, but I think we can retry this one specifically, whereas the out of shared memory error tends to be less retriable. That's why I'm not detecting the error code, but only the string.
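For illustration, the string-matching approach looks roughly like this sketch (the isRetriablePostgresErr helper is hypothetical; the real code simply keeps the substrings in postgresErrMsgs, as in the diff above):

```go
package sqlsketch

import "strings"

// postgresErrMsgs lists substrings of postgres errors treated as retriable.
var postgresErrMsgs = []string{
	"could not serialize access",
	"current transaction is aborted",
	"not enough elements in RWConflictPool",
}

// isRetriablePostgresErr matches on the message text rather than the
// SQLSTATE code, so only these specific conditions trigger a retry.
func isRetriablePostgresErr(err error) bool {
	if err == nil {
		return false
	}
	for _, msg := range postgresErrMsgs {
		if strings.Contains(err.Error(), msg) {
			return true
		}
	}
	return false
}
```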
sweep/sweeper.go
Outdated
@@ -1624,14 +1625,25 @@ func (s *UtxoSweeper) monitorFeeBumpResult(resultChan <-chan *BumpResult) {
 	}
 
 	case <-s.quit:
-		log.Debugf("Sweeper shutting down, exit fee " +
-			"bump handler")
+		log.Debugf("Sweeper shutting down, exit fee "+
Yeah, was hoping to get this deadlock in CI, but the deadlock happened in another test that didn't produce this output.
I'm able to reproduce the deadlock and think I've figured out how it happens. Running some tests to ensure it's fixed, then if it stays good, will submit a small PR to btcwallet with the fix. I lied, still working on a fix.
lncfg/db.go
Outdated
@@ -38,7 +38,7 @@ const (
 	SqliteBackend = "sqlite"
 	DefaultBatchCommitInterval = 500 * time.Millisecond
 
-	defaultPostgresMaxConnections = 50
+	defaultPostgresMaxConnections = 20
Will check and see if I can tune it to be OK for 50.
Re the shared memory issue, I think we can get around that by bumping up the size of the CI instance we use for these postgres tests: https://docs.github.com/en/actions/using-github-hosted-runners/using-larger-runners/running-jobs-on-larger-runners
I think I've found the deadlock. With more than one DB transaction allowed in parallel for btcwallet, we're running into a deadlock similar to the following. This example is from the UTXO sweeper tests, but it can happen in other situations as well. In one goroutine, the UTXO sweeper calls
So while the top-level … In another goroutine, we see the sweeper call
So the sequence in this case is that the … This has previously been mitigated by the fact that each of these happens inside a database transaction, which never ran in parallel. However, with parallel DB transactions made possible by this change, the inner deadlock is exposed. I'll submit a PR next week to btcwallet to fix this, and then clean up this PR and respond to the comments above.
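To make the shape of the problem concrete, here is a purely illustrative lock-inversion sketch, not the actual btcwallet/waddrmgr code: each goroutine holds one lock while waiting for the other, and the inversion only surfaces once both paths can actually run concurrently.

```go
package deadlocksketch

import "sync"

var (
	walletMu sync.Mutex // stand-in for the outer wallet-level lock
	storeMu  sync.Mutex // stand-in for the inner address-manager lock
)

// pathA mirrors the first goroutine: outer lock first, then inner.
func pathA() {
	walletMu.Lock()
	defer walletMu.Unlock()

	storeMu.Lock() // blocks if pathB already holds storeMu
	defer storeMu.Unlock()
}

// pathB mirrors the second goroutine: inner lock first, then outer.
func pathB() {
	storeMu.Lock()
	defer storeMu.Unlock()

	walletMu.Lock() // blocks if pathA already holds walletMu
	defer walletMu.Unlock()
}
```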
I've submitted btcsuite/btcwallet#967 to fix the deadlock mentioned above.
Could you share the code line that does this? Tried to find it in
Love all the in-depth analysis. Just wanna mention it may be helpful to run this on top of #9260 to reduce some of the noise created from the itest flakes so the results can be more indicative. I ran it once here and got a failure in postgres DB backend.
The other thing is I didn't realize how easy it is to trigger a serialization error in postgres. By simply running list_channels locally,
make itest icase=list_channels backend=bitcoind dbbackend=postgres nativesql=true
I got a concurrent update in channeldb:
2024-11-26 11:13:39.317 UTC [89] STATEMENT: INSERT INTO channeldb_kv (key, value, parent_id) VALUES($1, $2, $3) ON CONFLICT (key, parent_id) WHERE parent_id IS NOT NULL DO UPDATE SET value=$2 WHERE channeldb_kv.value IS NOT NULL
2024-11-26 11:13:39.379 UTC [62] ERROR: could not serialize access due to concurrent update
@@ -221,11 +221,11 @@ func testMultiHopPayments(ht *lntest.HarnessTest) {
 	// Dave and Alice should both have forwards and settles for
 	// their role as forwarding nodes.
 	ht.AssertHtlcEvents(
-		daveEvents, numPayments, 0, numPayments, 0,
+		daveEvents, numPayments, 0, numPayments*2, 0,
can you explain in the commit msg why we need this change?
Will do. I'll condense the explanation from here.
Makefile
Outdated
# each can run concurrently. Note that many of the settings here are
# specifically for integration testing and are not fit for running
# production nodes.
docker run --name lnd-postgres -e POSTGRES_PASSWORD=postgres -p 6432:5432 -d postgres:13-alpine -N 500 -c max_pred_locks_per_transaction=1024 -c max_locks_per_transaction=128 -c jit=off -c work_mem=8MB -c checkpoint_timeout=10min -c enable_seqscan=off
I just realized we are using a single postgres instance for all the tranches... this is probably wrong, as we should use an independent DB and chain backend for each tranche since they are supposed to be isolated... I need to think about how to fix this.
I think it's OK to use a single postgres instance for now and fix this in a follow-up, since I've got it somewhat tuned for this? Otherwise we'll definitely have more overhead on the runners, even if we turn down resource use per DB container.
yeah totally, no need to block on this since it's pre-existing.
Makefile
Outdated
# specifically for integration testing and are not fit for running
# production nodes.
docker run --name lnd-postgres -e POSTGRES_PASSWORD=postgres -p 6432:5432 -d postgres:13-alpine -N 500 -c max_pred_locks_per_transaction=1024 -c max_locks_per_transaction=128 -c jit=off -c work_mem=8MB -c checkpoint_timeout=10min -c enable_seqscan=off
docker logs -f lnd-postgres >itest/postgres-log 2>&1 &
can we name this postgres.log instead and add it to the gitignore?
I started it as named postgres.log, but the itest-only target, which runs db-instance first, does this:
rm -rf itest/*.log itest/.logs-*; date
This means that the log file is created by db-instance and then immediately deleted just as the tests are starting. If you tail -f the log file, you can still see it, but it's not there for analysis afterwards.
I'll make the above line a separate target (maybe clean-itest-logs) and run it before db-instance. Then we can have the file named postgres.log. Note that this will also have to be changed around if/when we move to a DB instance per tranche.
cool yeah let's fix it when we move to a DB instance per tranche. Just FYI, we usually do the log cleanup here:
lnd/lntest/node/harness_node.go
Lines 669 to 674 in c8cfa59
// Make sure log file is closed and renamed if necessary.
finalizeLogfile(hn)

// Rename the etcd.log file if the node was running on embedded
// etcd.
finalizeEtcdLog(hn)
Sure:
lnd/lnwallet/btcwallet/btcwallet.go Line 1375 in fbeab72
Inside that, it calls
Yeah, it's really easy to get DB serialization errors, even as the node just does its background work synchronizing the chain, once we allow multiple transactions to happen at the same time. The biggest reason is the INSERT statements, which assign a new value in the id column to the new row. This makes it impossible to serialize the transactions, since the IDs assigned are order dependent.
kvdb/sqlbase/db.go
Outdated
-	return catchPanic(func() error { return f(kvTx) })
+	err := f(kvTx)
+	// Return the internal error first in case we need to retry and
Will fix, thanks!
@@ -17,6 +17,11 @@ var (
 	// ErrRetriesExceeded is returned when a transaction is retried more
 	// than the max allowed valued without a success.
 	ErrRetriesExceeded = errors.New("db tx retries exceeded")
 
+	postgresErrMsgs = []string{
Will fix this as well, thanks!
I've cleaned this up a bit, responded to most of the comments in the code, and am now testing the DB tuning some more (maxconnections setting vs. container startup flags). Also testing eliminating one or two potentially extraneous commits to see if things still work after the latest bugfixes I've added. Once I have that working, I'll rebase on #9260 and force-push, hopefully later this afternoon. I'd still expect failures on blockchain sync until btcsuite/btcwallet#967 is merged/tagged and I can reference it. Looking forward to winding this up and getting it merged!
force-pushed from c9dca65 to 9bdadf7
Huh, all the itests failed. It seems to be working for me locally. I see the problem now. I'll rebase on master again, and submit another PR from here to the other branch.
force-pushed from 9bdadf7 to f9cdf92
I've rebased on
This reverts commit 67419a7.
To make this itest work reliably with multiple parallel SQL transactions, we need to count both the settle and final HTLC events. Otherwise, sometimes the final events from earlier forwards are counted before the forward events from later forwards, causing a miscount of the settle events. If we expect both the settle and final event for each forward, we don't miscount.
force-pushed from a332e87 to b218301
force-pushed from b218301 to 689d5dd
Doing updates in #9313 now, though I can go back to here if that's preferred.
Change Description
Fix #9229 by reapplying #8644 and:
- batch package: batch requests into their own transactions for the postgres db backend to reduce serialization errors
- channeldb package: treat current transaction is aborted errors as serialization errors, in case we hit a serialization error, ignore it, and then get this error in a subsequent call to postgres
- db-instance: postgres flags in the Makefile per @djkazic's recommendations
- maxconnections parameter for postgres DBs set to 20 instead of 50 by default

Steps to Test
See the failing itests prior to the fix, and the passing itests after the fix.