Reapply #8644 #9242
base: master
Conversation
force-pushed from 6065746 to 27144ba
Waiting to push fix commit until CI completes for the reapplication.
force-pushed from 7a40c4a to 1e7b192
Looks like there are still a couple of itests failing. Will keep working on this next week.
The error message
This looks relevant, re some of the errors I see in the latest CI run: https://stackoverflow.com/a/42303225
Perhaps part of the issue is with the
Based on the SO link above, we might also be lacking some needed indexes.
With closing the channel and a couple of other tests, I'm seeing logs similar to:
when I reproduce locally, as well as in the CI logs. I'm going to pull on that thread first... On the test config side, also seeing these:
I think the first issue above is a code issue, the second is a config issue, and together with the other config issue in my comment above, those are the three major failures still happening. I think the
This looks like a case where we
Yep, looking into why that isn't caught by the panic/recover mechanism.
It was actually a lack of error checking in
force-pushed from 1e67a84 to 899ae59
Looks better as far as the errors on closing channels. Will keep working tomorrow to eliminate the other errors.
Hmm, so we don't have great visibility into how much memory these CI machines have. Perhaps we need to modify the connection settings to reduce the number of active connections, and also tune other params. @djkazic has been working on a postgres+lnd tuning/perf guide that I think we can eventually check directly into lnd.
This is also very funky: lnd/kvdb/sqlbase/readwrite_bucket.go Lines 336 to 363 in e3cc4d7
We do two queries just to delete: a select to see if the key exists, then the delete, instead of just attempting the delete. Stepping back a minute: perhaps the issue is with this flawed KV abstraction we have. Perhaps we should just re-create a better hierarchical KV table from scratch. We use
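As a rough illustration of the "just try to delete" idea, a minimal sketch in Go (the kv table, parent_id/key columns, and deleteKey helper are made up for the example, not lnd's actual sqlbase schema):

```go
package kvsketch

import (
	"context"
	"database/sql"
)

// deleteKey removes a key in a single statement instead of doing a
// SELECT-then-DELETE round trip. Deleting a key that doesn't exist is
// simply a no-op, which matches bucket delete semantics.
func deleteKey(ctx context.Context, tx *sql.Tx, parentID int64, key []byte) error {
	_, err := tx.ExecContext(ctx,
		"DELETE FROM kv WHERE parent_id = $1 AND key = $2",
		parentID, key,
	)
	return err
}
```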
Here's another instance of duplicated work in lnd/kvdb/sqlbase/readwrite_bucket.go Lines 149 to 187 in e3cc4d7
We select to see if it exists, then potentially do the insert again. Instead, we can just do an
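For comparison, a hedged sketch of the single-statement upsert (again with an illustrative kv table; it assumes a unique constraint on (parent_id, key) for ON CONFLICT to target):

```go
package kvsketch

import (
	"context"
	"database/sql"
)

// putKey inserts the key or updates its value in one statement, avoiding
// the separate existence check.
func putKey(ctx context.Context, tx *sql.Tx, parentID int64, key, value []byte) error {
	_, err := tx.ExecContext(ctx, `
		INSERT INTO kv (parent_id, key, value)
		VALUES ($1, $2, $3)
		ON CONFLICT (parent_id, key) DO UPDATE SET value = EXCLUDED.value`,
		parentID, key, value,
	)
	return err
}
```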
I think the way the sequence is implemented may also be problematic: we have the sequence field directly in the table, which means table locks may need to be held. The sequence gets incremented a lot for stuff like payments or invoices. We may be able to instead split that out into another table that can be updated independently of the main table: lnd/kvdb/sqlbase/readwrite_bucket.go Lines 412 to 437 in e3cc4d7
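A sketch of what splitting the sequence out might look like, assuming a hypothetical kv_sequences table keyed by bucket (not the current lnd schema), so bumping the counter never touches rows in the main KV table:

```go
package kvsketch

import (
	"context"
	"database/sql"
)

// nextSequence bumps and returns the per-bucket sequence using a dedicated
// table. Assumes a unique constraint on bucket_id so ON CONFLICT applies.
func nextSequence(ctx context.Context, tx *sql.Tx, bucketID int64) (uint64, error) {
	var seq uint64
	err := tx.QueryRowContext(ctx, `
		INSERT INTO kv_sequences (bucket_id, seq)
		VALUES ($1, 1)
		ON CONFLICT (bucket_id) DO UPDATE SET seq = kv_sequences.seq + 1
		RETURNING seq`,
		bucketID,
	).Scan(&seq)
	return seq, err
}
```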
I've been able to reduce (but not fully eliminate) the serialization errors. I've also tried treating these errors and the OOM errors as serialization errors. In addition, I've found one more place where we get the same error. I pushed these changes above for discussion. My next step is to try to reduce the number of conflicts based on @Roasbeef's suggestions above. I'm going on vacation for the rest of the week until next Tuesday, so will keep working on this then.
I think treating the OOM errors as serialization errors ended up being a mistake. Going to take that out and push when this run is done. In addition, I'm trying doubling the
I'll clean this up shortly, but responding to a few comments.
Note that the failure in CI from the previous push is the same deadlock we've seen before in htlc_timeout_resolver_extract_preimage_(remote|local), but in a different test, so it didn't end up showing the goroutine dump. I think I have a way to track it down, but am still working on it; it's definitely in waddrmgr. Once I've pinned it down, I'll submit a PR to btcwallet to fix it.
batch/batch_postgres.go
Outdated
// Apply each request in the batch in its own transaction. Requests that
// fail will be retried by the caller.
for _, req := range b.reqs {
I might be able to take this commit out altogether, will check to see after I've fixed the last deadlock I'm working on now. Otherwise, I'll refactor to just skip the batch scheduler for postgres, should be simpler.
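For context, "each request in its own transaction" boils down to something like the following sketch; the request type and the executeIndividually helper are stand-ins, not the actual batch package API:

```go
package batchsketch

import (
	"context"
	"database/sql"
)

// request is a stand-in for a batched DB update.
type request struct {
	update func(tx *sql.Tx) error
}

// executeIndividually runs every request in its own transaction rather than
// coalescing them into one big transaction. A failed request is returned to
// the caller, which is expected to retry serialization failures.
func executeIndividually(ctx context.Context, db *sql.DB, reqs []*request) error {
	for _, req := range reqs {
		tx, err := db.BeginTx(ctx, nil)
		if err != nil {
			return err
		}
		if err := req.update(tx); err != nil {
			// Roll back and surface the error for the caller to retry.
			_ = tx.Rollback()
			return err
		}
		if err := tx.Commit(); err != nil {
			return err
		}
	}
	return nil
}
```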
@@ -21,6 +21,7 @@ var (
 	postgresErrMsgs = []string{
 		"could not serialize access",
 		"current transaction is aborted",
+		"not enough elements in RWConflictPool",
That's right, but I think we can retry this one specifically, whereas the out of shared memory error tends to be less retriable. That's why I'm not detecting the error code, but only the string.
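For illustration, the string-matching approach looks roughly like this sketch (the isRetriablePostgresErr helper is hypothetical; the real code simply keeps the substrings in postgresErrMsgs, as in the diff above):

```go
package sqlsketch

import "strings"

// postgresErrMsgs lists substrings of postgres errors treated as retriable.
var postgresErrMsgs = []string{
	"could not serialize access",
	"current transaction is aborted",
	"not enough elements in RWConflictPool",
}

// isRetriablePostgresErr matches on the message text rather than the
// SQLSTATE code, so only these specific conditions trigger a retry.
func isRetriablePostgresErr(err error) bool {
	if err == nil {
		return false
	}
	for _, msg := range postgresErrMsgs {
		if strings.Contains(err.Error(), msg) {
			return true
		}
	}
	return false
}
```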
sweep/sweeper.go
Outdated
@@ -1624,14 +1625,25 @@ func (s *UtxoSweeper) monitorFeeBumpResult(resultChan <-chan *BumpResult) {
 	}
 
 	case <-s.quit:
-		log.Debugf("Sweeper shutting down, exit fee " +
-			"bump handler")
+		log.Debugf("Sweeper shutting down, exit fee "+
Yeah, was hoping to get this deadlock in CI, but the deadlock happened in another test that didn't produce this output.
I'm able to reproduce the deadlock and think I've figured out how it happens. Running some tests to ensure it's fixed, then if it stays good, will submit a small PR to btcwallet with the fix. I lied, still working on a fix.
lncfg/db.go
Outdated
@@ -38,7 +38,7 @@ const (
 	SqliteBackend = "sqlite"
 	DefaultBatchCommitInterval = 500 * time.Millisecond
 
-	defaultPostgresMaxConnections = 50
+	defaultPostgresMaxConnections = 20
Will check and see if I can tune it to be OK for 50.
Re the shared memory issue, I think we can get around that by bumping up the size of the CI instance we use for these postgres tests: https://docs.github.com/en/actions/using-github-hosted-runners/using-larger-runners/running-jobs-on-larger-runners
I think I've found the deadlock. With more than one DB transaction allowed in parallel for btcwallet, we're running into a deadlock similar to the following. This example is from the UTXO sweeper tests, but it can happen in other situations as well. In one goroutine, the UTXO sweeper calls
So while the top-level … In another goroutine, we see the sweeper call
So the sequence in this case is that the … This has previously been mitigated by the fact that each of these happens inside a database transaction, which never ran in parallel. However, with parallel DB transactions made possible by this change, the inner deadlock is exposed. I'll submit a PR next week to btcwallet to fix this, and then clean up this PR and respond to the comments above.
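To make the shape of the problem concrete, here is a purely illustrative lock-inversion sketch, not the actual btcwallet/waddrmgr code: each goroutine holds one lock while waiting for the other, and the inversion only surfaces once both paths can actually run concurrently.

```go
package deadlocksketch

import "sync"

var (
	walletMu sync.Mutex // stand-in for the outer wallet-level lock
	storeMu  sync.Mutex // stand-in for the inner address-manager lock
)

// pathA mirrors the first goroutine: outer lock first, then inner.
func pathA() {
	walletMu.Lock()
	defer walletMu.Unlock()

	storeMu.Lock() // blocks if pathB already holds storeMu
	defer storeMu.Unlock()
}

// pathB mirrors the second goroutine: inner lock first, then outer.
func pathB() {
	storeMu.Lock()
	defer storeMu.Unlock()

	walletMu.Lock() // blocks if pathA already holds walletMu
	defer walletMu.Unlock()
}
```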
I've submitted btcsuite/btcwallet#967 to fix the deadlock mentioned above.
Could you share the code line that does this? Tried to find it in
Love all the in-depth analysis. Just wanna mention it may be helpful to run this on top of #9260 to reduce some of the noise created from the itest flakes so the results can be more indicative. I ran it once here and got a failure in postgres DB backend.
The other thing is I didn't realize how easy it is to trigger a serialization error in postgres. By simply running list_channels locally,
make itest icase=list_channels backend=bitcoind dbbackend=postgres nativesql=true
I got a concurrent update in channeldb:
2024-11-26 11:13:39.317 UTC [89] STATEMENT: INSERT INTO channeldb_kv (key, value, parent_id) VALUES($1, $2, $3) ON CONFLICT (key, parent_id) WHERE parent_id IS NOT NULL DO UPDATE SET value=$2 WHERE channeldb_kv.value IS NOT NULL
2024-11-26 11:13:39.379 UTC [62] ERROR: could not serialize access due to concurrent update
@@ -221,11 +221,11 @@ func testMultiHopPayments(ht *lntest.HarnessTest) {
 	// Dave and Alice should both have forwards and settles for
 	// their role as forwarding nodes.
 	ht.AssertHtlcEvents(
-		daveEvents, numPayments, 0, numPayments, 0,
+		daveEvents, numPayments, 0, numPayments*2, 0,
can you explain in the commit msg why we need this change?
Will do. I'll condense the explanation from here.
Makefile
Outdated
# each can run concurrently. Note that many of the settings here are
# specifically for integration testing and are not fit for running
# production nodes.
docker run --name lnd-postgres -e POSTGRES_PASSWORD=postgres -p 6432:5432 -d postgres:13-alpine -N 500 -c max_pred_locks_per_transaction=1024 -c max_locks_per_transaction=128 -c jit=off -c work_mem=8MB -c checkpoint_timeout=10min -c enable_seqscan=off
I just realized we are using a single postgres instance for all the tranches... this is probably wrong, as we should use an independent DB and chain backend for each tranche since they are supposed to be isolated... I need to think about how to fix this.
I think it's OK to use a single postgres instance for now and fix this in a follow-up, since I've got it somewhat tuned for this? Otherwise we'll definitely have more overhead on the runners, even if we turn down resource use per DB container.
yeah totally, no need to block on this since it's pre-existing.
Makefile
Outdated
# specifically for integration testing and are not fit for running
# production nodes.
docker run --name lnd-postgres -e POSTGRES_PASSWORD=postgres -p 6432:5432 -d postgres:13-alpine -N 500 -c max_pred_locks_per_transaction=1024 -c max_locks_per_transaction=128 -c jit=off -c work_mem=8MB -c checkpoint_timeout=10min -c enable_seqscan=off
docker logs -f lnd-postgres >itest/postgres-log 2>&1 &
can we name this postgres.log instead and add it to the gitignore?
I started it as named postgres.log, but the itest-only target, which runs db-instance first, does this:
rm -rf itest/*.log itest/.logs-*; date
This means that the log file is created by db-instance and then immediately deleted just as the tests are starting. If you tail -f the log file, you can still see it, but it's not there for analysis afterwards.
I'll make the above line a separate target (maybe clean-itest-logs) and run it before db-instance. Then we can have the file named postgres.log. Note that this will also have to be changed around if/when we move to a DB instance per tranche.
cool yeah let's fix it when we move to a DB instance per tranche. Just FYI, we usually do the log cleanup here:
lnd/lntest/node/harness_node.go
Lines 669 to 674 in c8cfa59
// Make sure log file is closed and renamed if necessary.
finalizeLogfile(hn)

// Rename the etcd.log file if the node was running on embedded
// etcd.
finalizeEtcdLog(hn)
Sure:
lnd/lnwallet/btcwallet/btcwallet.go Line 1375 in fbeab72
Inside that, it calls
Yeah, it's really easy to get DB serialization errors, even as the node just does its background work synchronizing the chain, once we allow multiple transactions to happen at the same time. The biggest reason is the INSERT statements, which assign a new value in the id column to the new row. This makes it impossible to serialize the transactions, since the IDs assigned are order dependent.
kvdb/sqlbase/db.go
Outdated
-	return catchPanic(func() error { return f(kvTx) })
+	err := f(kvTx)
+	// Return the internal error first in case we need to retry and
Will fix, thanks!
@@ -17,6 +17,11 @@ var (
 	// ErrRetriesExceeded is returned when a transaction is retried more
 	// than the max allowed valued without a success.
 	ErrRetriesExceeded = errors.New("db tx retries exceeded")
 
+	postgresErrMsgs = []string{
Will fix this as well, thanks!
I've cleaned this up a bit, responded to most of the comments in the code, and am now testing the DB tuning some more (maxconnections setting vs. container startup flags). Also testing eliminating one or two potentially extraneous commits to see if things still work after the latest bugfixes I've added. Once I have that working, I'll rebase on #9260 and force-push, hopefully later this afternoon. I'd still expect failures on blockchain sync until btcsuite/btcwallet#967 is merged/tagged and I can reference it. Looking forward to winding this up and getting it merged!
force-pushed from c9dca65 to 9bdadf7
Huh, all the itests failed. It seems to be working for me locally. I see the problem now. I'll rebase on master again, and submit another PR from here to the other branch.
force-pushed from 9bdadf7 to f9cdf92
I've rebased on
This reverts commit 67419a7.
To make this itest work reliably with multiple parallel SQL transactions, we need to count both the settle and final HTLC events. Otherwise, sometimes the final events from earlier forwards are counted before the forward events from later forwards, causing a miscount of the settle events. If we expect both the settle and final event for each forward, we don't miscount.
force-pushed from a332e87 to b218301
force-pushed from b218301 to 689d5dd
Doing updates in #9313 now, though I can go back to here if that's preferred.
Change Description
Fix #9229 by reapplying #8644 and:
- batch package: batch requests into their own transactions for the postgres db backend to reduce serialization errors
- channeldb package: treat current transaction is aborted errors as serialization errors, in case we hit a serialization error, ignore it, and then get this error in a subsequent call to postgres
- db-instance: postgres flags in the Makefile per @djkazic's recommendations
- maxconnections parameter for postgres DBs set to 20 instead of 50 by default

Steps to Test
See the failing itests prior to the fix, and the passing itests after the fix.