
Connection not released when request is cancelled #144

Closed
rogtejada opened this issue Nov 3, 2021 · 20 comments
Labels
for: external-project For an external project and not something we can fix

Comments

@rogtejada

Bug Report

Versions

  • Driver: 0.8.8.RELEASE
  • Database: PostgreSQL 12.7
  • Java: 11
  • OS: Windows

Current Behavior

This issue was first opened on r2dbc-postgresql; some discussion can be seen there:
pgjdbc/r2dbc-postgresql#459

Then it was opened here:
#140

The result was a fix in the pool, but it does not solve the problem.

We have created a new project without Spring dependencies and connections still get stuck.

You can find the source code here: https://github.com/jfrossetto/r2dbc-stress-cancel

pid   datname   usename   application_name  backend_start            query_start              query
3065  postgres  postgres  sample            2021-11-01 10:02:41.308  2021-11-01 10:02:45.159  SELECT 1
3064  postgres  postgres  sample            2021-11-01 10:02:41.307  2021-11-01 10:02:46.358  SELECT 1
3066  postgres  postgres  sample            2021-11-01 10:02:45.640  2021-11-01 10:02:50.739  SELECT 1

LOGFILE: https://github.com/jfrossetto/r2dbc-stress-cancel/blob/master/log_stresscancel.txt

There was another issue opened with the same problem here: #143

@rogtejada
Author

rogtejada commented Nov 3, 2021

@mp911de

Looking into #143, I have cloned the source code posted there: https://github.com/stefannegele/connection-leak-demo

The behavior is the same: the connection is not released after cancellation. I just needed to bump the number of executions from 5 to 10 to achieve the same result.

I have also tested against the new 0.8.8 version and the problem still occurs.

But it seems simpler to reproduce with the code presented in the repository above.

@rogtejada
Author

@mp911de

I was looking over the implementation of the connection pool:
https://github.com/r2dbc/r2dbc-pool/blob/main/src/main/java/io/r2dbc/pool/ConnectionPool.java#:~:text=PoolBuilder%3CConnection%2C%20PoolConfig,Runtime.getRuntime().availableProcessors())%3B

The PoolBuilder has a method to register a releaseHandler. Is it possible to force the implementation of a release method in the SPI Connection interface? That way PostgresqlConnection in r2dbc-postgresql would need to have a release mechanism, and it could be registered on pool creation.

I have tried using commitTransaction as the release handler and it seems that the connection gets released, but a commit will not always be the best approach to release the connection.
I would like to hear your opinion on that matter.

@DenisMd

DenisMd commented Nov 5, 2021

I think we have the same issue in production, on version 0.8.7.RELEASE.

Connection leak as seen in Grafana (screenshot omitted).

Our service sometimes gets closed by gRPC when clients refuse their connections. I'll provide more information about the problem, but the situation is not good :(

Does 0.9.0.M2 solve the problem?

@mp911de
Member

mp911de commented Nov 8, 2021

The PoolBuilder has a method to register a releaseHandler. Is it possible to force the implementation of a release method in the SPI Connection interface? That way PostgresqlConnection in r2dbc-postgresql would need to have a release mechanism, and it could be registered on pool creation.

You can register a customizer through ConnectionPoolConfiguration.Builder.customizer(…) if you want to. Note that releaseHandler only gets called when the connection is released by calling Connection.close().

I'm still wondering about the cause. If you can reliably reproduce the issue, I would ask you to either register a doOnDiscard(…) callback that closes the connection (close + subscribe) or move doFinally around. I think the timeout/onErrorMap/retry operators might interfere with the cancellation, and that the cleanup guards only catch some cancellations but not all of them.
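
For illustration, here is a minimal sketch (not r2dbc-pool's own code) of wiring a release action through that customizer; the class name, connection URL, and the use of commitTransaction() as the release action are only placeholders, the latter being what was experimented with earlier in this thread:

import io.r2dbc.pool.ConnectionPool;
import io.r2dbc.pool.ConnectionPoolConfiguration;
import io.r2dbc.spi.Connection;
import io.r2dbc.spi.ConnectionFactories;
import io.r2dbc.spi.ConnectionFactory;

class ReleaseHandlerSketch {

    static ConnectionPool createPool() {
        // Plain driver connection factory; URL and database are placeholders.
        ConnectionFactory connectionFactory =
                ConnectionFactories.get("r2dbc:postgresql://postgres@localhost/sample");

        ConnectionPoolConfiguration configuration = ConnectionPoolConfiguration
                .builder(connectionFactory)
                // customizer(...) exposes the underlying reactor-pool PoolBuilder,
                // so a releaseHandler can be attached to it.
                .customizer(poolBuilder -> poolBuilder.releaseHandler(Connection::commitTransaction))
                .build();

        return new ConnectionPool(configuration);
    }
}

As noted above, this release path only runs when the connection is actually released via Connection.close(), so on its own it would not cover the cancelled-request case.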

@mp911de
Member

mp911de commented Nov 8, 2021

@DenisMd 0.9 updates to the new R2DBC spec, it's not a rewrite of the pool or the like.

@DenisMd

DenisMd commented Nov 8, 2021

I was able to reproduce the leaks too.
Simple code for a WebFlux request:

log.info("Before 1 sec")
databaseClient.sql("select pg_sleep(1)").await()
log.info("After 1 sec")

Normal log for a single standalone request:

2021-11-08 14:42:24.487 INFO [,2f3d52ee53e5e389,2f3d52ee53e5e389] 13589 --- [ctor-http-nio-6] a.p.s.core.service.SdiService$Companion : Before 1 sec
2021-11-08 14:42:24.487 DEBUG [,2f3d52ee53e5e389,2f3d52ee53e5e389] 13589 --- [ctor-http-nio-6] io.r2dbc.pool.ConnectionPool : Obtaining new connection from the driver
2021-11-08 14:42:24.487 DEBUG [,2f3d52ee53e5e389,2f3d52ee53e5e389] 13589 --- [ctor-http-nio-6] io.r2dbc.postgresql.QUERY : Executing query: select pg_sleep(1)
2021-11-08 14:42:25.590 DEBUG [,2f3d52ee53e5e389,2f3d52ee53e5e389] 13589 --- [actor-tcp-nio-1] io.r2dbc.pool.PooledConnection : Releasing connection
2021-11-08 14:42:25.590 INFO [,2f3d52ee53e5e389,2f3d52ee53e5e389] 13589 --- [actor-tcp-nio-1] a.p.s.core.service.SdiService$Companion : After 1 sec

Then I launched a JMeter load scenario (100 requests in parallel and a 3.5 s timeout for the response).

After completion, the Prometheus metric was:
r2dbc_pool_acquired_connections{name="connectionFactory",} 5.0

and it never went down.

Build configuration:

plugins {
    id("application")
    id("org.springframework.boot") version "2.5.1"
    id("io.spring.dependency-management") version "1.0.11.RELEASE"
    kotlin("jvm") version "1.5.10"
    kotlin("plugin.spring") version "1.5.10"
    id("org.jlleitschuh.gradle.ktlint") version "10.1.0"
    id("org.jetbrains.kotlin.kapt") version "1.5.10"
}

dependencies {
    implementation("org.springframework.boot:spring-boot-starter-actuator")
    implementation("org.springframework.boot:spring-boot-starter-webflux")

    // Database r2dbc
    implementation("org.springframework.boot:spring-boot-starter-data-r2dbc")
    runtimeOnly("io.r2dbc:r2dbc-postgresql")
}
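
Outside of WebFlux and JMeter, the same cancellation pattern can be sketched in plain Java against the pool; this is only a rough reproduction sketch, assuming a ConnectionPool instance named pool (for example one built as in the earlier sketch):

import java.time.Duration;

import io.r2dbc.pool.ConnectionPool;
import io.r2dbc.spi.Connection;
import reactor.core.Disposable;
import reactor.core.publisher.Mono;

class CancelMidQuerySketch {

    static void runOnce(ConnectionPool pool) {
        // Acquire a connection, run the slow query, and release it via close() when done.
        Disposable inFlight = Mono.usingWhen(
                        pool.create(),
                        connection -> Mono.from(connection.createStatement("select pg_sleep(1)").execute())
                                .flatMap(result -> Mono.from(result.getRowsUpdated())),
                        Connection::close)
                .subscribe();

        // Cancel well before the query finishes, which is effectively what the
        // client-side timeout does; with the leak present, the acquired-connections
        // metric can stay elevated afterwards.
        Mono.delay(Duration.ofMillis(200)).subscribe(tick -> inFlight.dispose());
    }
}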

@DenisMd

DenisMd commented Nov 8, 2021

Interesting case: if I execute the SQL request in a transaction, there are no leaks.

log.info("Before 1 sec")
transactionalOperator.executeAndAwait {
    databaseClient.sql("select pg_sleep(1)").await()
}
log.info("After 1 sec")

@mp911de mp911de added the type: bug A general bug label Nov 10, 2021
@mp911de
Member

mp911de commented Nov 10, 2021

Thanks for your reproducers. I'm finally able to reproduce the issue although it requires multiple attempts. Trying to figure out what's happening and where connections get lost.

mp911de added a commit that referenced this issue Nov 10, 2021
mp911de added a commit that referenced this issue Nov 10, 2021
@mp911de
Member

mp911de commented Nov 10, 2021

Refining the allocation and cancellation guards indeed helped to reduce (fix?) the number of test runs in which connections were lost in the connection pool. While tracing code paths that might skip connection release, I found that Flux.usingWhen(…) seems to see a connection that is emitted but not released. I need to analyze this further.
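
A self-contained sketch, with no R2DBC types involved, of the kind of situation described above: cancel a usingWhen pipeline right around the moment the resource is emitted and check whether the asynchronous cleanup ever ran. Whether the counter actually stays at 1 depends on timing and on the Reactor version, so this is an illustration rather than a deterministic test:

import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;

import reactor.core.Disposable;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

class UsingWhenCancelSketch {

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger open = new AtomicInteger();

        Flux<String> guarded = Flux.usingWhen(
                // "acquire" the resource
                Mono.fromCallable(() -> { open.incrementAndGet(); return "resource"; }),
                // use it slowly enough that cancellation can land mid-flight
                resource -> Flux.just(resource).delayElements(Duration.ofMillis(50)),
                // asynchronous cleanup that should also run on cancellation
                resource -> Mono.fromRunnable(open::decrementAndGet));

        Disposable subscription = guarded.subscribe();
        subscription.dispose(); // cancel almost immediately

        Thread.sleep(200);
        // If the cleanup was skipped, this still prints 1, i.e. the resource leaked.
        System.out.println("open resources: " + open.get());
    }
}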

@mp911de
Member

mp911de commented Nov 10, 2021

I reported the issue back to the Reactor team to get their input.

@rogtejada
Author

@mp911de We could still reproduce the issue even with the recent changes to r2dbc-pool. But we agree that the problem may lie in the Reactor implementation. We will investigate on our side; any new discovery will be posted here.

@jfrossetto

jfrossetto commented Nov 11, 2021

@mp911de
in our tests we added the following code around doFinally:

                    return Operators.discardOnCancel(conn, () -> {
                        logger.debug("release discardOnCancel");
                        ref.release().subscribe();
                        return false;
                    })
                            .doFinally(s -> {
                                logger.debug("doFinally conn {}", s.toString());
                                if (!connection.isClosed()) {
                                    logger.debug("close onfinally");
                                    connection.close().subscribe();
                                }

                            });
                })

and with this all connections were released ... but we don't know if this is the best solution.

@mp911de
Member

mp911de commented Nov 11, 2021

We could still reproduce the issue even with the recent changes to r2dbc-pool

Please note that we cannot treat the problem as solved unless the leak in usingWhen is fixed. That lies outside of R2DBC Pool.

@rogtejada
Author

rogtejada commented Nov 12, 2021

@mp911de

This approach seems like a good idea; the connections stop getting stuck, but I'm not sure if it closes the connection prematurely.
With this change, 3 unit tests fail:

shouldReusePooledConnectionAfterTimeout() -> no exception signal is emitted
shouldReportMetrics() -> it expects acquiredSize to be one, but it is zero since the connection is already closed
shouldCreateMultipleConnections() -> it expects two connections, but one of them is already closed

Even though the behaviour seems correct, the unit tests show that somehow the connection is being closed when it should still be open. @mp911de, do you have any concerns about that? I would like to hear your opinion on that matter.

This is not a final solution but could be a workaround until the Reactor problem is solved.


@figroc

figroc commented Dec 15, 2021

We could still reproduce the issue even with the recent changes to r2dbc-pool

Please note that we cannot treat the problem as solved unless the leak in usingWhen is fixed. That lies outside of R2DBC Pool.

Is this usingWhen used in r2dbc-proxy? I can't find any usingWhen in r2dbc-pool.

@Ernir

Ernir commented Dec 27, 2021

Commenting for increased visibility: it looks like the reactor-core team has a potential solution, see reactor/reactor-core#2859.

If you can reproduce the issue locally, confirmation that the fix works would be appreciated. See reactor/reactor-core#2836 (comment).

@rogtejada
Author

@Ernir @mp911de

The change made in reactor-core seems to fix this issue. However, when running a lot of simultaneous requests (50k+), an error shows up:

Error starting response. Replying error status io.r2dbc.postgresql.client.ReactorNettyClient$RequestQueueException: Cannot exchange messages because the request queue limit is exceeded

I guess it is just a limitation of a single instance processing a lot of requests while trying to simulate the lock behavior; in a real application these requests would be spread across multiple instances and the queue limit would be enough, am I right? Are there any concerns we should address regarding this queue limitation?

@figroc

figroc commented Dec 31, 2021

@rogtejada Have you published a snapshot? We would like to test it in our project.

@rogtejada
Author

rogtejada commented Dec 31, 2021

@figroc No. An issue was opened on Project Reactor (reactor/reactor-core#2836) and a solution was proposed in this PR (reactor/reactor-core#2859). I just built that branch locally and used the resulting artifact to run my tests.

@mp911de mp911de added for: external-project For an external project and not something we can fix and removed type: bug A general bug labels Jan 4, 2022
@mp911de
Member

mp911de commented Jan 4, 2022

I can confirm that upgrading to Reactor Core 3.4.14 snapshots fixes the issue.

@mp911de mp911de closed this as completed Jan 4, 2022