
Connection not released when request is cancelled #144

Closed
rogtejada opened this issue Nov 3, 2021 · 20 comments
Labels
for: external-project For an external project and not something we can fix

Comments

@rogtejada

Bug Report

Versions

  • Driver: 0.8.8.RELEASE
  • Database: PostgreSQL 12.7
  • Java: 11
  • OS: Windows

Current Behavior

This issue was first opened on r2dbc-postgresql; some discussion can be seen there:
pgjdbc/r2dbc-postgresql#459

Then it was opened here:
#140

The result was a fix in the pool, but it does not solve the problem.

We have created a new project without Spring dependencies and connections still get stuck.

You can find the source code here: https://github.com/jfrossetto/r2dbc-stress-cancel

pid   datname   usename   application_name  backend_start            query_start              query
3065  postgres  postgres  sample            2021-11-01 10:02:41.308  2021-11-01 10:02:45.159  SELECT 1
3064  postgres  postgres  sample            2021-11-01 10:02:41.307  2021-11-01 10:02:46.358  SELECT 1
3066  postgres  postgres  sample            2021-11-01 10:02:45.640  2021-11-01 10:02:50.739  SELECT 1

LOGFILE: https://github.com/jfrossetto/r2dbc-stress-cancel/blob/master/log_stresscancel.txt

There was another issue opened with the same problem here: #143

@rogtejada
Author

rogtejada commented Nov 3, 2021

@mp911de

Looking into #143, I have cloned the source code posted there: https://github.com/stefannegele/connection-leak-demo

The behavior is the same: the connection is not released after cancellation. I just needed to bump the number of executions from 5 to 10 to achieve the same result.

I have also tested against the new 0.8.8 version and the problem still occurs.

But it seems simpler to reproduce with the code presented in the repository above.

@rogtejada
Author

@mp911de

I was looking over the implementation of the connection pool:
https://github.com/r2dbc/r2dbc-pool/blob/main/src/main/java/io/r2dbc/pool/ConnectionPool.java#:~:text=PoolBuilder%3CConnection%2C%20PoolConfig,Runtime.getRuntime().availableProcessors())%3B

The PoolBuilder has a method to register a releaseHandler. Is it possible to force the implementation of a release method in the SPI Connection interface? That way PostgresqlConnection in r2dbc-postgresql would need to have a release mechanism, and it could be registered on pool creation.

I have tried using commitTransaction as the release handler and it seems that the connection gets released, but a commit will not always be the best approach to release the connection.
I would like to hear your opinion on that matter.

@DenisMd

DenisMd commented Nov 5, 2021

I think we have the same issue in production, on version 0.8.7.RELEASE.

Connection leak as seen in Grafana (screenshot omitted).

Our service sometimes gets closed by gRPC when clients refuse their connections. I'll provide more information about the problem, but the situation is not good :(

Does 0.9.0.M2 solve the problem?

@mp911de
Member

mp911de commented Nov 8, 2021

The PoolBuilder has a method to register a releaseHandler. Is it possible to force the implementation of a release method in the SPI Connection interface? That way PostgresqlConnection in r2dbc-postgresql would need to have a release mechanism, and it could be registered on pool creation.

You can register a customizer through ConnectionPoolConfiguration.Builder.customizer(…) if you want to. Note that releaseHandler only gets called when the connection is released by calling Connection.close().

I'm still wondering about the cause. If you can reliably reproduce the issue, I would ask you to either register a doOnDiscard(…) callback that closes the connection (close + subscribe) or move doFinally around. I think the timeout/onErrorMap/retry operators might interfere with the cancellation, and that the cleanup guards only catch some cancellations but not all of them.
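
For illustration, here is a minimal sketch (not r2dbc-pool's own code) of wiring a release action through that customizer; the class name, connection URL, and the use of commitTransaction() as the release action are only placeholders, the latter being what was experimented with earlier in this thread:

import io.r2dbc.pool.ConnectionPool;
import io.r2dbc.pool.ConnectionPoolConfiguration;
import io.r2dbc.spi.Connection;
import io.r2dbc.spi.ConnectionFactories;
import io.r2dbc.spi.ConnectionFactory;

class ReleaseHandlerSketch {

    static ConnectionPool createPool() {
        // Plain driver connection factory; URL and database are placeholders.
        ConnectionFactory connectionFactory =
                ConnectionFactories.get("r2dbc:postgresql://postgres@localhost/sample");

        ConnectionPoolConfiguration configuration = ConnectionPoolConfiguration
                .builder(connectionFactory)
                // customizer(...) exposes the underlying reactor-pool PoolBuilder,
                // so a releaseHandler can be attached to it.
                .customizer(poolBuilder -> poolBuilder.releaseHandler(Connection::commitTransaction))
                .build();

        return new ConnectionPool(configuration);
    }
}

As noted above, this release path only runs when the connection is actually released via Connection.close(), so on its own it would not cover the cancelled-request case.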

@mp911de
Member

mp911de commented Nov 8, 2021

@DenisMd 0.9 updates to the new R2DBC spec, it's not a rewrite of the pool or the like.

@DenisMd

DenisMd commented Nov 8, 2021

I was able to reproduce the leaks too.
Simple code for a WebFlux request:

log.info("Before 1 sec")
databaseClient.sql("select pg_sleep(1)").await()
log.info("After 1 sec")

Normal log for a single standalone request:

2021-11-08 14:42:24.487 INFO [,2f3d52ee53e5e389,2f3d52ee53e5e389] 13589 --- [ctor-http-nio-6] a.p.s.core.service.SdiService$Companion : Before 1 sec
2021-11-08 14:42:24.487 DEBUG [,2f3d52ee53e5e389,2f3d52ee53e5e389] 13589 --- [ctor-http-nio-6] io.r2dbc.pool.ConnectionPool : Obtaining new connection from the driver
2021-11-08 14:42:24.487 DEBUG [,2f3d52ee53e5e389,2f3d52ee53e5e389] 13589 --- [ctor-http-nio-6] io.r2dbc.postgresql.QUERY : Executing query: select pg_sleep(1)
2021-11-08 14:42:25.590 DEBUG [,2f3d52ee53e5e389,2f3d52ee53e5e389] 13589 --- [actor-tcp-nio-1] io.r2dbc.pool.PooledConnection : Releasing connection
2021-11-08 14:42:25.590 INFO [,2f3d52ee53e5e389,2f3d52ee53e5e389] 13589 --- [actor-tcp-nio-1] a.p.s.core.service.SdiService$Companion : After 1 sec

Then I launched a JMeter load scenario (100 requests in parallel and a 3.5 s timeout for the response).

After completion, the Prometheus metric was:
r2dbc_pool_acquired_connections{name="connectionFactory",} 5.0

and it never went down.

Build configuration:

plugins {
    id("application")
    id("org.springframework.boot") version "2.5.1"
    id("io.spring.dependency-management") version "1.0.11.RELEASE"
    kotlin("jvm") version "1.5.10"
    kotlin("plugin.spring") version "1.5.10"
    id("org.jlleitschuh.gradle.ktlint") version "10.1.0"
    id("org.jetbrains.kotlin.kapt") version "1.5.10"
}

dependencies {
    implementation("org.springframework.boot:spring-boot-starter-actuator")
    implementation("org.springframework.boot:spring-boot-starter-webflux")

    // Database r2dbc
    implementation("org.springframework.boot:spring-boot-starter-data-r2dbc")
    runtimeOnly("io.r2dbc:r2dbc-postgresql")
}
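
Outside of WebFlux and JMeter, the same cancellation pattern can be sketched in plain Java against the pool; this is only a rough reproduction sketch, assuming a ConnectionPool instance named pool (for example one built as in the earlier sketch):

import java.time.Duration;

import io.r2dbc.pool.ConnectionPool;
import io.r2dbc.spi.Connection;
import reactor.core.Disposable;
import reactor.core.publisher.Mono;

class CancelMidQuerySketch {

    static void runOnce(ConnectionPool pool) {
        // Acquire a connection, run the slow query, and release it via close() when done.
        Disposable inFlight = Mono.usingWhen(
                        pool.create(),
                        connection -> Mono.from(connection.createStatement("select pg_sleep(1)").execute())
                                .flatMap(result -> Mono.from(result.getRowsUpdated())),
                        Connection::close)
                .subscribe();

        // Cancel well before the query finishes, which is effectively what the
        // client-side timeout does; with the leak present, the acquired-connections
        // metric can stay elevated afterwards.
        Mono.delay(Duration.ofMillis(200)).subscribe(tick -> inFlight.dispose());
    }
}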

@DenisMd

DenisMd commented Nov 8, 2021

Interesting case: if I execute the SQL request in a transaction, there are no leaks.

log.info("Before 1 sec")
transactionalOperator.executeAndAwait {
    databaseClient.sql("select pg_sleep(1)").await()
}
log.info("After 1 sec")

@mp911de mp911de added the type: bug A general bug label Nov 10, 2021
@mp911de
Member

mp911de commented Nov 10, 2021

Thanks for your reproducers. I'm finally able to reproduce the issue although it requires multiple attempts. Trying to figure out what's happening and where connections get lost.

mp911de added a commit that referenced this issue Nov 10, 2021
mp911de added a commit that referenced this issue Nov 10, 2021
@mp911de
Member

mp911de commented Nov 10, 2021

Refining the allocation and cancellation guards indeed helped to reduce (fix?) the number of test runs in which connections were lost in the connection pool. While tracing code paths that might skip connection release, I found that Flux.usingWhen(…) seems to see a connection that is emitted but not released. I need to analyze this further.
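
A self-contained sketch, with no R2DBC types involved, of the kind of situation described above: cancel a usingWhen pipeline right around the moment the resource is emitted and check whether the asynchronous cleanup ever ran. Whether the counter actually stays at 1 depends on timing and on the Reactor version, so this is an illustration rather than a deterministic test:

import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;

import reactor.core.Disposable;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

class UsingWhenCancelSketch {

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger open = new AtomicInteger();

        Flux<String> guarded = Flux.usingWhen(
                // "acquire" the resource
                Mono.fromCallable(() -> { open.incrementAndGet(); return "resource"; }),
                // use it slowly enough that cancellation can land mid-flight
                resource -> Flux.just(resource).delayElements(Duration.ofMillis(50)),
                // asynchronous cleanup that should also run on cancellation
                resource -> Mono.fromRunnable(open::decrementAndGet));

        Disposable subscription = guarded.subscribe();
        subscription.dispose(); // cancel almost immediately

        Thread.sleep(200);
        // If the cleanup was skipped, this still prints 1, i.e. the resource leaked.
        System.out.println("open resources: " + open.get());
    }
}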

@mp911de
Member

mp911de commented Nov 10, 2021

I reported the issue back to the Reactor team to get their input.

@rogtejada
Author

@mp911de We could still reproduce the issue even with the recent changes to r2dbc-pool. But we agree that the problem may lie in the Reactor implementation. We will investigate on our side; any new discovery will be posted here.

@jfrossetto

jfrossetto commented Nov 11, 2021

@mp911de
in our tests we added the following code around doFinally:

                    return Operators.discardOnCancel(conn, () -> {
                        logger.debug("release discardOnCancel");
                        ref.release().subscribe();
                        return false;
                    })
                            .doFinally(s -> {
                                logger.debug("doFinally conn {}", s.toString());
                                if (!connection.isClosed()) {
                                    logger.debug("close onfinally");
                                    connection.close().subscribe();
                                }

                            });
                })

and with this all connections were released ... but we don't know if this is the best solution.

@mp911de
Member

mp911de commented Nov 11, 2021

We could still reproduce the issue even with the recent changes to r2dbc-pool

Please note that we cannot treat the problem as solved unless the leak in usingWhen is fixed. That lies outside of R2DBC Pool.

@rogtejada
Author

rogtejada commented Nov 12, 2021

@mp911de

This approach seems like a good idea; the connections stop getting stuck, but I'm not sure if it closes the connection prematurely.
With this change, 3 unit tests fail:

shouldReusePooledConnectionAfterTimeout() -> no exception signal is emitted
shouldReportMetrics() -> it expects acquiredSize to be one, but it is zero since the connection is already closed
shouldCreateMultipleConnections() -> it expects two connections, but one of them is already closed

Even though the behaviour seems correct, the unit tests show that somehow the connection is being closed when it should still be open. @mp911de, do you have any concerns about that? I would like to hear your opinion on that matter.

This is not a final solution but could be a workaround until the Reactor problem is solved.


@figroc

figroc commented Dec 15, 2021

We could still reproduce the issue even with the recent changes to r2dbc-pool

Please note that we cannot treat the problem as solved unless the leak in usingWhen is fixed. That lies outside of R2DBC Pool.

Is this usingWhen used in r2dbc-proxy? I can't find any usingWhen in r2dbc-pool.

@Ernir

Ernir commented Dec 27, 2021

Commenting for increased visibility: it looks like the reactor-core team has a potential solution, see reactor/reactor-core#2859.

If you can reproduce the issue locally, confirmation that the fix works would be appreciated. See reactor/reactor-core#2836 (comment).

@rogtejada
Author

@Ernir @mp911de

The change made in reactor-core seems to fix this issue. However, when running a lot of simultaneous requests (50k+), an error shows up:

Error starting response. Replying error status io.r2dbc.postgresql.client.ReactorNettyClient$RequestQueueException: Cannot exchange messages because the request queue limit is exceeded

I guess it is just a limitation of a single instance processing a lot of requests while trying to simulate the lock behavior; in a real application these requests would be spread across multiple instances and the queue limit would be enough, am I right? Are there any concerns we should address regarding this queue limitation?

@figroc

figroc commented Dec 31, 2021

@rogtejada Have you published a snapshot? We would like to test it in our project.

@rogtejada
Author

rogtejada commented Dec 31, 2021

@figroc No. An issue was opened on Project Reactor (reactor/reactor-core#2836) and a solution was proposed in this PR (reactor/reactor-core#2859). I just built that branch locally and used the resulting artifact to run my tests.

@mp911de mp911de added for: external-project For an external project and not something we can fix and removed type: bug A general bug labels Jan 4, 2022
@mp911de
Member

mp911de commented Jan 4, 2022

I can confirm that upgrading to Reactor Core 3.4.14 snapshots fixes the issue.

@mp911de mp911de closed this as completed Jan 4, 2022