
agent: enable TCP keepalive on database connections #1676

Closed
psFried opened this issue Oct 3, 2024 · 4 comments
Labels
chore (Internal cleanups, refactoring, or improvements), control-plane

Comments

@psFried
Member

psFried commented Oct 3, 2024

We've seen an issue where the database is waiting on the agent to send prepared statement parameters, which never arrive. While it's not clear what exactly caused this, enabling TCP keepalive might at least reduce the possibility of that sort of thing happening due to a broken connection.

Unfortunately, sqlx doesn't currently have a way to enable keepalive, so we'll need to wait until launchbadge/sqlx#3540 is resolved before we can address this.

psFried added the chore (Internal cleanups, refactoring, or improvements) and control-plane labels on Oct 3, 2024
@abonander

> We've seen an issue where the database is waiting on the agent to send prepared statement parameters, which never arrive.

You didn't mention this in the linked issue.

This may be due to cancellation. The async methods on PgConnection are generally not cancel-safe, so if a call is canceled in the middle of preparing a statement, it won't know to resume the process.

It has some ability to recover, but it depends a lot on where in the process it's canceled.

I can glance through your code and see if there are any obvious sources of cancellation issues.

@abonander

abonander commented Oct 3, 2024

I see you're using Axum. It's great, but one thing to note is that it does cancel the task if the client disconnects before the response is returned. This is the expected behavior, and actually happens inside Hyper, not Axum itself.

If this is the cause, and the connection is left in a state where the database is stuck waiting for a message that will never come, it'll stall at this Connection::ping() call in the task that checks the connection before returning it to Pool: https://github.com/launchbadge/sqlx/blob/19f40d87a669e41081a629a469359d341c9223d6/sqlx-core/src/pool/connection.rs#L314

This would result in connections being effectively leaked from the pool, and future calls returning Error::PoolTimedOut. Come to think of it, this is probably the major source of that error for a lot of people.

The thing is, TCP keepalive wouldn't save you here anyway. That's handled entirely at the transport layer, completely hidden from the application. If the database is stalled on a read but the server's kernel is still properly managing the socket, it'll answer the keepalive messages and the connection won't time out.

What we need to do is wrap that ping() call, or the whole return_to_pool() call, in a timeout() call that prevents a stall from permanently leaking the connection.

In the meantime, you could probably mitigate this by setting an after_release callback that just does connection.ping() wrapped in tokio::time::timeout(). If the connection is stalled, it'll catch it. Returning Err (or Ok(false) if you don't want the error to be logged) from the callback causes the connection to be immediately closed.
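A minimal sketch of that mitigation, assuming sqlx's `PgPoolOptions::after_release` hook and tokio as the runtime; the connection URL and the 2-second timeout are placeholder values for illustration, not recommendations from this thread:

```rust
use std::time::Duration;

use sqlx::postgres::PgPoolOptions;
use sqlx::Connection;

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    let pool = PgPoolOptions::new()
        .after_release(|conn, _meta| {
            Box::pin(async move {
                // If the connection is stalled mid-protocol, ping() will hang
                // forever; the timeout converts that hang into Ok(false),
                // which tells the pool to close the connection rather than
                // hand it back out.
                match tokio::time::timeout(Duration::from_secs(2), conn.ping()).await {
                    Ok(Ok(())) => Ok(true), // healthy: keep it in the pool
                    _ => Ok(false),         // timed out or errored: discard it
                }
            })
        })
        .connect("postgres://localhost/mydb") // placeholder URL
        .await?;

    // ... use `pool` as usual ...
    drop(pool);
    Ok(())
}
```

Returning `Ok(false)` (rather than `Err`) discards the connection without logging an error, matching the behavior described above.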

@abonander

My ultimate plan is to fix cancellation issues for good by just having the connection state machine run on a background task. This is what a ton of other crates do.

We originally didn't want it to work that way because we thought it might be more efficient for the connection logic to execute directly on the calling task, but it didn't really turn out that way.

@psFried
Member Author

psFried commented Oct 21, 2024

@abonander thanks for your comments and help here. We had some confusion about the query in pg_stat_activity, which made us think that the database was waiting on the client. But in fact the query column just shows the last query that was run, even after it has completed. So the specific issue we were seeing turned out to have nothing to do with the database or sqlx at all.

So I'm closing this issue for now, since I don't see a specific need for TCP keepalive. We'd definitely enable it if keepalive ends up being supported in sqlx, but it no longer seems at all urgent.
