Close zookeeper topo connection on disconnect #17136
	"vitess.io/vitess/go/vt/zkctl"
)

func TestZkConnClosedOnDisconnect(t *testing.T) {
Confirmed that this fails on main:
--- FAIL: TestZkConnClosedOnDisconnect (9.19s)
/Users/matt/git/vitess/go/vt/topo/zk2topo/zk_conn_test.go:60:
Error Trace: /Users/matt/git/vitess/go/vt/topo/zk2topo/zk_conn_test.go:60
Error: An error is expected but got nil.
Test: TestZkConnClosedOnDisconnect
FAIL
FAIL vitess.io/vitess/go/vt/topo/zk2topo 9.901s
@shanth96 one other thing: we need to fix the DCO check: https://github.com/vitessio/vitess/pull/17136/checks?check_run_id=32492831232 In your case there's just one commit, so it's just:
Signed-off-by: shanth96 <[email protected]>
f6d4f74 to 137bec3
Comments make sense. Fixed the tests and DCO.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #17136      +/-   ##
==========================================
- Coverage   69.43%   67.42%   -2.01%
==========================================
  Files        1570     1569       -1
  Lines      203812   252116   +48304
==========================================
+ Hits       141517   169999   +28482
- Misses      62295    82117   +19822
@mattlord is this good to merge?
Backports are reasonable given production impact.
Signed-off-by: shanth96 <[email protected]>
…#17191) Signed-off-by: shanth96 <[email protected]> Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
…#17193) Signed-off-by: shanth96 <[email protected]> Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
@shanth96 the new unit test from this PR has quickly proven to be problematic in CI. You can see the backport here, for example: #17192. While investigating, I found that the test also fails virtually every time for me locally. Does it pass for you locally? Perhaps there are some environment issues in play. If I can't address it soon, we'll have to skip it in CI, at least temporarily.
Turns out to be a timing issue. It passes every time for me with this patch:
diff --git a/go/vt/topo/zk2topo/zk_conn_test.go b/go/vt/topo/zk2topo/zk_conn_test.go
index b0b94c0707..e79987b562 100644
--- a/go/vt/topo/zk2topo/zk_conn_test.go
+++ b/go/vt/topo/zk2topo/zk_conn_test.go
@@ -19,6 +19,7 @@ package zk2topo
 import (
 	"context"
 	"testing"
+	"time"
 
 	"github.com/stretchr/testify/require"
 	"github.com/z-division/go-zookeeper/zk"
@@ -42,12 +43,16 @@ func TestZkConnClosedOnDisconnect(t *testing.T) {
 	oldConn := conn.conn
 
 	// force a disconnect
-	zkd.Shutdown()
-	zkd.Start()
+	err = zkd.Shutdown()
+	require.NoError(t, err)
+	err = zkd.Start()
+	require.NoError(t, err)
 
 	// do another get to trigger a new connection
-	_, _, err = conn.Get(context.Background(), "/")
-	require.NoError(t, err, "Get() failed")
+	require.Eventually(t, func() bool {
+		_, _, err = conn.Get(context.Background(), "/")
+		return err == nil
+	}, 10*time.Second, 100*time.Millisecond)
 
 	// Check that old connection is closed
 	_, _, err = oldConn.Get("/")
I'll get that merged quickly on main and backported as well.
…#17192) Signed-off-by: shanth96 <[email protected]> Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
@mattlord thank you for taking a look and fixing it.
Signed-off-by: shanth96 <[email protected]> Signed-off-by: Renan Rangel <[email protected]>
Description
This PR fixes a bug in the zookeeper topo logic that causes it to leak connections/memory during network partitions from zookeeper. As mentioned in the issue, this is due to the zookeeper connection library automatically re-opening connections under the hood, even after a disconnect. To fix this, this PR calls conn.Close() whenever there is a disconnect.
This fix has been tested extensively in production at Shopify. A side effect of this fix is that it leads to a query latency regression when the topo server is down, as mentioned in #9147, unless the srv_topo_cache_ttl flag is set to a high value. This isn't the case currently because the SrvKeyspace watch established here never terminates: the watch channel returned by the zk client is not invalidated on EOF error auth failure (source), and the zk.Conn is never closed (that was the bug). So the watch continues to use an old channel from a leaked zk.Conn, and vitess just assumes the watch is healthy and returns the cached value here.
We should likely backport this up to v18 since it can lead to OOMkills and place additional load on zookeeper.
Related Issue(s)
Fixes #17076