Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Close zookeeper topo connection on disconnect #17136

Merged
merged 1 commit into from
Nov 8, 2024

Conversation

shanth96
Copy link
Contributor

@shanth96 shanth96 commented Nov 4, 2024

Description

This PR fixes a bug in the zookeeper topo logic that causes it to leak connections/memory during network partitions from zookeeper. As mentioned in the issue, this is due to the zookeeper connection library automatically re-opening connections under the hood, even after a disconnect. To fix this, this PR calls conn.Close() whenever there is a disconnect.

This fix has been tested extensively in production in Shopify. A side effect of this fix is that it leads to a query latency regression when topo server is down as mentioned in #9147 unless the srv_topo_cache_ttl flag is set to a high value. This isn't the case currently because the SrvKeyspace watch established here never terminates because the watch channel returned by the zk client is not invalidated on EOF error auth failure (source), and the zk.Conn is never closed (that was the bug). So the watch continues to use an old channel from a leaked zk.Conn and vitess just assumes the watch is healthy and returns the cached value here

We should likely backport this up to v18 since it can lead to OOMkills and place additional load on zookeeper.

Related Issue(s)

Fixes #17076

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

Copy link
Contributor

vitess-bot bot commented Nov 4, 2024

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Nov 4, 2024
@github-actions github-actions bot added this to the v22.0.0 milestone Nov 4, 2024
@mattlord mattlord added Type: Bug Component: Topology and removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Nov 4, 2024
Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @shanth96 ! ❤️ We should update the unit test as noted, but otherwise it looks good. Hopefully my comments there make sense?

I'm OK backporting this, what do you think @deepthi ?

go/vt/topo/zk2topo/zk_conn_test.go Outdated Show resolved Hide resolved
go/vt/topo/zk2topo/zk_conn_test.go Outdated Show resolved Hide resolved
"vitess.io/vitess/go/vt/zkctl"
)

func TestZkConnClosedOnDisconnect(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed that this fails on main:

--- FAIL: TestZkConnClosedOnDisconnect (9.19s)
    /Users/matt/git/vitess/go/vt/topo/zk2topo/zk_conn_test.go:60: 
        	Error Trace:	/Users/matt/git/vitess/go/vt/topo/zk2topo/zk_conn_test.go:60
        	Error:      	An error is expected but got nil.
        	Test:       	TestZkConnClosedOnDisconnect
FAIL
FAIL	vitess.io/vitess/go/vt/topo/zk2topo	9.901s

@mattlord
Copy link
Contributor

mattlord commented Nov 5, 2024

@shanth96 one other thing, we need to fix the DCO: https://github.com/vitessio/vitess/pull/17136/checks?check_run_id=32492831232

In your case there's just one commit so it's just:

git commit -a -s --amend
git push --force-with-lease 

@shanth96 shanth96 force-pushed the fix-zk-conn-leak-bug branch from f6d4f74 to 137bec3 Compare November 5, 2024 14:45
@shanth96
Copy link
Contributor Author

shanth96 commented Nov 5, 2024

We should update the unit test as noted, but otherwise it looks good. Hopefully my comments there make sense?

Comments make sense. Fixed the tests and DCO

Copy link

codecov bot commented Nov 5, 2024

Codecov Report

Attention: Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 67.42%. Comparing base (f40e076) to head (137bec3).
Report is 83 commits behind head on main.

Files with missing lines Patch % Lines
go/vt/topo/zk2topo/zk_conn.go 50.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #17136      +/-   ##
==========================================
- Coverage   69.43%   67.42%   -2.01%     
==========================================
  Files        1570     1569       -1     
  Lines      203812   252116   +48304     
==========================================
+ Hits       141517   169999   +28482     
- Misses      62295    82117   +19822     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@shanth96
Copy link
Contributor Author

shanth96 commented Nov 8, 2024

@mattlord is this good to merge?

@deepthi deepthi added Backport to: release-19.0 Needs to be back ported to release-19.0 Backport to: release-20.0 Needs to be backport to release-20.0 Backport to: release-21.0 Needs to be backport to release-21.0 labels Nov 8, 2024
@deepthi
Copy link
Member

deepthi commented Nov 8, 2024

Backports are reasonable given production impact.

@deepthi deepthi merged commit 8af57c0 into vitessio:main Nov 8, 2024
101 checks passed
vitess-bot pushed a commit that referenced this pull request Nov 8, 2024
vitess-bot pushed a commit that referenced this pull request Nov 8, 2024
vitess-bot pushed a commit that referenced this pull request Nov 8, 2024
mattlord pushed a commit that referenced this pull request Nov 9, 2024
…#17191)

Signed-off-by: shanth96 <[email protected]>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
mattlord pushed a commit that referenced this pull request Nov 9, 2024
…#17193)

Signed-off-by: shanth96 <[email protected]>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
@mattlord
Copy link
Contributor

mattlord commented Nov 9, 2024

@shanth96 the new unit test from this PR has quickly proven to be problematic in the CI. You can see the backport here for example: #17192

In investigating, the test also fails virtually every time for me locally as well. Does it pass for you locally? Perhaps there's some environment issues in play. If I can't address it soon then we'll have to skip it in the CI at least temporarily.

@mattlord
Copy link
Contributor

mattlord commented Nov 9, 2024

Turns out to be a timing issue. It passes every time for me with this patch:

diff --git a/go/vt/topo/zk2topo/zk_conn_test.go b/go/vt/topo/zk2topo/zk_conn_test.go
index b0b94c0707..e79987b562 100644
--- a/go/vt/topo/zk2topo/zk_conn_test.go
+++ b/go/vt/topo/zk2topo/zk_conn_test.go
@@ -19,6 +19,7 @@ package zk2topo
 import (
 	"context"
 	"testing"
+	"time"
 
 	"github.com/stretchr/testify/require"
 	"github.com/z-division/go-zookeeper/zk"
@@ -42,12 +43,16 @@ func TestZkConnClosedOnDisconnect(t *testing.T) {
 	oldConn := conn.conn
 
 	// force a disconnect
-	zkd.Shutdown()
-	zkd.Start()
+	err = zkd.Shutdown()
+	require.NoError(t, err)
+	err = zkd.Start()
+	require.NoError(t, err)
 
 	// do another get to trigger a new connection
-	_, _, err = conn.Get(context.Background(), "/")
-	require.NoError(t, err, "Get() failed")
+	require.Eventually(t, func() bool {
+		_, _, err = conn.Get(context.Background(), "/")
+		return err == nil
+	}, 10*time.Second, 100*time.Millisecond)
 
 	// Check that old connection is closed
 	_, _, err = oldConn.Get("/")

I'll get that merged quickly on main and backported as well.

mattlord pushed a commit that referenced this pull request Nov 9, 2024
…#17192)

Signed-off-by: shanth96 <[email protected]>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
@shanth96
Copy link
Contributor Author

@shanth96 the new unit test from this PR has quickly proven to be problematic in the CI. You can see the backport here for example: #17192

In investigating, the test also fails virtually every time for me locally as well. Does it pass for you locally? Perhaps there's some environment issues in play. If I can't address it soon then we'll have to skip it in the CI at least temporarily.

@mattlord thank you for taking a look and fixing it.

rvrangel pushed a commit to rvrangel/vitess that referenced this pull request Nov 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Backport to: release-19.0 Needs to be back ported to release-19.0 Backport to: release-20.0 Needs to be backport to release-20.0 Backport to: release-21.0 Needs to be backport to release-21.0 Component: Topology Type: Bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug Report: zookeeper topo connection leak
4 participants