-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug fix: Use target tablet from health stats cache when checking replication status #14436
Bug fix: Use target tablet from health stats cache when checking replication status #14436
Conversation
…tatus Signed-off-by: Austen Lacy <[email protected]>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
…blets changing type Signed-off-by: Austen Lacy <[email protected]>
126dd17
to
f4920fe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thanks, @austenLacy ! ❤️
I only had a couple of minor comments -- let me know what you think. Have you already run the new test locally in a loop to check for flakiness?
I can come back to this quickly when I hear back.
func TestVtgateReplicationStatusCheckWithTabletTypeChange(t *testing.T) { | ||
defer cluster.PanicHandler(t) | ||
// Healthcheck interval on tablet is set to 1s, so sleep for 2s | ||
time.Sleep(2 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is likely to be flaky in the CI due to unpredictable performance and occasional machine pauses. Any reason not to use the variable value (if we can access it) * 10 or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied this from the other tests in this file. I can try running the test in a loop to see if it has any flakiness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran it 10 times in a loop 2-3 times and didn't see any flakiness.
for i in {1..10}; do go test -count=1 -timeout 30s -run ^TestVtgateReplicationStatusCheckWithTabletTypeChange$ vitess.io/vitess/go/test/endtoend/tabletgateway; sleep 1; done
ok vitess.io/vitess/go/test/endtoend/tabletgateway 23.902s
ok vitess.io/vitess/go/test/endtoend/tabletgateway 25.247s
ok vitess.io/vitess/go/test/endtoend/tabletgateway 23.130s
ok vitess.io/vitess/go/test/endtoend/tabletgateway 23.091s
ok vitess.io/vitess/go/test/endtoend/tabletgateway 25.394s
ok vitess.io/vitess/go/test/endtoend/tabletgateway 25.882s
ok vitess.io/vitess/go/test/endtoend/tabletgateway 23.308s
ok vitess.io/vitess/go/test/endtoend/tabletgateway 23.246s
ok vitess.io/vitess/go/test/endtoend/tabletgateway 25.528s
ok vitess.io/vitess/go/test/endtoend/tabletgateway 25.797s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found one more issue that I think we should address after having looked at other usage of TabletsStats
.
@deepthi could you please review this when you have time? No rush, but you know the healthcheck cache related code the best. Thanks!
@@ -901,7 +901,7 @@ func (e *Executor) showVitessReplicationStatus(ctx context.Context, filter *sqlp | |||
for _, s := range status { | |||
for _, ts := range s.TabletsStats { | |||
// We only want to show REPLICA and RDONLY tablets | |||
if ts.Tablet.Type != topodatapb.TabletType_REPLICA && ts.Tablet.Type != topodatapb.TabletType_RDONLY { | |||
if ts.Target.TabletType != topodatapb.TabletType_REPLICA && ts.Target.TabletType != topodatapb.TabletType_RDONLY { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did confirm that Target.TabletType
is what's being used elsewhere when examining/showing the type for tablets in the healthcheck cache. So this part seems right.
Following on from that, I think that we should change all usage of ts.Tablet.*
in this function to ts.Target.*
. For example, below we should change ts.Tablet.Keyspace
to ts.Target.Keyspace
and ts.Tablet.Shard
to ts.Target.Shard
.
…r#showVitessReplicationStatus Signed-off-by: Austen Lacy <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @austenLacy ! ❤️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work on tests!
@@ -901,14 +901,14 @@ func (e *Executor) showVitessReplicationStatus(ctx context.Context, filter *sqlp | |||
for _, s := range status { | |||
for _, ts := range s.TabletsStats { | |||
// We only want to show REPLICA and RDONLY tablets | |||
if ts.Tablet.Type != topodatapb.TabletType_REPLICA && ts.Tablet.Type != topodatapb.TabletType_RDONLY { | |||
if ts.Target.TabletType != topodatapb.TabletType_REPLICA && ts.Target.TabletType != topodatapb.TabletType_RDONLY { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually worrisome that these two fields are not in sync. That is something we'll need to go back and review. Doesn't need to block this PR though.
…ication status (#14436) Signed-off-by: Austen Lacy <[email protected]> Co-authored-by: Austen Lacy <[email protected]>
…ication status (vitessio#14436) Signed-off-by: Austen Lacy <[email protected]> Co-authored-by: Austen Lacy <[email protected]>
…ication status (vitessio#14436) (#128) Signed-off-by: Austen Lacy <[email protected]> Co-authored-by: Austen Lacy <[email protected]>
…n checking replication status (#14436) (#14456) Signed-off-by: Austen Lacy <[email protected]> Signed-off-by: Harshit Gangal <[email protected]> Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com> Co-authored-by: Austen Lacy <[email protected]> Co-authored-by: Harshit Gangal <[email protected]>
…ication status (vitessio#14436) (#128) Signed-off-by: Austen Lacy <[email protected]> Co-authored-by: Austen Lacy <[email protected]>
…ication status (vitessio#14436) (#128) Signed-off-by: Austen Lacy <[email protected]> Co-authored-by: Austen Lacy <[email protected]>
…ication status (vitessio#14436) (#128) Signed-off-by: Austen Lacy <[email protected]> Co-authored-by: Austen Lacy <[email protected]>
…ication status (vitessio#14436) (#128) Signed-off-by: Austen Lacy <[email protected]> Co-authored-by: Austen Lacy <[email protected]> (cherry picked from commit f757ff2)
…ication status (vitessio#14436) (#128) Signed-off-by: Austen Lacy <[email protected]> Co-authored-by: Austen Lacy <[email protected]> (cherry picked from commit f757ff2) (cherry picked from commit 7fabb2d)
…ication status (vitessio#14436) (#128) Signed-off-by: Austen Lacy <[email protected]> Co-authored-by: Austen Lacy <[email protected]> (cherry picked from commit f757ff2) (cherry picked from commit 7fabb2d) (cherry picked from commit a5cc396)
Description
When determining which tablets to show in
show vitess_replication_status
Vitess is excluding tablets if their states were changed. This now uses the target tablet from the heath check cache which represents the expected state of the tablets.Related Issue(s)
REPLICA
does not update results ofSHOW vitess_replication_status
#14432Checklist
Deployment Notes