Bug fix: Use target tablet from health stats cache when checking replication status #14436

Merged
Changes from 2 commits
9 changes: 9 additions & 0 deletions go/test/endtoend/cluster/cluster_process.go
@@ -985,6 +985,15 @@ func (cluster *LocalProcessCluster) VtctlclientGetTablet(tablet *Vttablet) (*top
	return &ti, nil
}

func (cluster *LocalProcessCluster) VtctlclientChangeTabletType(tablet *Vttablet, tabletType topodatapb.TabletType) error {
	_, err := cluster.VtctlclientProcess.ExecuteCommandWithOutput("ChangeTabletType", "--", tablet.Alias, tabletType.String())
	if err != nil {
		return err
	}

	return nil
}

// Teardown brings down the cluster by invoking teardown for individual processes
func (cluster *LocalProcessCluster) Teardown() {
	PanicHandler(nil)
29 changes: 29 additions & 0 deletions go/test/endtoend/tabletgateway/vtgate_test.go
@@ -29,6 +29,7 @@ import (
	"time"

	"vitess.io/vitess/go/test/endtoend/utils"
	"vitess.io/vitess/go/vt/proto/topodata"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
@@ -69,6 +70,34 @@ func TestVtgateReplicationStatusCheck(t *testing.T) {
	assert.Equal(t, expectNumRows, numRows, fmt.Sprintf("wrong number of results from show vitess_replication_status. Expected %d, got %d", expectNumRows, numRows))
}

func TestVtgateReplicationStatusCheckWithTabletTypeChange(t *testing.T) {
	defer cluster.PanicHandler(t)
	// Healthcheck interval on tablet is set to 1s, so sleep for 2s
	time.Sleep(2 * time.Second)

@mattlord (Contributor), Nov 2, 2023:

This is likely to be flaky in the CI due to unpredictable performance and occasional machine pauses. Any reason not to use the variable value (if we can access it) * 10 or something?

Contributor Author:

I copied this from the other tests in this file. I can try running the test in a loop to see if it has any flakiness.

Contributor Author:

I ran it 10 times in a loop 2-3 times and didn't see any flakiness.

for i in {1..10}; do go test -count=1 -timeout 30s -run ^TestVtgateReplicationStatusCheckWithTabletTypeChange$ vitess.io/vitess/go/test/endtoend/tabletgateway; sleep 1; done
ok      vitess.io/vitess/go/test/endtoend/tabletgateway 23.902s
ok      vitess.io/vitess/go/test/endtoend/tabletgateway 25.247s
ok      vitess.io/vitess/go/test/endtoend/tabletgateway 23.130s
ok      vitess.io/vitess/go/test/endtoend/tabletgateway 23.091s
ok      vitess.io/vitess/go/test/endtoend/tabletgateway 25.394s
ok      vitess.io/vitess/go/test/endtoend/tabletgateway 25.882s
ok      vitess.io/vitess/go/test/endtoend/tabletgateway 23.308s
ok      vitess.io/vitess/go/test/endtoend/tabletgateway 23.246s
ok      vitess.io/vitess/go/test/endtoend/tabletgateway 25.528s
ok      vitess.io/vitess/go/test/endtoend/tabletgateway 25.797s

	verifyVtgateVariables(t, clusterInstance.VtgateProcess.VerifyURL)
	ctx := context.Background()
	conn, err := mysql.Connect(ctx, &vtParams)
	require.NoError(t, err)
	defer conn.Close()

	// Only returns rows for REPLICA and RDONLY tablets -- so should be 2 of them
	qr := utils.Exec(t, conn, "show vitess_replication_status like '%'")
	expectNumRows := 2
	numRows := len(qr.Rows)
	assert.Equal(t, expectNumRows, numRows, fmt.Sprintf("wrong number of results from show vitess_replication_status. Expected %d, got %d", expectNumRows, numRows))

	// change the RDONLY tablet to SPARE
	rdOnlyTablet := clusterInstance.Keyspaces[0].Shards[0].Rdonly()
	err = clusterInstance.VtctlclientChangeTabletType(rdOnlyTablet, topodata.TabletType_SPARE)
	require.NoError(t, err)

	// Only returns rows for REPLICA and RDONLY tablets -- so should be 1 of them since we updated 1 to spare
	qr = utils.Exec(t, conn, "show vitess_replication_status like '%'")
	expectNumRows = 1
	numRows = len(qr.Rows)
	assert.Equal(t, expectNumRows, numRows, fmt.Sprintf("wrong number of results from show vitess_replication_status. Expected %d, got %d", expectNumRows, numRows))
}

func verifyVtgateVariables(t *testing.T, url string) {
	resp, err := http.Get(url)
	require.NoError(t, err)
2 changes: 1 addition & 1 deletion go/vt/vtgate/executor.go
@@ -901,7 +901,7 @@ func (e *Executor) showVitessReplicationStatus(ctx context.Context, filter *sqlp
	for _, s := range status {
		for _, ts := range s.TabletsStats {
			// We only want to show REPLICA and RDONLY tablets
-			if ts.Tablet.Type != topodatapb.TabletType_REPLICA && ts.Tablet.Type != topodatapb.TabletType_RDONLY {
+			if ts.Target.TabletType != topodatapb.TabletType_REPLICA && ts.Target.TabletType != topodatapb.TabletType_RDONLY {

Contributor:

I did confirm that Target.TabletType is what's being used elsewhere when examining/showing the type for tablets in the healthcheck cache. So this part seems right.

Following on from that, I think that we should change all usage of ts.Tablet.* in this function to ts.Target.*. For example, below we should change ts.Tablet.Keyspace to ts.Target.Keyspace and ts.Tablet.Shard to ts.Target.Shard.

Member:

It's actually worrisome that these two fields are not in sync. That is something we'll need to go back and review. Doesn't need to block this PR though.

				continue
			}
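
To illustrate why the field used in this filter matters, here is a small self-contained sketch. It assumes the scenario the new end-to-end test exercises: a tablet is switched from RDONLY to SPARE, and the cached tablet record still carries the old type while the health target carries the new one. All types here (`TabletType`, `Tablet`, `Target`, `TabletHealth`) are simplified, hypothetical stand-ins for the real Vitess protobuf and healthcheck types, and the aliases are made up.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the Vitess types; not the real API.
type TabletType int

const (
	Replica TabletType = iota
	Rdonly
	Spare
)

type Tablet struct {
	Alias string
	Type  TabletType // type stored on the cached tablet record
}

type Target struct {
	TabletType TabletType // type reported with the health target
}

type TabletHealth struct {
	Tablet Tablet
	Target Target
}

func isReplicaOrRdonly(t TabletType) bool {
	return t == Replica || t == Rdonly
}

// countByTabletRecord mirrors the filter before the fix (ts.Tablet.Type).
func countByTabletRecord(stats []TabletHealth) int {
	n := 0
	for _, ts := range stats {
		if isReplicaOrRdonly(ts.Tablet.Type) {
			n++
		}
	}
	return n
}

// countByTarget mirrors the filter after the fix (ts.Target.TabletType).
func countByTarget(stats []TabletHealth) int {
	n := 0
	for _, ts := range stats {
		if isReplicaOrRdonly(ts.Target.TabletType) {
			n++
		}
	}
	return n
}

func main() {
	stats := []TabletHealth{
		{Tablet: Tablet{Alias: "zone1-101", Type: Replica}, Target: Target{TabletType: Replica}},
		// Assumed out-of-sync entry: record says RDONLY, target says SPARE.
		{Tablet: Tablet{Alias: "zone1-102", Type: Rdonly}, Target: Target{TabletType: Spare}},
	}
	fmt.Println(countByTabletRecord(stats), countByTarget(stats)) // prints: 2 1
}
```

Under that assumption, filtering on the tablet record still counts the SPARE tablet, while filtering on the target yields the single row the test expects.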
