-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove broken panic handler #17354
Remove broken panic handler #17354
Conversation
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #17354 +/- ##
=======================================
Coverage 67.46% 67.46%
=======================================
Files 1581 1581
Lines 253934 253934
=======================================
+ Hits 171308 171313 +5
+ Misses 82626 82621 -5 ☔ View full report in Codecov by Sentry. |
@@ -525,7 +524,6 @@ func launchRecoveryTablet(t *testing.T, tablet *cluster.Vttablet, binlogServer * | |||
tablet.MysqlctlProcess = *mysqlctlProcess | |||
extraArgs := []string{"--db-credentials-file", dbCredentialFile} | |||
tablet.MysqlctlProcess.InitDBFile = initDBFileWithPassword | |||
tablet.VttabletProcess.DbPassword = mysqlPassword |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shlomi-noach @mattlord @rohit-nayak-ps This line was crashing this test, so we never actually ran it and it was always passing because of the now removed defer cluster.PanicHandler(nil)
call. Looks like the test might now be failing, but it means it has likely been failing forever / for a really long time.
Any ideas on how to fix this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's failing here:
func testTabletRecovery(t *testing.T, binlogServer *binLogServer, lookupTimeout, restoreKeyspaceName, shardName, expectedRows string) {
recoveryTablet := clusterInstance.NewVttabletInstance("replica", 0, cell)
launchRecoveryTablet(t, recoveryTablet, binlogServer, lookupTimeout, restoreKeyspaceName, shardName)
sqlRes, err := recoveryTablet.VttabletProcess.QueryTablet(getCountID, keyspaceName, true)
require.NoError(t, err)
assert.Equal(t, expectedRows, sqlRes.Rows[0][0].String())
So my first guess would be that we need to wait for replication to catch up and for the expected result. Currently all it's doing is waiting for the tablet to become SERVING and then it queries it. Perhaps the tablet needs a second to replicate things. I would first just try adding a sleep of 30 seconds or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tests the old-and-unsupported-and-actually-legacy Google Ripple binlog server based PITR.
- RFC: Backup/restore: remove legacy binlog-server based PITR code #16673
- https://vitess.io/docs/22.0/reference/features/recovery/#point-in-time-recovery-legacy-functionality-based-on-binlog-server
We should stop running this test altogether. Now is as good a time as ever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dbussink I've pushed a change to remove the test from running in shard 10
. For now this should fix this PR. Later on (on a different PR) I will purge the entire codebase and tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't the tests supposed to fail anyway if there is a panic? Seems like cluster.PanicHandler
should make the test fail if we recover a panic since the err
wouldn't be nil
:
vitess/go/test/endtoend/cluster/cluster_util.go
Lines 130 to 136 in 9b71606
func PanicHandler(t testing.TB) { | |
err := recover() | |
if t == nil { | |
return | |
} | |
require.Nilf(t, err, "panic occured in testcase %v", t.Name()) | |
} |
It doesn't. In all the cases removed here, So hence the test passes when this is used if the test panics. |
This handler would recover any arbitrary panic and make the test pass. This is a big anti-pattern since it means we can have many failing tests we don't know about and things are silently broken potentially. Running against CI to test the fallout of this. Signed-off-by: Dirkjan Bussink <[email protected]>
This test was never running since any panic would make a test pass before removing the recovery handler. Signed-off-by: Dirkjan Bussink <[email protected]>
vitessio#16673 Signed-off-by: Shlomi Noach <[email protected]>
When a test panics, it's way more useful to see the actual backtrace for the panic and not try to recover anything that hides that information. Signed-off-by: Dirkjan Bussink <[email protected]>
826c780
to
3e175b7
Compare
Backporting this too since it can be hiding legitimate test failures / problems and we don't want to have those crop up in release branches either then. |
Signed-off-by: Dirkjan Bussink <[email protected]> Signed-off-by: Shlomi Noach <[email protected]> Co-authored-by: Shlomi Noach <[email protected]>
Signed-off-by: Dirkjan Bussink <[email protected]> Signed-off-by: Shlomi Noach <[email protected]> Co-authored-by: Shlomi Noach <[email protected]> Signed-off-by: Dirkjan Bussink <[email protected]>
Signed-off-by: Dirkjan Bussink <[email protected]> Signed-off-by: Shlomi Noach <[email protected]> Co-authored-by: Shlomi Noach <[email protected]> Signed-off-by: Dirkjan Bussink <[email protected]>
Signed-off-by: Dirkjan Bussink <[email protected]> Signed-off-by: Shlomi Noach <[email protected]> Co-authored-by: Shlomi Noach <[email protected]> Signed-off-by: Dirkjan Bussink <[email protected]>
Signed-off-by: Dirkjan Bussink <[email protected]> Signed-off-by: Shlomi Noach <[email protected]> Co-authored-by: Shlomi Noach <[email protected]> Signed-off-by: Dirkjan Bussink <[email protected]>
Signed-off-by: Dirkjan Bussink <[email protected]> Signed-off-by: Shlomi Noach <[email protected]> Co-authored-by: Shlomi Noach <[email protected]> Signed-off-by: Dirkjan Bussink <[email protected]>
Signed-off-by: Dirkjan Bussink <[email protected]> Signed-off-by: Shlomi Noach <[email protected]> Signed-off-by: Harshit Gangal <[email protected]> Co-authored-by: Dirkjan Bussink <[email protected]> Co-authored-by: Shlomi Noach <[email protected]> Co-authored-by: Harshit Gangal <[email protected]>
Signed-off-by: Dirkjan Bussink <[email protected]> Signed-off-by: Shlomi Noach <[email protected]> Co-authored-by: Dirkjan Bussink <[email protected]> Co-authored-by: Shlomi Noach <[email protected]>
Signed-off-by: Dirkjan Bussink <[email protected]> Signed-off-by: Shlomi Noach <[email protected]> Co-authored-by: Dirkjan Bussink <[email protected]> Co-authored-by: Shlomi Noach <[email protected]>
This handler would recover any arbitrary panic and make the test pass. This is a big anti-pattern since it means we can have many failing tests we don't know about and things are silently broken potentially.
Running against CI to test the fallout of this.
Checklist