-
Notifications
You must be signed in to change notification settings - Fork 218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ktesting: tone down warning about leaked test goroutine #401
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -217,8 +217,19 @@ type tloggerShared struct { | |
// t gets cleared when the test is completed. | ||
t TL | ||
|
||
// We warn once when a leaked goroutine is detected because | ||
// it logs after test completion. | ||
// The time when the test completed. | ||
stopTime time.Time | ||
|
||
// We warn once when a leaked goroutine logs after test completion. | ||
// | ||
// Not terminating immediately is fairly normal: many controllers | ||
// log "terminating" messages while they shut down, which often is | ||
// right after test completion, if that is when the test cancels the | ||
// context and then doesn't wait for goroutines (which is often | ||
// not possible). | ||
// | ||
// Therefore there is the [stopGracePeriod] during which messages get | ||
// logged to the global logger without the warning. | ||
goroutineWarningDone bool | ||
|
||
formatter serialize.Formatter | ||
|
@@ -228,10 +239,15 @@ type tloggerShared struct { | |
callDepth int | ||
} | ||
|
||
// Log output of a leaked goroutine during this period after test completion | ||
// does not trigger the warning. | ||
const stopGracePeriod = 10 * time.Second | ||
|
||
func (ls *tloggerShared) stop() { | ||
ls.mutex.Lock() | ||
defer ls.mutex.Unlock() | ||
ls.t = nil | ||
ls.stopTime = time.Now() | ||
} | ||
|
||
// tlogger is the actual LogSink implementation. | ||
|
@@ -241,6 +257,8 @@ type tlogger struct { | |
values []interface{} | ||
} | ||
|
||
// fallbackLogger is called while l.shared.mutex is locked and after it has | ||
// been determined that the original testing.TB is no longer usable. | ||
func (l tlogger) fallbackLogger() logr.Logger { | ||
logger := klog.Background().WithValues(l.values...).WithName(l.shared.testName + " leaked goroutine") | ||
if l.prefix != "" { | ||
|
@@ -250,8 +268,12 @@ func (l tlogger) fallbackLogger() logr.Logger { | |
logger = logger.WithCallDepth(l.shared.callDepth + 1) | ||
|
||
if !l.shared.goroutineWarningDone { | ||
logger.WithCallDepth(1).Error(nil, "WARNING: test kept at least one goroutine running after test completion", "callstack", string(dbg.Stacks(false))) | ||
l.shared.goroutineWarningDone = true | ||
duration := time.Since(l.shared.stopTime) | ||
if duration > stopGracePeriod { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This seems will bury the real goroutine leakage if the duration is less than stopGracePeriod, do I understand right here? Another point I can't get here is if the test completes, why will it wait for another 10s? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If the "leaked" goroutine stops after less than There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Totally misunderstood the meaning here. I'm not quite understand how it works but I still have a silly question like: Will we wait and detect the goroutine leakage after test completes? Or why the duration from stop will be greater than There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No, we don't wait. That is something that uber/goleak does, but not this code. This code is merely reporting on leaked goroutines if and when it detects them, which is when they keep logging. The There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I recalled goleak has maxRetries, then can't we just configure the goleak? It's configurable now. Both the retry times and sleep duration. I think somehow the same as There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Definitely agree.
Yes for klog... I mean for callers, like test framework. I suggested in this way is just because the goroutine leakage detection duration is mostly decided by goleak, the maxRetries(default to 20), maxSleep(default to 100ms) specifically, the As a user of klog, happy with either way to avoid confusion. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
So LGTM?
Leak detection with goleak only works when tests run sequentially or in TestMain. I'm not sure how useful it is in an E2E test suite, because the actual code under test runs elsewhere. For integration and unit tests, individual test owners have to integrate goleak. Some apiserver-based tests have it.
So this is about how klog will interact with goleak. I don't have a strong opinion here. I picked 10 seconds because it should be long enough to no produce "false positives" for the "I am shutting down after test completion" goroutines and short enough to give feedback for others. Because it's not causing a test failure, it can be lower than the goleak timeouts. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. /lgtm There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. are these 10s added to the test duration? 10s sounds like a lot to me There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No, they are not added. At the time that this gets called, the test has already completed. |
||
|
||
logger.WithCallDepth(1).Info("WARNING: test kept at least one goroutine running after test completion", "timeSinceCompletion", duration.Round(time.Second), "callstack", string(dbg.Stacks(false))) | ||
l.shared.goroutineWarningDone = true | ||
} | ||
} | ||
return logger | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this access need to be locked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Good catch, thanks.
/lgtm cancel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrrm, no.
fallbackLogger
gets called while the mutex is locked. This wasn't clear here, so I've added a comment explaining that (https://github.com/kubernetes/klog/compare/83145e470293dc67bede46353012cededd97e031..ed62027e604261af3e9c42c7a50092a0db074f19).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aojea: Okay for another LGTM?