Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ktesting: tone down warning about leaked test goroutine #401

Merged
merged 1 commit into from
Jun 15, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 26 additions & 4 deletions ktesting/testinglogger.go
Original file line number Diff line number Diff line change
Expand Up @@ -217,8 +217,19 @@ type tloggerShared struct {
// t gets cleared when the test is completed.
t TL

// We warn once when a leaked goroutine is detected because
// it logs after test completion.
// The time when the test completed.
stopTime time.Time

// We warn once when a leaked goroutine logs after test completion.
//
// Not terminating immediately is fairly normal: many controllers
// log "terminating" messages while they shut down, which often is
// right after test completion, if that is when the test cancels the
// context and then doesn't wait for goroutines (which is often
// not possible).
//
// Therefore there is the [stopGracePeriod] during which messages get
// logged to the global logger without the warning.
goroutineWarningDone bool

formatter serialize.Formatter
Expand All @@ -228,10 +239,15 @@ type tloggerShared struct {
callDepth int
}

// Log output of a leaked goroutine during this period after test completion
// does not trigger the warning.
const stopGracePeriod = 10 * time.Second

func (ls *tloggerShared) stop() {
ls.mutex.Lock()
defer ls.mutex.Unlock()
ls.t = nil
ls.stopTime = time.Now()
}

// tlogger is the actual LogSink implementation.
Expand All @@ -241,6 +257,8 @@ type tlogger struct {
values []interface{}
}

// fallbackLogger is called while l.shared.mutex is locked and after it has
// been determined that the original testing.TB is no longer usable.
func (l tlogger) fallbackLogger() logr.Logger {
logger := klog.Background().WithValues(l.values...).WithName(l.shared.testName + " leaked goroutine")
if l.prefix != "" {
Expand All @@ -250,8 +268,12 @@ func (l tlogger) fallbackLogger() logr.Logger {
logger = logger.WithCallDepth(l.shared.callDepth + 1)

if !l.shared.goroutineWarningDone {
logger.WithCallDepth(1).Error(nil, "WARNING: test kept at least one goroutine running after test completion", "callstack", string(dbg.Stacks(false)))
l.shared.goroutineWarningDone = true
duration := time.Since(l.shared.stopTime)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this access need to be locked?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Good catch, thanks.

/lgtm cancel

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrrm, no. fallbackLogger gets called while the mutex is locked. This wasn't clear here, so I've added a comment explaining that (https://github.com/kubernetes/klog/compare/83145e470293dc67bede46353012cededd97e031..ed62027e604261af3e9c42c7a50092a0db074f19).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aojea: Okay for another LGTM?

if duration > stopGracePeriod {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems will bury the real goroutine leakage if the duration is less than stopGracePeriod, do I understand right here?

Another point I can't get here is if the test completes, why will it wait for another 10s?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the "leaked" goroutine stops after less than stopGracePeriod, then it wasn't really a leak. It was "just" a test that didn't block and wait for goroutines at the end. This is very common.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally misunderstood the meaning here. I'm not quite understand how it works but I still have a silly question like: Will we wait and detect the goroutine leakage after test completes? Or why the duration from stop will be greater than stopGracePeriod.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we don't wait. That is something that uber/goleak does, but not this code. This code is merely reporting on leaked goroutines if and when it detects them, which is when they keep logging.

The stopGracePeriod is the grace period that goroutines have before their log output triggers this warning. I can rename it if you have a better suggestion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recalled goleak has maxRetries, then can't we just configure the goleak? It's configurable now. Both the retry times and sleep duration. I think somehow the same as stopGracePeriod. Will this lead to other problems?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep the warning for goroutines that are more likely to be real leaks while not printing it for goroutines which terminate quickly after test completion.

Definitely agree.

We cannot depend on goleak here

Yes for klog... I mean for callers, like test framework. I suggested in this way is just because the goroutine leakage detection duration is mostly decided by goleak, the maxRetries(default to 20), maxSleep(default to 100ms) specifically, the duration ~= maxRetries * (maxSleep + filterStacks), have no idea whether 10s fits here. However, the feature is reverted... uber-go/goleak#116, so no way to configure now.

As a user of klog, happy with either way to avoid confusion.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely agree.

So LGTM?

I mean for callers, like test framework.

Leak detection with goleak only works when tests run sequentially or in TestMain. I'm not sure how useful it is in an E2E test suite, because the actual code under test runs elsewhere. For integration and unit tests, individual test owners have to integrate goleak. Some apiserver-based tests have it.

have no idea whether 10s fits here

So this is about how klog will interact with goleak. I don't have a strong opinion here. I picked 10 seconds because it should be long enough to no produce "false positives" for the "I am shutting down after test completion" goroutines and short enough to give feedback for others. Because it's not causing a test failure, it can be lower than the goleak timeouts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these 10s added to the test duration? 10s sounds like a lot to me

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, they are not added. At the time that this gets called, the test has already completed.


logger.WithCallDepth(1).Info("WARNING: test kept at least one goroutine running after test completion", "timeSinceCompletion", duration.Round(time.Second), "callstack", string(dbg.Stacks(false)))
l.shared.goroutineWarningDone = true
}
}
return logger
}
Expand Down
15 changes: 10 additions & 5 deletions ktesting/testinglogger_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@ import (
"runtime"
"sync"
"testing"
"time"

"k8s.io/klog/v2"
"k8s.io/klog/v2/internal/test/require"
Expand Down Expand Up @@ -198,6 +199,7 @@ func TestStop(t *testing.T) {
_, _, line, _ = runtime.Caller(0)
logger.Info("simple info message")
logger.Error(nil, "error message")
time.Sleep(15 * time.Second)
logger.WithName("me").WithValues("completed", true).Info("complex info message", "anotherValue", 1)
}()
})
Expand All @@ -212,11 +214,14 @@ func TestStop(t *testing.T) {
// Strip time and pid prefix.
actual = regexp.MustCompile(`(?m)^.* testinglogger_test.go:`).ReplaceAllString(actual, `testinglogger_test.go:`)

// Strip duration.
actual = regexp.MustCompile(`timeSinceCompletion="\d+s"`).ReplaceAllString(actual, `timeSinceCompletion="<...>s"`)

// All lines from the callstack get stripped. We can be sure that it was non-empty because otherwise we wouldn't
// have the < > markers.
//
// Full output:
// testinglogger_test.go:194] "WARNING: test kept at least one goroutine running after test completion" logger="TestStop/Sub leaked goroutine" callstack=<
// testinglogger_test.go:194] "WARNING: test kept at least one goroutine running after test completion" logger="TestStop/Sub leaked goroutine.me" completed=true timeSinceCompletion="15s" callstack=<
// goroutine 23 [running]:
// k8s.io/klog/v2/internal/dbg.Stacks(0x0)
// /nvme/gopath/src/k8s.io/klog/internal/dbg/dbg.go:34 +0x8a
Expand All @@ -233,13 +238,13 @@ func TestStop(t *testing.T) {
// >
actual = regexp.MustCompile(`(?m)^\t.*?\n`).ReplaceAllString(actual, ``)

expected := fmt.Sprintf(`testinglogger_test.go:%d] "WARNING: test kept at least one goroutine running after test completion" logger="TestStop/Sub leaked goroutine" callstack=<
>
testinglogger_test.go:%d] "simple info message" logger="TestStop/Sub leaked goroutine"
expected := fmt.Sprintf(`testinglogger_test.go:%d] "simple info message" logger="TestStop/Sub leaked goroutine"
testinglogger_test.go:%d] "error message" logger="TestStop/Sub leaked goroutine"
testinglogger_test.go:%d] "WARNING: test kept at least one goroutine running after test completion" logger="TestStop/Sub leaked goroutine.me" completed=true timeSinceCompletion="<...>s" callstack=<
>
testinglogger_test.go:%d] "complex info message" logger="TestStop/Sub leaked goroutine.me" completed=true anotherValue=1
`,
line+1, line+1, line+2, line+3)
line+1, line+2, line+4, line+4)
if actual != expected {
t.Errorf("Output does not match. Expected:\n%s\nActual:\n%s\n", expected, actual)
}
Expand Down
Loading