agent: plumb contexts through #59

Merged
merged 6 commits into main from tychoish/agent-context-handling on Mar 1, 2023

Conversation

tychoish (Contributor)

In pursuit of #55 and #12 (but certainly not all of either).

tychoish requested a review from sharnoff on February 21, 2023
sharnoff (Member) left a comment

Had some thoughts, left comments.

Separately: It seems like this should make it easy-ish to address #37 (basically, listen for context ending and trigger informant's server.exit).

pkg/agent/informant.go (outdated, resolved)
@@ -72,7 +70,7 @@ func (r MainRunner) Run() error {
globalState.Stop()
return nil
case event := <-podEvents:
-		globalState.handleEvent(event)
+		globalState.handleEvent(ctx, event)
sharnoff (Member)

This feels icky (handleEvent will spawn threads using the context long after handleEvent finishes, and we're always using the same context for handleEvent), but the only better solution is storing the context in globalState itself.

Thoughts?

tychoish (Contributor Author)

I don't think there's any problem with passing a context to a function that returns before the goroutines it spawns finish (and indeed, having this context means that shutting down the agent's main loop will actually cause a shutdown). Eventually/soon the basecontext/shutdown stuff can/will help make some of this more manageable.

Comment on lines +216 to +221
// we want shutdown to (potentially) live longer than the request which
// made it, but having a timeout is still good.
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

if err := server.unregisterFromInformant(ctx); err != nil {
sharnoff (Member)

unregisterFromInformant already uses a timeout on the request itself; do we need another one?

tychoish (Contributor Author)

I think the fact that doInformantRequest takes a timeout and a context is a bit of a weird API, and I kind of planned to pull that apart in a later PR, but I don't feel rushed about that.

It's plausible that we could just pass the enclosing context to the unregister call and not worry about it during shutdown, but...

cmd/autoscaler-agent/main.go (outdated, resolved)
runner.spawnBackgroundWorker(ctx, shutdownName, func(context.Context) {
// we want shutdown to (potentially) live longer than the request which
// made it, but having a timeout is still good.
ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
sharnoff (Member)

20 seconds is a super long timeout compared to other things here (or: compared to our usual configuration of them). I'd either make this shorter (e.g. 5s), add a config field for it, or calculate it from an existing config field.

tychoish (Contributor Author)

I mean, I think the question is more "how long could an HTTP request handler take to do a thing in normal operation?", and then double(ish) it.

The cap on this is about 30s in my mind (which is probably just what the timeout was in System V init scripts between SIGTERM and SIGKILL if the process doesn't die, and which has definitely been carried further into the future).

tychoish (Contributor Author)

> Separately: It seems like this should make it easy-ish to address #37 (basically, listen for context ending and trigger informant's server.exit).

Yep! That's the hope.

sharnoff (Member) left a comment

I think the unresolved comments are mostly bikeshedding. Worst-case it's a little bit of tech debt that we already have plans to resolve.

Two suggested changes; mostly, I think it'd be good to have context.TODO() so that it's more immediately obvious that the contexts for the background workers are disconnected.

pkg/agent/informant.go (outdated, resolved)
pkg/agent/informant.go (outdated, resolved)
tychoish merged commit bd9595e into main on Mar 1, 2023
sharnoff (Member) commented Mar 4, 2023

This appears to be causing the autoscaler-agent to immediately crash on startup. Logs:

I0304 00:36:48.696058       1 main.go:31] Got environment args: {ConfigPath:/etc/autoscaler-agent-config/config.json K8sNodeName:autoscale-sched-worker K8sPodIP:10.244.1.11}
I0304 00:36:48.696600       1 entrypoint.go:29] buildInfo.GitInfo:   bd9595e (2023-03-01 16:33:16 +0000) - agent: plumb contexts through (#59)
I0304 00:36:48.696615       1 entrypoint.go:30] buildInfo.NeonVM:    v0.4.6
I0304 00:36:48.696619       1 entrypoint.go:31] buildInfo.GoVersion: go1.19.6
I0304 00:36:48.696624       1 entrypoint.go:33] Starting pod watcher
I0304 00:36:48.796260       1 entrypoint.go:38] Pod watcher started
I0304 00:36:48.796273       1 entrypoint.go:40] Starting VM watcher
I0304 00:36:48.798530       1 entrypoint.go:45] VM watcher started
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x139c6ed]

goroutine 86 [running]:
github.com/tychoish/fun/seq.(*elemIter[...]).Next(0xc000652ff0?, {0x1940458?, 0xc0000ed080?})
	/go/pkg/mod/github.com/tychoish/[email protected]/seq/list.go:438 +0x2d
github.com/tychoish/fun/set.syncIterImpl[...].Next({{0x19339d8?, 0xc0005993b8?}, {0x1934d10?, 0xc0003c4b30?}}, {0x1940458?, 0xc0000ed080})
	/go/pkg/mod/github.com/tychoish/[email protected]/set/set.go:149 +0xdb
github.com/tychoish/fun/pubsub.(*Broker[...]).dispatchMessage(0x1940458, {0x1940458?, 0xc0000ed080?}, {0x193fc30, 0xc000456640}, {{{{0xc000652ff0, 0x24}, {0xc000564e40, 0xb}}, {0xc000653020, ...}, ...}, ...})
	/go/pkg/mod/github.com/tychoish/[email protected]/pubsub/broker.go:202 +0x2f1
github.com/tychoish/fun/pubsub.(*Broker[...]).startQueueWorkers.func2()
	/go/pkg/mod/github.com/tychoish/[email protected]/pubsub/broker.go:182 +0x145
created by github.com/tychoish/fun/pubsub.(*Broker[...]).startQueueWorkers
	/go/pkg/mod/github.com/tychoish/[email protected]/pubsub/broker.go:175 +0x250

AFAICT this is exclusively caused by the dependency bump.

If this is indeed an issue with github.com/tychoish/fun and a fix is not yet available, this PR should be reverted before merging anything else.

sharnoff added a commit that referenced this pull request Mar 5, 2023
Fixes an issue causing autoscaler-agents to crash on startup, introduced
by #59.
sharnoff (Member) commented Mar 5, 2023

Upgrading to v0.7.1 fixes the issue. Opened #73 to do so.

sharnoff added a commit that referenced this pull request Mar 5, 2023
Fixes an issue causing autoscaler-agents to crash on startup, see:
#59 (comment).
sharnoff deleted the tychoish/agent-context-handling branch on March 7, 2023