Add "memory high" upscaling via cgroups #30
Conversation
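For context on the title: in cgroup v2, memory.high is a soft limit; when a cgroup's usage crosses it, the kernel throttles the cgroup and increments the "high" counter in memory.events, which a watcher can use as the trigger to request upscaling. Here's a minimal sketch of that mechanism, assuming cgroup v2 mounted at /sys/fs/cgroup and a hypothetical cgroup name; this is illustration only, not the PR's implementation:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Hypothetical cgroup path; the real informant manages its own cgroup.
	cg := "/sys/fs/cgroup/neon-postgres"

	// Set the soft limit ("memory high") to 1 GiB. Usage above this is
	// throttled rather than OOM-killed.
	if err := os.WriteFile(cg+"/memory.high", []byte("1073741824"), 0o644); err != nil {
		fmt.Println("set memory.high:", err)
		return
	}

	// memory.events holds counters like "high <n>"; n increments each time
	// usage crosses memory.high, so polling (or inotify-watching) this file
	// tells us when it's time to ask for more memory.
	data, err := os.ReadFile(cg + "/memory.events")
	if err != nil {
		fmt.Println("read memory.events:", err)
		return
	}
	fmt.Print(string(data))
}
```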
```go
s.runner.logger.Warningf("%s", internalErr)

// To be nice, we'll restart the server. We don't want to make a temporary error permanent.
s.exit(InformantServerExitStatus{
```
the comment makes it seem like this method should be called "restart" or something?
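A minimal sketch of what that rename might look like, with hypothetical stand-in types so it compiles on its own (the real InformantServer and InformantServerExitStatus live in pkg/informant and have more to them than shown here):

```go
package informant

// Stand-in types so the sketch compiles; the real definitions differ.
type Logger interface {
	Warningf(format string, args ...any)
}

type InformantServerExitStatus struct {
	Err            error
	RetryShouldFix bool // assumption: "a retry should fix this" flag
}

type InformantServer struct {
	logger Logger
	exit   func(InformantServerExitStatus)
}

// restart names the pattern from the diff above for what it is: log the
// internal error, then exit with a status asking to be restarted, so a
// temporary error isn't made permanent.
func (s *InformantServer) restart(internalErr error) {
	s.logger.Warningf("%s", internalErr)
	s.exit(InformantServerExitStatus{
		Err:            internalErr,
		RetryShouldFix: true,
	})
}
```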
"RequestUpscale called for Agent %s/%s that is already unregistered (probably *not* a race?)", | ||
a.serverAddr, a.id, | ||
) | ||
handleError(context.Canceled) |
this seems like a weird sentinel error to pass around, but sure.
Any recommendations?
just declare a sentinel error named ErrRequestConflictTimeout or something similar.
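A minimal sketch of the suggestion; only the ErrRequestConflictTimeout name comes from the review, and the message wording is an assumption:

```go
package informant

import "errors"

// ErrRequestConflictTimeout replaces the reuse of context.Canceled as a
// sentinel: it marks an upscale request that was abandoned because of a
// conflicting request, not because its context was actually cancelled.
var ErrRequestConflictTimeout = errors.New("timed out waiting for conflicting request")
```

Callers would then pass handleError(ErrRequestConflictTimeout) and check for it with errors.Is, leaving context.Canceled to mean only real cancellation.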
pkg/informant/agent.go (outdated)
```go
if errors.Is(err, context.Canceled) {
	return
}
```
what about deadline exceeded?
IIRC the context is cancelled only when the agent unregisters itself (and maybe when a new agent replaces it? can't remember off the top of my head), so that's somewhat expected. Exceeding the deadline is not expected behavior.
Might be misremembering. I'll have to double-check.
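For the record, errors.Is distinguishes the two cases cleanly. A self-contained sketch of the difference being discussed, not the informant's actual code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// classify separates the "expected" cancellation path from a timeout, which
// per the discussion above is not expected and probably deserves a log line.
func classify(err error) {
	switch {
	case errors.Is(err, context.Canceled):
		fmt.Println("canceled: agent unregistered (expected), returning quietly")
	case errors.Is(err, context.DeadlineExceeded):
		fmt.Println("deadline exceeded: unexpected, should be surfaced")
	default:
		fmt.Println("other error:", err)
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
	defer cancel()
	<-ctx.Done()
	classify(ctx.Err()) // prints the deadline-exceeded branch
}
```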
Force-pushed from 23e7b4c to bfbf987
Pushed some updates. To get this working, I opened neondatabase/neonvm#33; this PR is blocked on that before merging.
Force-pushed from 2ff0a7c to f9408ba
Spent a while dealing with issues from an over-eager OOM killer after memory hotplug failures, but it seems like fixing some related VM informant bugs stopped that from happening, even though they couldn't possibly be the cause.
Required for #30. This version of NeonVM also now has networking, so the various bits of networking from pre-NeonVM days were removed. Properly switching back to all that will wait until migration is re-enabled.
Also fixes the method's documentation so that it more closely matches the actual behavior.
Force-pushed from a35a561 to a330e21
Ok, I think this is good to merge. I'll give it a once-over before merging tomorrow, with a follow-up PR to https://github.com/neondatabase/neon to add cgroup handling there as well. This feature has some unfortunate interactions with other open issues, particularly:
I believe both of these require protocol changes, alongside some of the other changes from #27.
High-level features added:
- /try-upscale endpoint, taking api.MoreResources — the VM informant can request upscaling (see the sketch after this list). Refer to the changes to ARCHITECTURE.md for a brief overview.
- … /try-upscale.

Prior issues fixed: …

Remaining tasks: …

Follow-up tasks:
- … compute_ctl
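A minimal sketch of the /try-upscale shape described above, assuming a JSON body and the field names of api.MoreResources (both are assumptions, as is the port; the real handler lives in the informant's HTTP server):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// MoreResources stands in for api.MoreResources; the cpu/memory fields are
// an assumption for illustration.
type MoreResources struct {
	Cpu    bool `json:"cpu"`
	Memory bool `json:"memory"`
}

// tryUpscaleHandler decodes the informant's request for more resources. In
// the real system the autoscaler-agent would act on it; here we just ack.
func tryUpscaleHandler(w http.ResponseWriter, r *http.Request) {
	var req MoreResources
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	fmt.Printf("upscale requested: cpu=%v memory=%v\n", req.Cpu, req.Memory)
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/try-upscale", tryUpscaleHandler)
	_ = http.ListenAndServe("127.0.0.1:10301", nil) // hypothetical port
}
```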