
Add "memory high" upscaling via cgroups #30

Merged: 19 commits into main, Feb 7, 2023
Conversation

sharnoff (Member) commented Jan 30, 2023:

High-level features added:

  • agent: /try-upscale endpoint, taking api.MoreResources — VM informant can request upscaling. Refer to the changes to ARCHITECTURE.md for a brief overview.
  • informant: off-by-default cgroup memory.high event tracking — triggers /try-upscale.

Prior issues fixed:

  • VM informant will deny downscaling if it's too close to triggering a memory.high event (fixes the "downscaling below current usage" issue).

Remaining tasks:

  • Add a little "allocate a bunch of memory" test program for the test VM
  • Actually test that it works (and fix the bugs) — with and without cgroup handling enabled.

Follow-up tasks:

  • Integrate the cgroups stuff into compute_ctl
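
For background on the memory.high tracking described above: on cgroup v2, the kernel counts memory.high breaches in the cgroup's memory.events file as a "high N" line. The sketch below parses that counter with the standard library only; the informant's actual mechanism (how it watches for new events, how it debounces) may differ.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseHighEvents extracts the "high" counter from the contents of a
// cgroup v2 memory.events file. A rising value means the cgroup exceeded
// its memory.high threshold, which is the signal that would trigger a
// /try-upscale request.
func parseHighEvents(contents string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(contents))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 && fields[0] == "high" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("no 'high' entry in memory.events")
}

func main() {
	sample := "low 0\nhigh 3\nmax 0\noom 0\noom_kill 0\n"
	n, err := parseHighEvents(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("memory.high events so far:", n)
}
```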

@sharnoff sharnoff requested a review from tychoish January 30, 2023 02:11
@sharnoff sharnoff mentioned this pull request Jan 30, 2023
cmd/vm-informant/main.go (outdated, resolved)
cmd/vm-informant/main.go (outdated, resolved)
pkg/agent/informant.go (resolved)
s.runner.logger.Warningf("%s", internalErr)

// To be nice, we'll restart the server. We don't want to make a temporary error permanent.
s.exit(InformantServerExitStatus{
Contributor commented:

the comment makes it seem like this method should be called "restart" or something?

pkg/agent/runner.go (resolved)
pkg/informant/cgroup.go (resolved)
pkg/informant/cgroup.go (outdated, resolved)
"RequestUpscale called for Agent %s/%s that is already unregistered (probably *not* a race?)",
a.serverAddr, a.id,
)
handleError(context.Canceled)
Contributor commented:

this seems like a weird sentinel error to pass around, but sure.

sharnoff (Member, Author) commented:

Any recommendations?

Contributor commented:

just declare a named sentinel error, ErrRequestConflictTimeout or something similar.

Comment on lines 364 to 366
if errors.Is(err, context.Canceled) {
return
}
Contributor commented:

what about deadline exceeded?

sharnoff (Member, Author) commented:

IIRC the context is cancelled only when the agent unregisters itself (and maybe when a new agent replaces it? can't remember offhand), so cancellation is somewhat expected. Exceeding the deadline is not expected behavior.

Might be misremembering. I'll have to double-check.

pkg/api/types.go (resolved)
@sharnoff sharnoff force-pushed the sharnoff/cgroup-upscaling branch from 23e7b4c to bfbf987 on February 1, 2023 04:05
sharnoff (Member, Author) commented Feb 1, 2023:

Pushed some updates. To get it working, opened neondatabase/neonvm#33; this PR is blocked on that before merging.

@sharnoff sharnoff force-pushed the sharnoff/cgroup-upscaling branch 2 times, most recently from 2ff0a7c to f9408ba on February 6, 2023 03:12
@sharnoff sharnoff marked this pull request as ready for review February 6, 2023 03:14
sharnoff (Member, Author) commented Feb 6, 2023:

Spent a while dealing with issues from an over-eager OOM killer after memory hotplug failures. It seems like fixing some related VM informant bugs stopped that from happening, even though they couldn't plausibly have been the cause.

sharnoff added a commit that referenced this pull request Feb 6, 2023
Required for #30. This version of NeonVM also now has networking, so the
various bits of networking from pre-NeonVM days were updated. Properly
switching back to all that will have to come with re-enabling migration.
sharnoff added a commit that referenced this pull request Feb 6, 2023
Required for #30. This version of NeonVM also now has networking, so the
various bits of networking from pre-NeonVM days were removed. Properly
switching back to all that will wait until migration is re-enabled.
@sharnoff sharnoff force-pushed the sharnoff/cgroup-upscaling branch from a35a561 to a330e21 on February 6, 2023 20:07
sharnoff (Member, Author) commented Feb 7, 2023:

Ok, I think this is good to merge. I'll give it a once-over before merging tomorrow, with a follow-up PR to https://github.com/neondatabase/neon to add cgroup handling there as well.

This feature has some unfortunate interactions with other open issues, particularly:

  1. Our current metrics and scaling algorithm mean that it's hard to scale back down after increasing due to /try-upscale requests (if it can't go all the way back down to the level indicated by load average, it won't go down at all)
  2. Some of the raciness around NeonVM means we have to "trust" what the autoscaler-agent says when it calls the /upscale endpoint (see: agent vs NeonVM state is inconsistent when NeonVM fails #23 (comment), point (2))

I believe both of these require protocol changes, alongside some of the other changes from #27.

@sharnoff sharnoff merged commit 1c17415 into main Feb 7, 2023
bayandin pushed a commit that referenced this pull request Feb 23, 2023
@sharnoff sharnoff deleted the sharnoff/cgroup-upscaling branch February 28, 2023 23:11
@sharnoff sharnoff mentioned this pull request Mar 6, 2023
bayandin pushed a commit that referenced this pull request Mar 22, 2023
bayandin pushed a commit that referenced this pull request Mar 22, 2023