
Add "memory high" upscaling via cgroups #30

Merged: 19 commits into main, Feb 7, 2023
Conversation

sharnoff (Member) commented Jan 30, 2023:

High-level features added:

  • agent: /try-upscale endpoint, taking api.MoreResources — VM informant can request upscaling. Refer to the changes to ARCHITECTURE.md for a brief overview.
  • informant: off-by-default cgroup memory.high event tracking — triggers /try-upscale.

Prior issues fixed:

  • VM informant will deny downscaling if it's too close to triggering a memory.high event (fixes the "downscaling below current usage" issue).

Remaining tasks:

  • Add a little "allocate a bunch of memory" test program for the test VM
  • Actually test that it works (and fix the bugs) — with and without cgroup handling enabled.

Follow-up tasks:

  • Integrate the cgroups stuff into compute_ctl
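
For background on the memory.high tracking described above: on cgroup v2, the kernel counts memory.high breaches in the cgroup's memory.events file as a "high N" line. The sketch below parses that counter with the standard library only; the informant's actual mechanism (how it watches for new events, how it debounces) may differ.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseHighEvents extracts the "high" counter from the contents of a
// cgroup v2 memory.events file. A rising value means the cgroup exceeded
// its memory.high threshold, which is the signal that would trigger a
// /try-upscale request.
func parseHighEvents(contents string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(contents))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 && fields[0] == "high" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("no 'high' entry in memory.events")
}

func main() {
	sample := "low 0\nhigh 3\nmax 0\noom 0\noom_kill 0\n"
	n, err := parseHighEvents(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("memory.high events so far:", n)
}
```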

@sharnoff sharnoff requested a review from tychoish January 30, 2023 02:11
@sharnoff sharnoff mentioned this pull request Jan 30, 2023
cmd/vm-informant/main.go (outdated, resolved)
cmd/vm-informant/main.go (outdated, resolved)
pkg/agent/informant.go (resolved)
s.runner.logger.Warningf("%s", internalErr)

// To be nice, we'll restart the server. We don't want to make a temporary error permanent.
s.exit(InformantServerExitStatus{
Contributor commented:

the comment makes it seem like this method should be called "restart" or something?

pkg/agent/runner.go (resolved)
pkg/informant/cgroup.go (resolved)
pkg/informant/cgroup.go (outdated, resolved)
"RequestUpscale called for Agent %s/%s that is already unregistered (probably *not* a race?)",
a.serverAddr, a.id,
)
handleError(context.Canceled)
Contributor commented:

this seems like a weird sentinel error to pass around, but sure.

sharnoff (Member, Author) commented:

Any recommendations?

Contributor commented:

just declare a named sentinel error, ErrRequestConflictTimeout or something similar.

Comment on lines 364 to 366
if errors.Is(err, context.Canceled) {
return
}
Contributor commented:

what about deadline exceeded?

sharnoff (Member, Author) commented:

IIRC the context is cancelled only when the agent unregisters itself (and maybe when a new agent replaces it? can't remember offhand), so cancellation is somewhat expected. Exceeding the deadline is not expected behavior.

Might be misremembering. I'll have to double-check.

pkg/api/types.go (resolved)
@sharnoff sharnoff force-pushed the sharnoff/cgroup-upscaling branch from 23e7b4c to bfbf987 on February 1, 2023 04:05
sharnoff (Member, Author) commented Feb 1, 2023:

Pushed some updates. To get it working, opened neondatabase/neonvm#33; this PR is blocked on that before merging.

@sharnoff sharnoff force-pushed the sharnoff/cgroup-upscaling branch 2 times, most recently from 2ff0a7c to f9408ba on February 6, 2023 03:12
@sharnoff sharnoff marked this pull request as ready for review February 6, 2023 03:14
sharnoff (Member, Author) commented Feb 6, 2023:

Spent a while dealing with issues from an over-eager OOM killer after memory hotplug failures. It seems like fixing some related VM informant bugs stopped that from happening, even though they couldn't plausibly have been the cause.

sharnoff added a commit that referenced this pull request Feb 6, 2023
Required for #30. This version of NeonVM also now has networking, so the
various bits of networking from pre-NeonVM days were updated. Properly
switching back to all that will have to come with re-enabling migration.
sharnoff added a commit that referenced this pull request Feb 6, 2023
Required for #30. This version of NeonVM also now has networking, so the
various bits of networking from pre-NeonVM days were removed. Properly
switching back to all that will wait until migration is re-enabled.
@sharnoff sharnoff force-pushed the sharnoff/cgroup-upscaling branch from a35a561 to a330e21 on February 6, 2023 20:07
sharnoff (Member, Author) commented Feb 7, 2023:

Ok, I think this is good to merge. I'll give it a once-over before merging tomorrow, with a follow-up PR to https://github.com/neondatabase/neon to add cgroup handling there as well.

This feature has some unfortunate interactions with other open issues, particularly:

  1. Our current metrics and scaling algorithm mean that it's hard to scale back down after increasing due to /try-upscale requests (if it can't go all the way back down to the level indicated by load average, it won't go down at all)
  2. Some of the raciness around NeonVM means we have to "trust" what the autoscaler-agent says when it calls the /upscale endpoint (see: agent vs NeonVM state is inconsistent when NeonVM fails #23 (comment), point (2))

I believe both of these require protocol changes, alongside some of the other changes from #27.

@sharnoff sharnoff merged commit 1c17415 into main Feb 7, 2023
bayandin pushed a commit that referenced this pull request Feb 23, 2023
@sharnoff sharnoff deleted the sharnoff/cgroup-upscaling branch February 28, 2023 23:11
@sharnoff sharnoff mentioned this pull request Mar 6, 2023
bayandin pushed a commit that referenced this pull request Mar 22, 2023
bayandin pushed a commit that referenced this pull request Mar 22, 2023