
Release 2023-10-20: vm-monitor memory.high throttling fixes #5610

Merged: 3 commits into release on Oct 20, 2023

Conversation

sharnoff
Member

This release PR exists to fast-track deploying the fix(es) for #5444.

The plan is to just use this release to build images, and then manually update the versions of the pageservers (and safekeepers?) in the cplane db, region-by-region, to switch to these images.

tl;dr it's really hard to avoid throttling from memory.high, and it
counts tmpfs & page cache usage, so it's also hard to make sense of.

In the interest of fixing things quickly with something that should be
*good enough*, this PR switches to periodically fetching memory
statistics from the cgroup's memory.stat and using that data to
determine if and when we should upscale.

This PR fixes #5444, which has a lot more detail on the difficulties
we've hit with memory.high. This PR also supersedes #5488.
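As an illustration of the approach (not the actual vm-monitor code), a cgroup v2 memory.stat file is a flat list of "key value" lines, so computing usage from it is straightforward. The sketch below parses such a file and reads out anonymous memory, which sidesteps the page-cache and tmpfs accounting that made memory.high hard to interpret; the function names and the choice of the `anon` field are illustrative assumptions.

```rust
use std::collections::HashMap;

/// Parse the flat "key value" lines of a cgroup v2 `memory.stat` file.
/// (Illustrative sketch; not the vm-monitor's actual parser.)
fn parse_memory_stat(contents: &str) -> HashMap<String, u64> {
    contents
        .lines()
        .filter_map(|line| {
            let mut parts = line.split_whitespace();
            let key = parts.next()?;
            let value = parts.next()?.parse::<u64>().ok()?;
            Some((key.to_string(), value))
        })
        .collect()
}

fn main() {
    // Example contents: a subset of the fields a real memory.stat exposes.
    let stat = "anon 104857600\nfile 52428800\nkernel_stack 1048576\nsock 0\n";
    let parsed = parse_memory_stat(stat);
    // Counting only anonymous memory avoids the page-cache/tmpfs noise
    // that memory.high-based throttling has to account for.
    let anon = parsed.get("anon").copied().unwrap_or(0);
    println!("anon bytes: {}", anon);
}
```

In a real deployment this would read `/sys/fs/cgroup/<group>/memory.stat` on a timer and feed the result into the upscaling decision.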
There's currently an issue with the vm-monitor on staging that's not
really feasible to debug because the current display impl gives no
context to the errors (just says "failed to downscale").

Logging the full error should help.

For communications with the autoscaler-agent, it's ok to only provide
the outermost cause, because we can cross-reference with the VM logs.
At some point in the future, we may want to change that.
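To make the "log the full error" point concrete: in Rust, a `Display` impl that only shows the outermost message ("failed to downscale") hides the underlying cause, while walking `Error::source()` recovers the whole chain. The following is a minimal sketch with a hypothetical `DownscaleError` type, not the vm-monitor's real error type.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error type standing in for the vm-monitor's downscale error.
#[derive(Debug)]
struct DownscaleError {
    source: std::io::Error,
}

impl fmt::Display for DownscaleError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Outermost message only -- what the old logs showed.
        write!(f, "failed to downscale")
    }
}

impl Error for DownscaleError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}

/// Render the full cause chain, not just the outermost message.
fn full_chain(err: &dyn Error) -> String {
    let mut out = err.to_string();
    let mut cur = err.source();
    while let Some(e) = cur {
        out.push_str(": ");
        out.push_str(&e.to_string());
        cur = e.source();
    }
    out
}

fn main() {
    let err = DownscaleError {
        source: std::io::Error::new(std::io::ErrorKind::Other, "cgroup stats not ready"),
    };
    // Old behavior: only the outermost cause.
    println!("{}", err);
    // New behavior: the full chain, which makes staging issues debuggable.
    println!("{}", full_chain(&err));
}
```

Crates like `anyhow` provide the same chain rendering via the `{:#}` format specifier.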
Fixes an issue we observed on staging that happens when the
autoscaler-agent attempts to immediately downscale the VM after binding,
which is typical for pooled computes.

The issue was occurring because the autoscaler-agent was requesting
downscaling before the vm-monitor had gathered sufficient cgroup memory
stats to be confident in approving it. When the vm-monitor returned an
internal error instead of denying downscaling, the autoscaler-agent
retried the connection and immediately hit the same issue (in part
because cgroup stats are collected per-connection, rather than
globally).
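The fix described above amounts to returning a plain denial instead of an internal error when not enough samples exist yet. A hypothetical sketch of that decision logic (the names, the sample-count threshold, and the peak-usage check are all illustrative assumptions, not the vm-monitor's actual code):

```rust
/// Outcome of a downscale request from the autoscaler-agent (sketch).
#[derive(Debug)]
enum DownscaleDecision {
    Approve,
    Deny(String),
}

/// Decide on a downscale request given the cgroup memory samples
/// collected so far on this connection.
fn handle_downscale(
    samples: &[u64],
    min_samples: usize,
    requested_bytes: u64,
) -> DownscaleDecision {
    if samples.len() < min_samples {
        // Old behavior: return an internal error here, causing the agent
        // to retry the connection and hit the same state again.
        // New behavior: a normal denial the agent can handle gracefully.
        return DownscaleDecision::Deny("not enough memory samples yet".into());
    }
    let peak = samples.iter().copied().max().unwrap_or(0);
    if peak <= requested_bytes {
        DownscaleDecision::Approve
    } else {
        DownscaleDecision::Deny(format!(
            "recent peak usage {} exceeds requested size {}",
            peak, requested_bytes
        ))
    }
}

fn main() {
    // Right after binding (typical for pooled computes): no samples yet,
    // so the request is denied rather than erroring out.
    match handle_downscale(&[], 4, 1 << 30) {
        DownscaleDecision::Deny(reason) => println!("denied: {}", reason),
        DownscaleDecision::Approve => println!("approved"),
    }
}
```

Because samples are collected per-connection, the denial path is exactly what a freshly bound connection hits first.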
@sharnoff sharnoff requested a review from vadim2404 October 20, 2023 05:20
@sharnoff sharnoff requested review from a team as code owners October 20, 2023 05:20
@sharnoff sharnoff enabled auto-merge October 20, 2023 05:22
@github-actions

2292 tests run: 2175 passed, 1 failed, 116 skipped (full report)


Failures on Postgres 14

  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-index-delete]: release
# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_delete_timeline_exercise_crash_safety_failpoints[release-pg14-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-index-delete]"
Flaky tests (1)

Postgres 16

  • test_pageserver_lsn_wait_error_safekeeper_stop: debug

Test coverage report is not available

This comment is automatically updated with the latest test results.
Last update: 850db4c at 2023-10-20T05:49:47.290Z

@sharnoff sharnoff merged commit e614a95 into release Oct 20, 2023
31 of 32 checks passed
@sharnoff sharnoff deleted the sharnoff/rc-2023-10-20-vm-monitor-fixes branch October 20, 2023 07:11