
Release 2023-10-20: vm-monitor memory.high throttling fixes #5610

Merged: 3 commits into release on Oct 20, 2023

Conversation

sharnoff
Member

This release PR exists to fast-track deploying the fix(es) for #5444.

The plan is to just use this release to build images, and then manually update the versions of the pageservers (and safekeepers?) in the cplane db, region-by-region, to switch to these images.

tl;dr it's really hard to avoid throttling from memory.high, and it
counts tmpfs & page cache usage, so it's also hard to make sense of.

In the interest of fixing things quickly with something that should be
*good enough*, this PR switches to periodically fetching memory
statistics from the cgroup's memory.stat and using that data to
determine if and when we should upscale.

This PR fixes #5444, which has a lot more detail on the difficulties
we've hit with memory.high. This PR also supersedes #5488.
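As an illustration of the approach (not the actual vm-monitor code), a cgroup v2 memory.stat file is a flat list of "key value" lines, so computing usage from it is straightforward. The sketch below parses such a file and reads out anonymous memory, which sidesteps the page-cache and tmpfs accounting that made memory.high hard to interpret; the function names and the choice of the `anon` field are illustrative assumptions.

```rust
use std::collections::HashMap;

/// Parse the flat "key value" lines of a cgroup v2 `memory.stat` file.
/// (Illustrative sketch; not the vm-monitor's actual parser.)
fn parse_memory_stat(contents: &str) -> HashMap<String, u64> {
    contents
        .lines()
        .filter_map(|line| {
            let mut parts = line.split_whitespace();
            let key = parts.next()?;
            let value = parts.next()?.parse::<u64>().ok()?;
            Some((key.to_string(), value))
        })
        .collect()
}

fn main() {
    // Example contents: a subset of the fields a real memory.stat exposes.
    let stat = "anon 104857600\nfile 52428800\nkernel_stack 1048576\nsock 0\n";
    let parsed = parse_memory_stat(stat);
    // Counting only anonymous memory avoids the page-cache/tmpfs noise
    // that memory.high-based throttling has to account for.
    let anon = parsed.get("anon").copied().unwrap_or(0);
    println!("anon bytes: {}", anon);
}
```

In a real deployment this would read `/sys/fs/cgroup/<group>/memory.stat` on a timer and feed the result into the upscaling decision.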
There's currently an issue with the vm-monitor on staging that's not
really feasible to debug because the current display impl gives no
context to the errors (just says "failed to downscale").

Logging the full error should help.

For communications with the autoscaler-agent, it's ok to only provide
the outermost cause, because we can cross-reference with the VM logs.
At some point in the future, we may want to change that.
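To make the "log the full error" point concrete: in Rust, a `Display` impl that only shows the outermost message ("failed to downscale") hides the underlying cause, while walking `Error::source()` recovers the whole chain. The following is a minimal sketch with a hypothetical `DownscaleError` type, not the vm-monitor's real error type.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error type standing in for the vm-monitor's downscale error.
#[derive(Debug)]
struct DownscaleError {
    source: std::io::Error,
}

impl fmt::Display for DownscaleError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Outermost message only -- what the old logs showed.
        write!(f, "failed to downscale")
    }
}

impl Error for DownscaleError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}

/// Render the full cause chain, not just the outermost message.
fn full_chain(err: &dyn Error) -> String {
    let mut out = err.to_string();
    let mut cur = err.source();
    while let Some(e) = cur {
        out.push_str(": ");
        out.push_str(&e.to_string());
        cur = e.source();
    }
    out
}

fn main() {
    let err = DownscaleError {
        source: std::io::Error::new(std::io::ErrorKind::Other, "cgroup stats not ready"),
    };
    // Old behavior: only the outermost cause.
    println!("{}", err);
    // New behavior: the full chain, which makes staging issues debuggable.
    println!("{}", full_chain(&err));
}
```

Crates like `anyhow` provide the same chain rendering via the `{:#}` format specifier.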
Fixes an issue we observed on staging that happens when the
autoscaler-agent attempts to immediately downscale the VM after binding,
which is typical for pooled computes.

The issue was occurring because the autoscaler-agent was requesting
downscaling before the vm-monitor had gathered sufficient cgroup memory
stats to be confident in approving it. When the vm-monitor returned an
internal error instead of denying downscaling, the autoscaler-agent
retried the connection and immediately hit the same issue (in part
because cgroup stats are collected per-connection, rather than
globally).
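The fix described above amounts to returning a plain denial instead of an internal error when not enough samples exist yet. A hypothetical sketch of that decision logic (the names, the sample-count threshold, and the peak-usage check are all illustrative assumptions, not the vm-monitor's actual code):

```rust
/// Outcome of a downscale request from the autoscaler-agent (sketch).
#[derive(Debug)]
enum DownscaleDecision {
    Approve,
    Deny(String),
}

/// Decide on a downscale request given the cgroup memory samples
/// collected so far on this connection.
fn handle_downscale(
    samples: &[u64],
    min_samples: usize,
    requested_bytes: u64,
) -> DownscaleDecision {
    if samples.len() < min_samples {
        // Old behavior: return an internal error here, causing the agent
        // to retry the connection and hit the same state again.
        // New behavior: a normal denial the agent can handle gracefully.
        return DownscaleDecision::Deny("not enough memory samples yet".into());
    }
    let peak = samples.iter().copied().max().unwrap_or(0);
    if peak <= requested_bytes {
        DownscaleDecision::Approve
    } else {
        DownscaleDecision::Deny(format!(
            "recent peak usage {} exceeds requested size {}",
            peak, requested_bytes
        ))
    }
}

fn main() {
    // Right after binding (typical for pooled computes): no samples yet,
    // so the request is denied rather than erroring out.
    match handle_downscale(&[], 4, 1 << 30) {
        DownscaleDecision::Deny(reason) => println!("denied: {}", reason),
        DownscaleDecision::Approve => println!("approved"),
    }
}
```

Because samples are collected per-connection, the denial path is exactly what a freshly bound connection hits first.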
@sharnoff sharnoff requested a review from vadim2404 October 20, 2023 05:20
@sharnoff sharnoff requested review from a team as code owners October 20, 2023 05:20
@sharnoff sharnoff enabled auto-merge October 20, 2023 05:22
@github-actions

2292 tests run: 2175 passed, 1 failed, 116 skipped (full report)


Failures on Postgres 14

  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-index-delete]: release
# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_delete_timeline_exercise_crash_safety_failpoints[release-pg14-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-index-delete]"
Flaky tests (1)

Postgres 16

  • test_pageserver_lsn_wait_error_safekeeper_stop: debug

Test coverage report is not available

This comment is automatically updated with the latest test results.
Last update: 850db4c at 2023-10-20T05:49:47.290Z

@sharnoff sharnoff merged commit e614a95 into release Oct 20, 2023
31 of 32 checks passed
@sharnoff sharnoff deleted the sharnoff/rc-2023-10-20-vm-monitor-fixes branch October 20, 2023 07:11