
neonvm-controller: improve node failure reaction speed #1055

Merged: 3 commits into main, Oct 15, 2024

Conversation

@Omrigan (Contributor) commented Aug 29, 2024

Multiple commits, their descriptions:

  • neonvm: recreate the pod for VM even if the older pod still exists

    We previously assumed that running multiple pods for one VM is
    dangerous.

    We now believe it should be fine, even if the old pod still tries to
    continue operating: the new pod would take over the safekeeper
    quorum, leaving the old one to disconnect and shut down.

  • neonvm: check if deletion timestamp is in the past

    If we are past the deletion timestamp, it means the deletion is
    stuck, and we should consider the pod failed anyway.

    Possible reasons for this are:

    1. Node is down.
    2. Pod is stuck pulling the image from the container registry.

  • neonvm: add explicit "node not ready" tolerations with 30s grace period

    By default they are 300s (5m), which is way too long.
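
For illustration, a minimal sketch of what such an explicit toleration looks like with the standard corev1 types. The helper name, and the inclusion of the companion "unreachable" taint, are assumptions rather than this PR's exact diff:

package controllers

import corev1 "k8s.io/api/core/v1"

// nodeNotReadyTolerations is a hypothetical helper: it tolerates the node
// "not ready" / "unreachable" taints for only 30 seconds instead of the
// 300-second default that Kubernetes injects, so a runner pod on a failed
// node is evicted (and the VM recreated) much sooner.
func nodeNotReadyTolerations() []corev1.Toleration {
	grace := int64(30)
	return []corev1.Toleration{
		{
			Key:               "node.kubernetes.io/not-ready",
			Operator:          corev1.TolerationOpExists,
			Effect:            corev1.TaintEffectNoExecute,
			TolerationSeconds: &grace,
		},
		{
			// Assumption: the matching "unreachable" taint is handled the same way.
			Key:               "node.kubernetes.io/unreachable",
			Operator:          corev1.TolerationOpExists,
			Effect:            corev1.TaintEffectNoExecute,
			TolerationSeconds: &grace,
		},
	}
}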

Part of https://github.com/neondatabase/cloud/issues/14114

@Omrigan Omrigan force-pushed the oleg/node-down-2 branch 2 times, most recently from e27cbd4 to 35bab49 Compare August 29, 2024 14:01
@Omrigan Omrigan changed the base branch from main to oleg/revert-crictl August 29, 2024 14:24
@Omrigan Omrigan force-pushed the oleg/node-down-2 branch 2 times, most recently from 95fbd47 to b4cd9ad Compare September 25, 2024 12:32
@Omrigan Omrigan marked this pull request as ready for review September 25, 2024 12:35
@Omrigan Omrigan force-pushed the oleg/node-down-2 branch 2 times, most recently from 5b1c98c to ba0f505 Compare September 25, 2024 13:22
@sharnoff (Member) commented Oct 7, 2024

@Omrigan IIUC this PR needs rebasing?

Base automatically changed from oleg/revert-crictl to main October 7, 2024 22:21
@Omrigan Omrigan changed the title Improve node failure reaction speed neonvm: improve node failure reaction speed Oct 7, 2024
@sharnoff (Member) left a comment

Broadly looks good, I think.

The one major item I'd like to see is to have this feature-gated behind some CLI flag. Reasoning is that this type of change is at higher risk of cascading failures (e.g., if restarting causes us to trigger even more restarts) — so we should have an escape hatch, just in case.
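
For illustration, a minimal sketch of what such an escape hatch could look like in neonvm-controller's cmd/main.go; the flag name and wiring below are hypothetical, not something this PR necessarily adds:

package main

import "flag"

// Hypothetical escape-hatch flag; the name and wiring are illustrative only.
var enableFastNodeFailureRecovery = flag.Bool(
	"enable-fast-node-failure-recovery",
	true,
	"recreate runner pods quickly when their node fails; set to false as an escape hatch",
)

func main() {
	flag.Parse()

	if !*enableFastNodeFailureRecovery {
		// Fall back to the old, conservative behavior: never start a new
		// runner pod while the previous one still exists.
	}
}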

Other than that, a couple admin notes:

  1. Some of the changes are internal to neonvm-controller; could those commits be titled with neonvm-controller: as the prefix?
  2. If you plan to rebase-and-merge, could you edit the commit titles to add the PR number before merging? (i.e. appending (#1055) so it looks similar to squash-and-merge)

neonvm/controllers/vm_controller.go (outdated review thread, resolved)
@Omrigan (Contributor, Author) commented Oct 11, 2024

> 2. If you plan to rebase-and-merge, could you edit the commit titles to add the PR number before merging? (i.e. appending `(#1055)` so it looks similar to squash-and-merge)

I was thinking maybe we don't really need it?

From the history, if you click on a commit, the commit will have a PR link.

So editing the commits saves us one click.

@Omrigan Omrigan changed the title neonvm: improve node failure reaction speed neonvm-controller: improve node failure reaction speed Oct 11, 2024
@sharnoff (Member)

> I was thinking maybe we don't really need it?
>
> From the history, if you click on a commit, the commit will have a PR link.

It's very useful to have that information available when interacting with git locally.

Omrigan added a commit that referenced this pull request Oct 14, 2024
… grace period (#1055)

By default they are 300s (5m), which is way too long.

Signed-off-by: Oleg Vasilev <[email protected]>
Omrigan added a commit that referenced this pull request Oct 14, 2024
If we are past the deletion timestamp, it means
the deletion is stuck, and we should consider the pod to be failed anyway.

Possible reasons for this are:
1. Node is down.
2. Pod is stuck pulling the image from the container registry.

Signed-off-by: Oleg Vasilev <[email protected]>
Omrigan added a commit that referenced this pull request Oct 14, 2024
…l exists (#1055)

We previously assumed that running multiple pods for one VM is
dangerous.

We now believe it should be fine, even if the old pod still tries to
continue operating: the new pod would take over the safekeeper
quorum, leaving the old one to disconnect and shut down.

Signed-off-by: Oleg Vasilev <[email protected]>
@sharnoff (Member) left a comment

LGTM! Two comments.

Comment on lines 937 to 948
// Add 5 seconds to account for clock skew and k8s lagging behind.
deadline := metav1.NewTime(metav1.Now().Add(-5 * time.Second))

if pod.DeletionTimestamp != nil && pod.DeletionTimestamp.Before(lo.ToPtr(deadline)) {
	return runnerFailed
}
@sharnoff (Member)

Could you add a brief comment here explaining what we're doing?

@Omrigan (Contributor, Author)

Added!

// Add 5 seconds to account for clock skew and k8s lagging behind.
deadline := metav1.NewTime(metav1.Now().Add(-5 * time.Second))

if pod.DeletionTimestamp != nil && pod.DeletionTimestamp.Before(lo.ToPtr(deadline)) {
@sharnoff (Member)

Suggested change:
- if pod.DeletionTimestamp != nil && pod.DeletionTimestamp.Before(lo.ToPtr(deadline)) {
+ if pod.DeletionTimestamp != nil && pod.DeletionTimestamp.Before(&deadline) {

? (up to you)
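
(Context on the suggestion: metav1.Time.Before takes a *metav1.Time, and deadline is an addressable local, so &deadline works directly and drops the lo.ToPtr indirection; either form behaves the same.)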

@Omrigan (Contributor, Author)

Done!

@sharnoff sharnoff assigned Omrigan and unassigned sharnoff Oct 14, 2024
@Omrigan Omrigan enabled auto-merge (rebase) October 15, 2024 21:20

github-actions bot commented Oct 15, 2024

Merging this branch will increase overall coverage

Impacted Packages | Coverage Δ | 🤖
github.com/neondatabase/autoscaling/neonvm-controller/cmd | 0.00% (ø) |
github.com/neondatabase/autoscaling/pkg/neonvm/controllers | 11.79% (+0.28%) | 👍
github.com/neondatabase/autoscaling/pkg/neonvm/controllers/functests | 0.00% (ø) |

Coverage by file

Changed files (no unit tests)

Changed File | Coverage Δ | Total | Covered | Missed | 🤖
github.com/neondatabase/autoscaling/neonvm-controller/cmd/main.go | 0.00% (ø) | 132 (+2) | 0 | 132 (+2) |
github.com/neondatabase/autoscaling/pkg/neonvm/controllers/config.go | 0.00% (ø) | 0 | 0 | 0 |
github.com/neondatabase/autoscaling/pkg/neonvm/controllers/vm_controller.go | 25.87% (+0.64%) | 661 (-1) | 171 (+4) | 490 (-5) | 👍

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/neondatabase/autoscaling/pkg/neonvm/controllers/functests/vm_controller_test.go
  • github.com/neondatabase/autoscaling/pkg/neonvm/controllers/vm_controller_unit_test.go


@Omrigan Omrigan merged commit 5e5465b into main Oct 15, 2024
22 checks passed
@Omrigan Omrigan deleted the oleg/node-down-2 branch October 15, 2024 21:34
Omrigan added a commit that referenced this pull request Oct 15, 2024
… grace period (#1055)

By default they are 300s (5m), which is way too long.

Signed-off-by: Oleg Vasilev <[email protected]>
Omrigan added a commit that referenced this pull request Oct 15, 2024
If we are past the deletion timestamp, it means
the deletion is stuck, and we should consider the pod to be failed anyway.

Possible reasons for this are:
1. Node is down.
2. Pod is stuck pulling the image from the container registry.

Signed-off-by: Oleg Vasilev <[email protected]>
sharnoff added a commit that referenced this pull request Nov 18, 2024
Similar to what was done in #1055, we need to explicitly add tolerations
to the scheduler to get it to be recreated more quickly on node failure.

This is particularly necessary because we don't have #955. We could wait
for that, but it's a lot of work, and this is a small thing we can do in
the meantime.

Fixes neondatabase/cloud#17298.
sharnoff added a commit that referenced this pull request Nov 19, 2024
Similar to what was done in #1055, we need to explicitly add tolerations
to the scheduler to get it to be recreated more quickly on node failure.

This is particularly necessary because we don't have #995. We could wait
for that, but it's a lot of work, and this is a small thing we can do in
the meantime.

Fixes neondatabase/cloud#17298, part of neondatabase/cloud#14114.