neonvm-controller: improve node failure reaction speed #1055
Conversation
@Omrigan IIUC this PR needs rebasing?
Broadly looks good, I think.
The one major item I'd like to see is to have this feature-gated behind some CLI flag. Reasoning is that this type of change is at higher risk of cascading failures (e.g., if restarting causes us to trigger even more restarts) — so we should have an escape hatch, just in case.
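For illustration, a minimal sketch of what such a CLI gate could look like; the flag name, helper function, and wiring here are assumptions, not the actual neonvm-controller code:

```go
package controllers

import (
	"flag"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical escape-hatch flag; the real controller would wire this into
// its own configuration handling.
var enableFastNodeFailureReaction = flag.Bool(
	"enable-fast-node-failure-reaction",
	false,
	"treat runner pods stuck in deletion as failed (disable if restarts cascade)",
)

// podDeletionStuck reports whether the pod's deletion deadline has passed.
// The check is skipped entirely unless the feature flag is enabled.
func podDeletionStuck(pod *corev1.Pod) bool {
	if !*enableFastNodeFailureReaction {
		return false
	}
	// Add 5 seconds to account for clock skew and k8s lagging behind.
	deadline := metav1.NewTime(metav1.Now().Add(-5 * time.Second))
	return pod.DeletionTimestamp != nil && pod.DeletionTimestamp.Before(&deadline)
}
```

With a default-off flag like this, rolling the behavior back only requires restarting the controller with the flag unset, rather than reverting the release.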
Other than that, a couple of admin notes:
- Some of the changes are internal to neonvm-controller; could those commits be titled with neonvm-controller: as the prefix?
- If you plan to rebase-and-merge, could you edit the commit titles to add the PR number before merging? (i.e. appending (#1055) so it looks similar to squash-and-merge)
I was thinking maybe we don't really need it? From the history, if you click on a commit, then the commit will have a PR link. So editing the commits saves us 1 click.
It's very useful to have that information available when interacting with git locally.
… grace period (#1055) By default they are 300s (5m), which is way too long. Signed-off-by: Oleg Vasilev <[email protected]>
If we are past the mark of the deletion timestamp, it means the deletion is stuck, and we should consider the pod to be failed anyway. Possible reasons for this are: 1. Node is down. 2. Pod is stuck pulling the image from the container registry. Signed-off-by: Oleg Vasilev <[email protected]>
…l exists (#1055) We have previously assumed running multiple pods for one VM is dangerous. We now believe it should be fine, even if the old pod still tries to continue operating. What would happen is that the new pod would take over the safekeepers quorum, leaving the old one to disconnect and shut down. Signed-off-by: Oleg Vasilev <[email protected]>
LGTM! Two comments.
// Add 5 seconds to account for clock skew and k8s lagging behind.
deadline := metav1.NewTime(metav1.Now().Add(-5 * time.Second))

if pod.DeletionTimestamp != nil && pod.DeletionTimestamp.Before(lo.ToPtr(deadline)) {
	return runnerFailed
}
Could you add a brief comment here explaining what we're doing?
Added!
// Add 5 seconds to account for clock skew and k8s lagging behind.
deadline := metav1.NewTime(metav1.Now().Add(-5 * time.Second))

if pod.DeletionTimestamp != nil && pod.DeletionTimestamp.Before(lo.ToPtr(deadline)) {
Suggested change:
- if pod.DeletionTimestamp != nil && pod.DeletionTimestamp.Before(lo.ToPtr(deadline)) {
+ if pod.DeletionTimestamp != nil && pod.DeletionTimestamp.Before(&deadline) {
? (up to you)
Done!
Merging this branch will increase overall coverage.

Coverage by file, changed files (no unit tests): table omitted. Please note that the "Total", "Covered", and "Missed" counts refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files: table omitted. HTML report: link omitted.
Similar to what was done in #1055, we need to explicitly add tolerations to the scheduler to get it to be recreated more quickly on node failure. This is particularly necessary because we don't have #995. We could wait for that, but it's a lot of work, and this is a small thing we can do in the meantime. Fixes neondatabase/cloud#17298, part of neondatabase/cloud#14114.
Multiple commits, their descriptions:
neonvm: recreate the pod for VM even if the older pod still exists
We have previously assumed running multiple pods for one VM is
dangerous.
We now believe it should be fine, even if the old pod still tries to
continue operating. What would happen is that the new pod would take
over the safekeepers quorum, leaving the old one to disconnect and
shut down.
neonvm: check if deletion timestamp is in the past
If we are past the mark of the deletion timestamp, it means
the deletion is stuck, and we should consider the pod failed anyway.
Possible reasons for this are:
1. Node is down.
2. Pod is stuck pulling the image from the container registry.
neonvm: add explicit "node not ready" tolerations with 30s grace period
By default they are 300s (5m), which is way too long.
Part of https://github.com/neondatabase/cloud/issues/14114
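As an illustration of the 30s tolerations described above, here is a minimal sketch in Go; the function name and placement are assumptions, not the PR's actual code:

```go
package controllers

import (
	corev1 "k8s.io/api/core/v1"
)

// nodeFailureTolerations returns tolerations that let Kubernetes evict the
// pod 30 seconds after its node becomes not-ready or unreachable, instead
// of relying on the default 300-second (5m) grace period.
func nodeFailureTolerations() []corev1.Toleration {
	gracePeriod := int64(30)
	return []corev1.Toleration{
		{
			Key:               corev1.TaintNodeNotReady, // "node.kubernetes.io/not-ready"
			Operator:          corev1.TolerationOpExists,
			Effect:            corev1.TaintEffectNoExecute,
			TolerationSeconds: &gracePeriod,
		},
		{
			Key:               corev1.TaintNodeUnreachable, // "node.kubernetes.io/unreachable"
			Operator:          corev1.TolerationOpExists,
			Effect:            corev1.TaintEffectNoExecute,
			TolerationSeconds: &gracePeriod,
		},
	}
}
```

With tolerations like these on the pod spec, the node-lifecycle controller evicts the pod roughly 30 seconds after the node is marked not-ready or unreachable, rather than waiting out the default 300 seconds.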