neonvm-controller: improve node failure reaction speed #1055
Conversation
@Omrigan IIUC this PR needs rebasing?
Broadly looks good, I think.
The one major item I'd like to see is to have this feature-gated behind some CLI flag. Reasoning is that this type of change is at higher risk of cascading failures (e.g., if restarting causes us to trigger even more restarts) — so we should have an escape hatch, just in case.
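For illustration, a minimal sketch of what such a CLI gate could look like; the flag name, helper function, and wiring here are assumptions, not the actual neonvm-controller code:

```go
package controllers

import (
	"flag"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical escape-hatch flag; the real controller would wire this into
// its own configuration handling.
var enableFastNodeFailureReaction = flag.Bool(
	"enable-fast-node-failure-reaction",
	false,
	"treat runner pods stuck in deletion as failed (disable if restarts cascade)",
)

// podDeletionStuck reports whether the pod's deletion deadline has passed.
// The check is skipped entirely unless the feature flag is enabled.
func podDeletionStuck(pod *corev1.Pod) bool {
	if !*enableFastNodeFailureReaction {
		return false
	}
	// Add 5 seconds to account for clock skew and k8s lagging behind.
	deadline := metav1.NewTime(metav1.Now().Add(-5 * time.Second))
	return pod.DeletionTimestamp != nil && pod.DeletionTimestamp.Before(&deadline)
}
```

With a default-off flag like this, rolling the behavior back only requires restarting the controller with the flag unset, rather than reverting the release.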
Other than that, a couple of admin notes:
- Some of the changes are internal to neonvm-controller; could those commits be titled with neonvm-controller: as the prefix?
- If you plan to rebase-and-merge, could you edit the commit titles to add the PR number before merging? (i.e. appending (#1055) so it looks similar to squash-and-merge)
I was thinking maybe we don't really need it? From the history, if you click on a commit, then the commit will have a PR link. So editing the commits saves us 1 click.
It's very useful to have that information available when interacting with git locally.
… grace period (#1055) By default they are 300s (5m), which is way too long. Signed-off-by: Oleg Vasilev <[email protected]>
If we are past the mark of the deletion timestamp, it means the deletion is stuck, and we should consider the pod to be failed anyway. Possible reasons for this are: 1. Node is down. 2. Pod is stuck pulling the image from the container registry. Signed-off-by: Oleg Vasilev <[email protected]>
…l exists (#1055) We have previously assumed running multiple pods for one VM is dangerous. We now believe it should be fine, even if the old pod still tries to continue operating. What would happen is that the new pod would take over the safekeepers quorum, leaving the old one to disconnect and shut down. Signed-off-by: Oleg Vasilev <[email protected]>
LGTM! Two comments.
// Add 5 seconds to account for clock skew and k8s lagging behind.
deadline := metav1.NewTime(metav1.Now().Add(-5 * time.Second))

if pod.DeletionTimestamp != nil && pod.DeletionTimestamp.Before(lo.ToPtr(deadline)) {
	return runnerFailed
}
Could you add a brief comment here explaining what we're doing?
Added!
// Add 5 seconds to account for clock skew and k8s lagging behind.
deadline := metav1.NewTime(metav1.Now().Add(-5 * time.Second))

if pod.DeletionTimestamp != nil && pod.DeletionTimestamp.Before(lo.ToPtr(deadline)) {
Suggested change:
- if pod.DeletionTimestamp != nil && pod.DeletionTimestamp.Before(lo.ToPtr(deadline)) {
+ if pod.DeletionTimestamp != nil && pod.DeletionTimestamp.Before(&deadline) {
? (up to you)
Done!
Merging this branch will increase overall coverage.

Coverage by file, changed files (no unit tests): table omitted. Please note that the "Total", "Covered", and "Missed" counts refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files: table omitted. HTML report: link omitted.
Similar to what was done in #1055, we need to explicitly add tolerations to the scheduler to get it to be recreated more quickly on node failure. This is particularly necessary because we don't have #995. We could wait for that, but it's a lot of work, and this is a small thing we can do in the meantime. Fixes neondatabase/cloud#17298, part of neondatabase/cloud#14114.
Multiple commits, their descriptions:
neonvm: recreate the pod for VM even if the older pod still exists
We have previously assumed running multiple pods for one VM is
dangerous.
We now believe it should be fine, even if the old pod still tries to
continue operating. What would happen is that the new pod would take
over the safekeepers quorum, leaving the old one to disconnect and
shut down.
neonvm: check if deletion timestamp is in the past
If we are past the mark of the deletion timestamp, it means
the deletion is stuck, and we should consider the pod failed anyway.
Possible reasons for this are:
1. Node is down.
2. Pod is stuck pulling the image from the container registry.
neonvm: add explicit "node not ready" tolerations with 30s grace period
By default they are 300s (5m), which is way too long.
Part of https://github.com/neondatabase/cloud/issues/14114
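As an illustration of the 30s tolerations described above, here is a minimal sketch in Go; the function name and placement are assumptions, not the PR's actual code:

```go
package controllers

import (
	corev1 "k8s.io/api/core/v1"
)

// nodeFailureTolerations returns tolerations that let Kubernetes evict the
// pod 30 seconds after its node becomes not-ready or unreachable, instead
// of relying on the default 300-second (5m) grace period.
func nodeFailureTolerations() []corev1.Toleration {
	gracePeriod := int64(30)
	return []corev1.Toleration{
		{
			Key:               corev1.TaintNodeNotReady, // "node.kubernetes.io/not-ready"
			Operator:          corev1.TolerationOpExists,
			Effect:            corev1.TaintEffectNoExecute,
			TolerationSeconds: &gracePeriod,
		},
		{
			Key:               corev1.TaintNodeUnreachable, // "node.kubernetes.io/unreachable"
			Operator:          corev1.TolerationOpExists,
			Effect:            corev1.TaintEffectNoExecute,
			TolerationSeconds: &gracePeriod,
		},
	}
}
```

With tolerations like these on the pod spec, the node-lifecycle controller evicts the pod roughly 30 seconds after the node is marked not-ready or unreachable, rather than waiting out the default 300 seconds.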