-
Notifications
You must be signed in to change notification settings - Fork 368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: join node creates new cluster when initial etcd sync config fails #5151
base: main
Are you sure you want to change the base?
fix: join node creates new cluster when initial etcd sync config fails #5151
Conversation
f40047f
to
9706878
Compare
@jnummelin @twz123 I'm a bit stuck at this point with how to proceed with handling this case. Could you take another look at this PR and give me some guidance? Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks simple enough! However, I'd leave out 9706878 for now, since k0s will retry all join errors no matter what caused them. So returning a 503 instead of a 500 is probably not really worth it?
@@ -107,6 +107,8 @@ func (e *Etcd) syncEtcdConfig(ctx context.Context, etcdRequest v1beta1.EtcdReque | |||
etcdResponse, err = e.JoinClient.JoinEtcd(ctx, etcdRequest) | |||
return err | |||
}, | |||
retry.Delay(1*time.Second), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why the delay was increased in the commit message, or in a comment? If I'm not mistaken this will now block for ~ 17 minutes, whereas it was blocking only around 100 secs before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did the calculation again, as I forgot to take into account the max delay. I think it's 5 minutes overall, not 15 ... 👼
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I increased the number of attempts to 20. So I think it's 127 seconds plus like 12 minutes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. Will it be likely that joining will succeed after 15 minutes, when it didn't after 5 minutes? I'm just thinking if it makes sense to wait for that long, as surrounding tooling may have its own, shorter timeouts. How long does k0sctl wait until it aborts the join process? /cc @kke
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is some discussion here related to why the timeout was increased.
Looks like k0sctl waits 2 minutes but does not face the issue we are facing because join is sequential.
Signed-off-by: Ethan Mosbaugh <[email protected]>
Signed-off-by: Ethan Mosbaugh <[email protected]>
Signed-off-by: Ethan Mosbaugh <[email protected]>
This reverts commit 9706878. Signed-off-by: Ethan Mosbaugh <[email protected]>
Signed-off-by: Ethan Mosbaugh <[email protected]>
Signed-off-by: Ethan Mosbaugh <[email protected]>
6f772c3
to
7429cb8
Compare
@twz123 feedback addressed. Can you please take another look. Thanks |
@@ -107,6 +107,8 @@ func (e *Etcd) syncEtcdConfig(ctx context.Context, etcdRequest v1beta1.EtcdReque | |||
etcdResponse, err = e.JoinClient.JoinEtcd(ctx, etcdRequest) | |||
return err | |||
}, | |||
retry.Delay(1*time.Second), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did the calculation again, as I forgot to take into account the max delay. I think it's 5 minutes overall, not 15 ... 👼
Description
Fixes #5149
If etcd join fails to sync the etcd config and the k0s process exits, the pki ca files exist and etcd creates a new cluster rather than joining the existing one. Rather than check the pki dir for embedded etcd, check the etcd data directory exists as we do here.
I am open to suggestions here if I am checking the wrong thing as I cannot test this and am taking a guess at a solution.
Type of change
How Has This Been Tested?
Checklist: