
fix: join node creates new cluster when initial etcd sync config fails #5151

Open

emosbaugh wants to merge 6 commits into main from issue-5149-etcd-creates-new-cluster-rather-than-join-if-sync-fails
Conversation

emosbaugh (Contributor) commented:

Description

Fixes #5149

If the etcd join fails to sync the etcd config and the k0s process exits, the PKI CA files already exist on disk, so on the next start etcd creates a new cluster rather than joining the existing one. Rather than checking the PKI dir for embedded etcd, check whether the etcd data directory exists, as we do here.

I am open to suggestions if I am checking the wrong thing, as I cannot test this and am taking a guess at a solution.
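To illustrate the proposed check, here's a minimal sketch (the function name, the /var/lib/k0s/etcd path, and the `member` subdirectory check are assumptions for illustration, not the actual k0s code):

```go
// Sketch: decide whether this node still needs to join an existing etcd
// cluster. The PKI CA files are written before the etcd config sync
// completes, so their presence does not prove a successful join; an
// existing etcd data directory does.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// needsEtcdJoin is a hypothetical helper: etcd writes its state under a
// "member" subdirectory of its data dir once it has started.
func needsEtcdJoin(etcdDataDir string) (bool, error) {
	_, err := os.Stat(filepath.Join(etcdDataDir, "member"))
	switch {
	case err == nil:
		return false, nil // data dir exists: already joined or bootstrapped
	case os.IsNotExist(err):
		return true, nil // no data dir yet: (re)try the join
	default:
		return false, err
	}
}

func main() {
	join, err := needsEtcdJoin("/var/lib/k0s/etcd")
	if err != nil {
		panic(err)
	}
	fmt.Println("needs join:", join)
}
```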

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

How Has This Been Tested?

  • Manual test
  • Auto test added

Checklist:

  • My code follows the style guidelines of this project
  • My commit messages are signed-off
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

@emosbaugh (Contributor Author) commented:

@jnummelin @twz123 I'm a bit stuck at this point with how to proceed with handling this case. Could you take another look at this PR and give me some guidance? Thank you!

@twz123 (Member) left a comment:

Looks simple enough! However, I'd leave out 9706878 for now, since k0s will retry all join errors no matter what caused them. So returning a 503 instead of a 500 is probably not really worth it?
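For context, a rough sketch of the distinction the dropped commit would have made (the handler and endpoint path here are illustrative, not the actual k0s join API):

```go
// Sketch: a join endpoint returning 503 for a transient "etcd not synced
// yet" condition instead of a generic 500. Since the k0s join client
// retries on any error, the distinction is mostly cosmetic here.
package main

import "net/http"

func etcdJoinHandler(w http.ResponseWriter, r *http.Request) {
	etcdReady := false // placeholder for a real readiness check
	if !etcdReady {
		// 503 plus Retry-After hints that the client should back off and retry.
		w.Header().Set("Retry-After", "10")
		http.Error(w, "etcd config sync not complete", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/v1beta1/etcd", etcdJoinHandler)
	_ = http.ListenAndServe(":9443", nil)
}
```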

cmd/controller/controller.go (review thread, resolved)
@@ -107,6 +107,8 @@ func (e *Etcd) syncEtcdConfig(ctx context.Context, etcdRequest v1beta1.EtcdReque
 		etcdResponse, err = e.JoinClient.JoinEtcd(ctx, etcdRequest)
 		return err
 	},
+	retry.Delay(1*time.Second),
twz123 (Member):

Can you explain why the delay was increased, either in the commit message or in a code comment? If I'm not mistaken, this will now block for ~17 minutes, whereas it was blocking only around 100 seconds before?

emosbaugh (Contributor Author):

done

twz123 (Member):
I did the calculation again, as I forgot to take into account the max delay. I think it's 5 minutes overall, not 15 ... 👼

emosbaugh (Contributor Author):

I increased the number of attempts to 20, so I think it's 127 seconds plus roughly 12 minutes.
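For reference, the arithmetic can be sketched as follows (assuming retry-go-style doubling backoff without jitter; the 64-second cap is an assumption chosen to match the numbers in this thread):

```go
// Sketch: cumulative wait for n attempts with exponential backoff.
// With 20 attempts, a 1s base, and a 64s cap, the first 7 waits sum to
// 127s (1+2+4+8+16+32+64) and the remaining 12 sit at the cap, for
// roughly 15 minutes overall.
package main

import (
	"fmt"
	"time"
)

func totalWait(attempts int, base, maxDelay time.Duration) time.Duration {
	var total time.Duration
	delay := base
	for i := 0; i < attempts-1; i++ { // n attempts => n-1 waits
		if delay > maxDelay {
			delay = maxDelay
		}
		total += delay
		delay *= 2
	}
	return total
}

func main() {
	fmt.Println(totalWait(20, time.Second, 64*time.Second)) // 14m55s
}
```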

twz123 (Member):

Okay. Is it likely that joining will succeed after 15 minutes when it didn't after 5 minutes? I'm just wondering whether it makes sense to wait that long, as surrounding tooling may have its own, shorter timeouts. How long does k0sctl wait until it aborts the join process? /cc @kke

emosbaugh (Contributor Author):
There is some discussion here related to why the timeout was increased.

Looks like k0sctl waits 2 minutes but does not face the issue we are facing because join is sequential.

emosbaugh force-pushed the issue-5149-etcd-creates-new-cluster-rather-than-join-if-sync-fails branch from 6f772c3 to 7429cb8 on November 8, 2024
emosbaugh marked this pull request as ready for review on November 8, 2024
emosbaugh requested review from a team as code owners on November 8, 2024
@emosbaugh (Contributor Author) commented:

@twz123 Feedback addressed. Can you please take another look? Thanks!


emosbaugh added the backport/release-1.29, backport/release-1.30, and backport/release-1.31 labels (PRs to be backported/cherry-picked to those release branches) on Nov 18, 2024