If etcd fails to sync config during initial start sequence and k0s restarts, node creates a new cluster rather than joining existing #5149
Comments
What would be a better way to determine if an existing cluster should be joined? I could imagine that k0s could delete the certs if joining the cluster fails, too...
Also: k0s could retry 5xx responses in a back-off loop...
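For illustration only, a minimal Go sketch of such a back-off loop; the function name, URL handling, and status-code policy are assumptions for the example, not the actual k0s join client:

```go
// Hypothetical sketch: retry a join request on 5xx responses with an
// exponential back-off. Names and behavior are illustrative only.
package join

import (
	"fmt"
	"net/http"
	"time"
)

func joinWithBackoff(url string, maxAttempts int) error {
	delay := time.Second
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		resp, err := http.Post(url, "application/json", nil)
		if err == nil {
			code := resp.StatusCode
			resp.Body.Close()
			switch {
			case code < 400:
				return nil // joined successfully
			case code < 500:
				return fmt.Errorf("join rejected: %d", code) // permanent failure, stop retrying
			}
			// 5xx: fall through and retry
		}
		time.Sleep(delay)
		delay *= 2 // exponential back-off
	}
	return fmt.Errorf("join did not succeed after %d attempts", maxAttempts)
}
```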
@emosbaugh regarding #5151 and #5149 (comment): Would it make more sense to introduce a special marker file in the k0s data dir that k0s writes as soon as the join process is finished, instead of trying to check several places?
This already happens today, but eventually it gives up in this case: https://github.com/k0sproject/k0s/blob/main/pkg/component/controller/etcd.go#L103-L115
That makes sense to me. Is there a directory and path that would be appropriate to store this file?
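As a rough sketch of the marker-file idea (the file name and helpers below are made up for this example; nothing like this exists in k0s today):

```go
// Hypothetical sketch of the marker-file approach: write a file into the
// k0s data directory once the join has fully completed, and check for it
// on subsequent starts. The file name is an assumption, not a k0s constant.
package join

import (
	"errors"
	"os"
	"path/filepath"
)

const joinedMarker = "joined" // assumed name inside the k0s data dir

// MarkJoined records that this node has completed the join process.
func MarkJoined(dataDir string) error {
	return os.WriteFile(filepath.Join(dataDir, joinedMarker), nil, 0644)
}

// AlreadyJoined reports whether the marker file exists.
func AlreadyJoined(dataDir string) (bool, error) {
	_, err := os.Stat(filepath.Join(dataDir, joinedMarker))
	if err == nil {
		return true, nil
	}
	if errors.Is(err, os.ErrNotExist) {
		return false, nil
	}
	return false, err
}
```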
Thinking a bit more about this... I feel like it is better to have a single source of truth, ideally etcd itself. We could use the etcd database file for that, as I have it, or perhaps use the result of syncEtcdConfig to detect whether the current node is already a member of the cluster.
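A minimal sketch of the database-file variant, assuming etcd's default `member/snap/db` layout under the etcd data dir (the actual k0s paths and constants may differ):

```go
// Hypothetical sketch: treat the presence of the etcd database file as the
// signal that this node has already been part of a cluster. The path below
// assumes etcd's default on-disk layout, not the exact k0s constants.
package join

import (
	"os"
	"path/filepath"
)

// EtcdDBExists reports whether an etcd database already exists under the
// given etcd data directory.
func EtcdDBExists(etcdDataDir string) bool {
	_, err := os.Stat(filepath.Join(etcdDataDir, "member", "snap", "db"))
	return err == nil
}
```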
Not sure I grokked this correctly, but I think the problem is not only on the joining side of the code but also on the join API side of things. What happens is that with 1 etcd member, you now create join requests for 2 more in parallel. On the etcd side, we create the new member on node 1, but node 2 has not really joined the cluster yet (its etcd hasn't been started), so there's no quorum. And at the same time we do the same for node 3. So we basically bork etcd ourselves.
I agree with this. And reflecting on what I wrote above, I think the join API should actually check the etcd state more closely to see whether we can allow another member to join. When we have 1 member up, we can only allow 1 more to be fully joined (member created and quorum actually reached). Only after that can we allow the next one, and so on. On the join API side we'd probably want to use some suitable HTTP status code to say "I cannot allow you to join at this time as there's no quorum, try again in a bit". Maybe 503 with some
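Something along these lines, as a hedged sketch; the handler shape, the `hasQuorum` callback, and the `Retry-After` header are assumptions for the example, not the actual k0s join API:

```go
// Hypothetical sketch of a join endpoint that refuses new members while
// etcd has no quorum, answering 503 with a Retry-After hint. The hasQuorum
// callback is a placeholder, not the real k0s quorum check.
package join

import "net/http"

func JoinHandler(hasQuorum func() bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if !hasQuorum() {
			// Tell the joining node to back off and retry later.
			w.Header().Set("Retry-After", "10")
			http.Error(w, "etcd has no quorum, retry later", http.StatusServiceUnavailable)
			return
		}
		// ... create the etcd member and respond with the join payload ...
		w.WriteHeader(http.StatusOK)
	}
}
```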
I kinda like the marker file because it's a) super-dumb to implement, b) super-easy to delete in case somebody wants to force a rejoin, and c) backend-agnostic. It would also work with, say, kine/NATS types of setups.
Agree: #5149 (comment)
I agree that the issue is that the API is unstable when many nodes are joining and there is no quorum. Eventually the API will become stable in my scenario and the node will be able to join; it just does not wait long enough. When it does give up and restart, it incorrectly checks the PKI certs to decide if it has already joined, which it has not. When it sees that they exist and determines that it does not need to join, it starts a new cluster rather than joining the existing one (joinClient is nil). If you run

Are you suggesting that we continue to retry forever until the join is successful? What if in other scenarios the API does not become healthy? What if the process restarts for another reason? It would still exhibit the same behavior and create a new cluster. Therefore, in my opinion, the real issue here is not one of backoff/retry but the check for "do I need to join?".
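For context, the check being questioned is roughly of this shape; this is a paraphrase based on the description in this thread, not the actual k0s function:

```go
// Paraphrased sketch of the suspect decision: whether to join is derived
// from whether the PKI certs already exist on disk.
func needsToJoin(certsAlreadyExist bool) bool {
	// Assumption being questioned: "certs exist" implies "already joined".
	// The certs can be present even though the join never completed, so a
	// restart after a failed join skips joining and starts a new cluster.
	return !certsAlreadyExist
}
```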
I don't think we want to wait forever, but for longer than we currently do for sure.
Absolutely, that is part of the problem, and maybe the most important part. I'm just pointing out that the API side has some issues as well, which we also want to address. By fixing both sides we make it much more robust.
@twz123 Although I'm a little unclear on the state of the filesystem upon certain errors, I'm concerned that if the marker file were missing and there were an etcd database, then the only way to proceed would be for k0s to delete the db. I'm of the opinion that k0s should not take that drastic an action, and therefore it is probably best to use the db as the source of truth and require the user to take manual intervention when in this state.
Fair enough. It's just unfortunate that these storage-specific implementation details leak into the more storage-agnostic parts of k0s. A proper abstraction around that would make it nicer IMO. But let's leave that for the future.
Platform
No response
Version
v1.28.14+k0s.0
Sysinfo
`k0s sysinfo`
What happened?
When I join many controller nodes in parallel, the Kubernetes API can become unstable for a period. This results in the initial etcd join failing to sync the etcd config, and the k0s process exits.
When k0s starts back up, rather than join the cluster, it seems to create a new cluster.
This seems to be due to a bad assumption in this function.
Steps to reproduce
Expected behavior
When joining a node, it will join the existing cluster
Actual behavior
Joining a node creates a new cluster in some circumstances
Screenshots and logs
k0scontroller-logs.txt
k0scontroller-logs.txt
k0scontroller-logs.txt
k0scontroller-logs.txt
k0scontroller-logs.txt
Additional context
No response