Hello, first of all thanks a lot for your work!

I am quite new to both Cluster API and Talos. While I was playing around with MachineHealthChecks, I realised that a control plane node shows all of its conditions as True even though it can't join etcd and is not ready.
There are 3 CP nodes:
talosctl get members -n 10.1.0.6
NODE       NAMESPACE   TYPE     ID                     VERSION   HOSTNAME               MACHINE TYPE   OS               ADDRESSES
10.1.0.6   cluster     Member   test-cp-v1-8-3-56nx5   2         test-cp-v1-8-3-56nx5   controlplane   Talos (v1.8.3)   ["10.1.0.4"]
10.1.0.6   cluster     Member   test-cp-v1-8-3-5lvk7   1         test-cp-v1-8-3-5lvk7   controlplane   Talos (v1.8.3)   ["10.1.0.6"]
10.1.0.6   cluster     Member   test-cp-v1-8-3-8sqbv   1         test-cp-v1-8-3-8sqbv   controlplane   Talos (v1.8.3)   ["10.1.0.7"]
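Only 2 of them are actual etcd members, though; the etcd membership list can be checked with something like (using the same endpoint as above):

talosctl etcd members -n 10.1.0.6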
The etcd logs of test-cp-v1-8-3-8sqbv are full of the following:
10.1.0.7: {"level":"error","ts":"2024-11-26T16:38:50.279278Z","caller":"etcdserver/server.go:2378","msg":"Validation on configuration change failed","shouldApplyV3":false,"error":"membership: too many learner members in cluster","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:2378\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:2247\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:1462\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:1277\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:1149\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).run\n\tgo.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:157"}
10.1.0.7: {"level":"info","ts":"2024-11-26T16:38:50.279320Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"4986d1fdddfec87b switched to configuration voters=(13547904266639945678) learners=(11059204728065492980)"}
10.1.0.7: {"level":"warn","ts":"2024-11-26T16:38:50.279745Z","caller":"etcdserver/server.go:1154","msg":"server error","error":"the member has been permanently removed from the cluster"}
etcd logs of the leader:
10.1.0.6: {"level":"warn","ts":"2024-11-26T16:51:01.859751Z","caller":"rafthttp/http.go:394","msg":"rejected stream from remote peer because it was removed","local-member-id":"bc03d3269012afce","remote-peer-id-stream-handler":"bc03d3269012afce","remote-peer-id-from":"4986d1fdddfec87b"}
Because it has been permanently removed, the member can't rejoin and keeps getting rejected over and over. This happens occasionally during my tests, and so far the fix has been to simply replace the node. For that purpose I wanted to configure a MachineHealthCheck, but then I realised the machine is never detected as unhealthy because all of its conditions show True:
kubectl get Machine -n test test-cp-7bqxt
NAME            CLUSTER   NODENAME               PROVIDERID          PHASE     AGE   VERSION
test-cp-7bqxt   test      test-cp-v1-8-3-8sqbv   hcloud://56685282   Running   56m   v1.30.7
kubectl get Machine -n test test-cp-7bqxt -o json | jq '.status.conditions'
[
  {
    "lastTransitionTime": "2024-11-26T15:52:22Z",
    "status": "True",
    "type": "Ready"
  },
  {
    "lastTransitionTime": "2024-11-26T15:50:29Z",
    "status": "True",
    "type": "BootstrapReady"
  },
  {
    "lastTransitionTime": "2024-11-26T15:54:43Z",
    "status": "True",
    "type": "HealthCheckSucceeded"
  },
  {
    "lastTransitionTime": "2024-11-26T15:52:22Z",
    "status": "True",
    "type": "InfrastructureReady"
  },
  {
    "lastTransitionTime": "2024-11-26T15:55:06Z",
    "status": "True",
    "type": "NodeHealthy"
  }
]
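For reference, this is roughly the MachineHealthCheck I had in mind (just a sketch for my setup; the name, maxUnhealthy and timeouts are placeholders, and the selector assumes the standard control plane Machine label):

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: test-cp-unhealthy   # placeholder name
  namespace: test
spec:
  clusterName: test
  maxUnhealthy: 1
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""   # label set on control plane Machines
  unhealthyConditions:
    # These are Node conditions; in the situation above the Node stays Ready,
    # so the check never fires even though etcd on that machine is broken.
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s

Since the Node stays Ready and NodeHealthy is True, a check like this never triggers remediation for the broken etcd member.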