Controlplane machine shows Ready even while etcd is failing #203

Open
bitnik opened this issue Nov 26, 2024 · 1 comment
bitnik commented Nov 26, 2024

Hello, first of all thanks a lot for your work!

I am quite new to both Cluster API and Talos. While playing around with MachineHealthChecks, I realised that a control plane machine shows all conditions as True even though its node can't join etcd and is not ready.

There are 3 control plane nodes:

talosctl get members -n 10.1.0.6
NODE       NAMESPACE   TYPE     ID                     VERSION   HOSTNAME               MACHINE TYPE   OS               ADDRESSES
10.1.0.6   cluster     Member   test-cp-v1-8-3-56nx5   2         test-cp-v1-8-3-56nx5   controlplane   Talos (v1.8.3)   ["10.1.0.4"]
10.1.0.6   cluster     Member   test-cp-v1-8-3-5lvk7   1         test-cp-v1-8-3-5lvk7   controlplane   Talos (v1.8.3)   ["10.1.0.6"]
10.1.0.6   cluster     Member   test-cp-v1-8-3-8sqbv   1         test-cp-v1-8-3-8sqbv   controlplane   Talos (v1.8.3)   ["10.1.0.7"]

There are 2 etcd members:

talosctl etcd members -n 10.1.0.6
NODE       ID                 HOSTNAME               PEER URLS               CLIENT URLS             LEARNER
10.1.0.6   392559fe4b474923   test-cp-v1-8-3-56nx5   https://10.1.0.4:2380   https://10.1.0.4:2379   false
10.1.0.6   bc03d3269012afce   test-cp-v1-8-3-5lvk7   https://10.1.0.6:2380   https://10.1.0.6:2379   false

The etcd log of test-cp-v1-8-3-8sqbv is full of the following entries:

10.1.0.7: {"level":"error","ts":"2024-11-26T16:38:50.279278Z","caller":"etcdserver/server.go:2378","msg":"Validation on configuration change failed","shouldApplyV3":false,"error":"membership: too many learner members in cluster","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:2378\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:2247\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:1462\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:1277\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:1149\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).run\n\tgo.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:157"}
10.1.0.7: {"level":"info","ts":"2024-11-26T16:38:50.279320Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"4986d1fdddfec87b switched to configuration voters=(13547904266639945678) learners=(11059204728065492980)"}
10.1.0.7: {"level":"warn","ts":"2024-11-26T16:38:50.279745Z","caller":"etcdserver/server.go:1154","msg":"server error","error":"the member has been permanently removed from the cluster"}

The etcd log of the leader:

10.1.0.6: {"level":"warn","ts":"2024-11-26T16:51:01.859751Z","caller":"rafthttp/http.go:394","msg":"rejected stream from remote peer because it was removed","local-member-id":"bc03d3269012afce","remote-peer-id-stream-handler":"bc03d3269012afce","remote-peer-id-from":"4986d1fdddfec87b"}

Because the member has been permanently removed, the node can't rejoin and keeps getting rejected over and over. This happens occasionally during my tests, and the only fix is to replace the node. For that purpose I wanted to configure a MachineHealthCheck, but then I realized the machine is never detected as unhealthy because all of its conditions show True:

kubectl get Machine -n test test-cp-7bqxt

NAME            CLUSTER   NODENAME               PROVIDERID          PHASE     AGE   VERSION
test-cp-7bqxt   test      test-cp-v1-8-3-8sqbv   hcloud://56685282   Running   56m   v1.30.7

kubectl get Machine -n test test-cp-7bqxt -o json | jq '.status.conditions'

[
  {
    "lastTransitionTime": "2024-11-26T15:52:22Z",
    "status": "True",
    "type": "Ready"
  },
  {
    "lastTransitionTime": "2024-11-26T15:50:29Z",
    "status": "True",
    "type": "BootstrapReady"
  },
  {
    "lastTransitionTime": "2024-11-26T15:54:43Z",
    "status": "True",
    "type": "HealthCheckSucceeded"
  },
  {
    "lastTransitionTime": "2024-11-26T15:52:22Z",
    "status": "True",
    "type": "InfrastructureReady"
  },
  {
    "lastTransitionTime": "2024-11-26T15:55:06Z",
    "status": "True",
    "type": "NodeHealthy"
  }
]
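
For reference, the kind of MachineHealthCheck I had in mind looks roughly like this (a minimal sketch; the name, label selector and timeouts are placeholders I picked, adjust as needed):

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: test-cp-unhealthy
  namespace: test
spec:
  clusterName: test
  maxUnhealthy: 1
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s

Since all of the machine's conditions stay True, a check like this never fires, even though etcd on that node is broken.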

bitnik commented Nov 28, 2024

Please note that MachineHealthChecks currently only support Machines that are owned by a MachineSet or a KubeadmControlPlane.

I missed this important part in the documentation. Still, it is confusing why the machine shows Ready as True.
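
For anyone else hitting this: one way to confirm why the MachineHealthCheck skips these machines is to check the Machine's owner reference. In this setup it should be a TalosControlPlane rather than a MachineSet or KubeadmControlPlane (that kind name is my assumption about this provider):

kubectl get machine -n test test-cp-7bqxt -o jsonpath='{.metadata.ownerReferences[*].kind}'
# expected output (assumption): TalosControlPlane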
