We have a custom node condition set up using node-problem-detector, which curls the goss check endpoints on our worker nodes every 60 seconds. We've then set up draino to watch for this node condition. However, when the custom condition GossCheckFailure is True, the Cordon and Drain events don't appear and the node isn't cordoned or drained. If the draino pod is killed and restarted, the cordon and drain are applied immediately.
If I stop the fluentbit service on a worker, the goss checks fail and the GossCheckFailure condition becomes True for that worker.
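For reference, the NPD side is wired up as a custom plugin monitor. A sketch of such a config, assuming a script that curls the goss endpoint (the script path, source name, and messages here are illustrative, not our exact values):

```json
{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "60s",
    "timeout": "30s",
    "max_output_length": 80,
    "concurrency": 1
  },
  "source": "health-checker",
  "conditions": [
    {
      "type": "GossCheckFailure",
      "reason": "GossCheckPassed",
      "message": "goss checks passing"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "GossCheckFailure",
      "reason": "GossCheckFailed",
      "path": "/custom-config/check_goss.sh",
      "timeout": "30s"
    }
  ]
}
```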
Worker node events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal GossCheckFailed <invalid> (x2 over 2d) health-checker, ip-10-252-18-10.eu-west-1.compute.internal Node condition GossCheckFailure is now: True, reason: GossCheckFailed
Warning GossCheckFailed <invalid> (x2 over 2d) health-checker, ip-10-252-18-10.eu-west-1.compute.internal
But the CordonStarting / DrainScheduled events never appear on the worker, and the draino pod logs show nothing, even in debug mode.
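For context, draino takes the node conditions it acts on as positional command-line arguments. A sketch of the relevant container spec, assuming the helm chart passes the condition straight through (image tag and flag values are illustrative):

```yaml
# Illustrative draino container spec; "GossCheckFailure" must match the
# node condition type exactly (case-sensitive) for draino to react to it.
containers:
  - name: draino
    image: planetlabs/draino:latest
    command: [/draino]
    args:
      - --debug
      - --node-label-expr=metadata.labels['node.kubernetes.io/role'] == 'worker'
      - GossCheckFailure   # node condition(s) draino cordons/drains on
```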
Draino pod logs:
$ kubectl logs draino-cc967c887-c6596 -f -n kube-addons
2021-06-17T12:39:58.118Z INFO draino/draino.go:134 web server is running {"listen": ":10002"}
2021-06-17T12:39:58.240Z DEBUG draino/draino.go:187 node labels {"labels": {"node.kubernetes.io/role":"worker"}}
2021-06-17T12:39:58.241Z DEBUG draino/draino.go:196 label expression {"expr": "metadata.labels['node.kubernetes.io/role'] == 'worker'"}
I0617 12:39:58.241899 1 leaderelection.go:235] attempting to acquire leader lease kube-addons/draino...
I0617 12:40:15.657534 1 leaderelection.go:245] successfully acquired lease kube-addons/draino
2021-06-17T12:40:15.658Z INFO draino/draino.go:235 node watcher is running
When I then kill the draino pod, the cordon and drain happen immediately.
Worker node events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal GossCheckFailed 2m54s (x2 over 2d) health-checker, ip-10-252-18-10.eu-west-1.compute.internal Node condition GossCheckFailure is now: True, reason: GossCheckFailed
Warning GossCheckFailed <invalid> (x5 over 2d) health-checker, ip-10-252-18-10.eu-west-1.compute.internal
Warning CordonStarting <invalid> draino Cordoning node
Warning CordonSucceeded <invalid> draino Cordoned node
Warning DrainScheduled <invalid> draino Will drain node after 2021-06-17T12:52:18.192872059Z
Draino new pod logs:
$ kubectl logs draino-cc967c887-vk9gj -f -n kube-addons
2021-06-17T12:51:49.286Z INFO draino/draino.go:134 web server is running {"listen": ":10002"}
2021-06-17T12:51:49.288Z DEBUG draino/draino.go:187 node labels {"labels": {"node.kubernetes.io/role":"worker"}}
2021-06-17T12:51:49.288Z DEBUG draino/draino.go:196 label expression {"expr": "metadata.labels['node.kubernetes.io/role'] == 'worker'"}
I0617 12:51:49.382835 1 leaderelection.go:235] attempting to acquire leader lease kube-addons/draino...
I0617 12:52:06.918002 1 leaderelection.go:245] successfully acquired lease kube-addons/draino
2021-06-17T12:52:06.918Z INFO draino/draino.go:235 node watcher is running
2021-06-17T12:52:07.192Z DEBUG kubernetes/eventhandler.go:263 Cordoning {"node": "ip-10-252-18-10.eu-west-1.compute.internal"}
2021-06-17T12:52:07.192Z INFO kubernetes/eventhandler.go:272 Cordoned {"node": "ip-10-252-18-10.eu-west-1.compute.internal"}
2021-06-17T12:52:07.192Z DEBUG kubernetes/eventhandler.go:296 Scheduling drain {"node": "ip-10-252-18-10.eu-west-1.compute.internal"}
2021-06-17T12:52:07.192Z INFO kubernetes/eventhandler.go:308 Drain scheduled {"node": "ip-10-252-18-10.eu-west-1.compute.internal", "after": "2021-06-17T12:52:18.192Z"}
2021-06-17T12:52:18.193Z INFO kubernetes/drainSchedule.go:154 Drained {"node": "ip-10-252-18-10.eu-west-1.compute.internal"}
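The decision itself is simple enough to sketch. This is a simplified stand-in for draino's handler, not its actual code: cordon when any watched condition is True. Since the freshly restarted pod cordons straight away from the same node object, the condition data is clearly sufficient; it looks like the long-running watcher is just never seeing the node update.

```python
# Simplified sketch (NOT draino's real implementation) of the check a
# draino-style watcher performs against a node's status.conditions.

WATCHED_CONDITIONS = {"GossCheckFailure"}  # condition name from our NPD setup

def should_cordon(node_conditions):
    """Return True if any watched condition has status "True"."""
    return any(
        c["type"] in WATCHED_CONDITIONS and c["status"] == "True"
        for c in node_conditions
    )

# Conditions as reported on the failing worker:
conditions = [
    {"type": "Ready", "status": "True"},
    {"type": "GossCheckFailure", "status": "True"},
]
print(should_cordon(conditions))  # True
```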
These are the helm chart values we have configured.
Any ideas why this is happening?