
Forensics controller crash loop because worker pods are evicted by kube-controller-manager #5

Open
tekenstam (Member) opened this issue Aug 15, 2019 · 1 comment

Is this a BUG REPORT or FEATURE REQUEST?: BUG

What happened: The job controller submits the worker pod without any tolerations and uses spec.nodeName to place it on a specific node. Because nodeName bypasses the scheduler, the node's NoExecute taint is never checked at scheduling time, and the taint evicts the pod as soon as it is bound to the node.
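
For context, here is a minimal sketch (not the actual controller code; the function name, image, and pod name prefix are illustrative) of the kind of worker pod the job controller creates today. spec.nodeName binds the pod directly to the node, bypassing the scheduler and its taint checks, and with no tolerations the node's NoExecute taint evicts it immediately:

package forensics

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// workerPod builds a worker pod the way the issue describes: spec.nodeName
// bypasses the scheduler, and no tolerations are set, so the node's
// NoExecute taint evicts the pod as soon as it is bound.
func workerPod(nodeName string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "splunkpod-test-job-"},
		Spec: corev1.PodSpec{
			NodeName:      nodeName, // direct binding, scheduler is skipped
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:  "worker",
				Image: "forensics-worker:latest", // illustrative image
			}},
			// Tolerations is left empty, so a NoExecute taint on the node
			// evicts this pod right after it lands there.
		},
	}
}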

What you expected to happen:

  • forensics-controller should submit a bare pod and check its status, so it can decide whether to re-submit another pod in case of any failures.
  • Provide an option in the PodCheckpoint spec for the user to supply a list of tolerations (see the sketch below).
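
A hypothetical sketch of what the second option could look like in the PodCheckpoint CRD Go types. The Tolerations field (and its json tag) is a proposal, not an existing field, and the other fields are only illustrative placeholders:

package v1alpha1 // illustrative API package name

import corev1 "k8s.io/api/core/v1"

// PodCheckpointSpec sketch: Tolerations would be copied verbatim onto the
// worker pod so it can run on tainted nodes.
type PodCheckpointSpec struct {
	Pod         string              `json:"pod"`                   // target pod (illustrative)
	Namespace   string              `json:"namespace,omitempty"`   // target namespace (illustrative)
	Tolerations []corev1.Toleration `json:"tolerations,omitempty"` // proposed new field
}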

How to reproduce it (as minimally and precisely as possible):

  1. Add taints to all nodes.
  2. Run a pod with tolerations on one of the tainted nodes.
  3. Try to get a PodCheckpoint of that pod.

Other debugging information (if applicable):

Example:

  • I have an IG named appikstesting which has the following taints on all nodes:
   Taints:             ig/appikstesting:NoExecute
                       ig/appikstesting:NoSchedule

This results in the following:

kgp
NAME                                            READY   STATUS        RESTARTS   AGE    IP                NODE                                        NOMINATED NODE   READINESS GATES
forensics-controller-manager-77f778974f-7ng89   2/2     Running       0          150m   100.117.143.135   ip-10-83-9-41.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-2lpfc                        0/1     Terminating   0          59s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-4bkzc                        0/1     Terminating   0          61s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-4cmd8                        0/1     Terminating   0          61s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-5sjvd                        0/1     Terminating   0          54s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-67dwl                        0/1     Terminating   0          53s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-7p9qh                        0/1     Terminating   0          61s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-7zjdh                        0/1     Terminating   0          55s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-8zx6x                        0/1     Terminating   0          57s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-9rf7s                        0/1     Terminating   0          61s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-b2trw                        0/1     Terminating   0          61s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-bbdmw                        0/1     Terminating   0          60s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-bttjl                        0/1     Terminating   0          63s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-c56c5                        0/1     Terminating   0          62s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-dt697                        0/1     Terminating   0          63s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-fvrv2                        0/1     Terminating   0          56s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-hxx9f                        0/1     Terminating   0          56s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-j559x                        0/1     Terminating   0          63s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-j9zms                        0/1     Terminating   0          62s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-kbx6w                        0/1     Terminating   0          62s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-knddq                        0/1     Terminating   0          56s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-l2zkj                        0/1     Terminating   0          57s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-l5mq4                        0/1     Terminating   0          60s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-lp65r                        0/1     Terminating   0          59s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-m4d4z                        0/1     Terminating   0          61s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-mnslk                        0/1     Terminating   0          53s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-mvln4                        0/1     Terminating   0          55s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-p9z55                        0/1     Terminating   0          55s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-pmmjd                        0/1     Terminating   0          58s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-qjwkn                        0/1     Terminating   0          59s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-qlbxr                        0/1     Terminating   0          60s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-qr89n                        0/1     Terminating   0          61s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-r69vc                        0/1     Terminating   0          58s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-rpd96                        0/1     Terminating   0          59s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-rqvqw                        0/1     Terminating   0          64s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-rzhnc                        0/1     Terminating   0          54s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-s7fq8                        0/1     Terminating   0          62s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-v2fh6                        0/1     Terminating   0          64s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-vkc2s                        0/1     Terminating   0          61s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-vqkjv                        0/1     Terminating   0          64s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-wcp6v                        0/1     Terminating   0          63s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-wdnwg                        0/1     Terminating   0          59s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-wqmtz                        0/1     Terminating   0          55s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-zhxmn                        0/1     Terminating   0          54s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>

The above fight between the job controller and the kube-controller-manager managed to schedule and evict about 23,000 pods in 30 minutes.

After I removed the NoExecute taint from the node in question, the pods were scheduled and everything worked as expected:

kgp
NAME                                            READY   STATUS      RESTARTS   AGE     IP                NODE                                        NOMINATED NODE   READINESS GATES
forensics-controller-manager-77f778974f-7ng89   2/2     Running     0          156m    100.117.143.135   ip-10-83-9-41.us-west-2.compute.internal    <none>           <none>
kiampod-test-job-ctwlf                          0/1     Completed   0          43m     10.83.8.241       ip-10-83-8-241.us-west-2.compute.internal   <none>           <none>
workflowpod-test-job-7t9b4                      1/1     Running     0          3m56s   10.83.9.41        ip-10-83-9-41.us-west-2.compute.internal    <none>           <none>
tekenstam added the bug (Something isn't working) label on Aug 15, 2019
eytan-avisror (Collaborator) commented Aug 22, 2019

I think using nodeName is problematic here because it bypasses the scheduler by declaring which node you want the pod to run on.
Whereas if you went through the scheduler, the pod would have stayed Pending since it is missing the required toleration.

Maybe the fix should be to:

  • Affinitize the pod to the node instead of setting it in nodeName (or use a nodeSelector).
  • Get the node spec prior to scheduling the job, find any taints, and add matching tolerations to the job (see the sketch below).
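
A rough sketch of the second suggestion, assuming client-go (this is not the existing controller code; the function names are mine). It reads the target node's taints, builds matching tolerations for the worker pod, and pins the pod to the node through the scheduler via a nodeSelector on the hostname label instead of spec.nodeName:

package forensics

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tolerationsForNode fetches the node and builds one toleration per taint,
// so a worker pod targeted at that node is neither rejected nor evicted.
func tolerationsForNode(ctx context.Context, client kubernetes.Interface, nodeName string) ([]corev1.Toleration, error) {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	tolerations := make([]corev1.Toleration, 0, len(node.Spec.Taints))
	for _, taint := range node.Spec.Taints {
		tol := corev1.Toleration{Key: taint.Key, Effect: taint.Effect}
		if taint.Value == "" {
			// Taints like ig/appikstesting:NoExecute carry no value.
			tol.Operator = corev1.TolerationOpExists
		} else {
			tol.Operator = corev1.TolerationOpEqual
			tol.Value = taint.Value
		}
		tolerations = append(tolerations, tol)
	}
	return tolerations, nil
}

// applyNodeTargeting pins the worker pod to the node through the scheduler
// (nodeSelector on the well-known hostname label) instead of bypassing it
// with spec.nodeName, and attaches the matching tolerations.
func applyNodeTargeting(pod *corev1.Pod, nodeName string, tolerations []corev1.Toleration) {
	pod.Spec.NodeName = ""
	pod.Spec.NodeSelector = map[string]string{corev1.LabelHostname: nodeName}
	pod.Spec.Tolerations = append(pod.Spec.Tolerations, tolerations...)
}

One caveat: selecting on the kubernetes.io/hostname label assumes its value matches the node name, which is the default on most clusters but worth verifying; node affinity on metadata.name is an alternative.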
