
Forensics controller crash loop because worker pods are evicted by kube-controller-manager #5

Open
tekenstam (Member) opened this issue Aug 15, 2019 · 1 comment

Is this a BUG REPORT or FEATURE REQUEST?: BUG

What happened: The job controller submits the worker pod without any tolerations and uses spec.nodeName to place it on a specific node. Because nodeName bypasses the scheduler, the node's NoExecute taint is never checked at scheduling time, and the taint evicts the pod as soon as it is bound to the node.
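
For context, here is a minimal sketch (not the actual controller code; the function name, image, and pod name prefix are illustrative) of the kind of worker pod the job controller creates today. spec.nodeName binds the pod directly to the node, bypassing the scheduler and its taint checks, and with no tolerations the node's NoExecute taint evicts it immediately:

package forensics

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// workerPod builds a worker pod the way the issue describes: spec.nodeName
// bypasses the scheduler, and no tolerations are set, so the node's
// NoExecute taint evicts the pod as soon as it is bound.
func workerPod(nodeName string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "splunkpod-test-job-"},
		Spec: corev1.PodSpec{
			NodeName:      nodeName, // direct binding, scheduler is skipped
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:  "worker",
				Image: "forensics-worker:latest", // illustrative image
			}},
			// Tolerations is left empty, so a NoExecute taint on the node
			// evicts this pod right after it lands there.
		},
	}
}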

What you expected to happen:

  • forensics-controller should submit a bare pod and check its status, so it can decide whether to re-submit another pod in case of any failures.
  • Provide an option in the PodCheckpoint spec for the user to supply a list of tolerations (see the sketch below).
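
A hypothetical sketch of what the second option could look like in the PodCheckpoint CRD Go types. The Tolerations field (and its json tag) is a proposal, not an existing field, and the other fields are only illustrative placeholders:

package v1alpha1 // illustrative API package name

import corev1 "k8s.io/api/core/v1"

// PodCheckpointSpec sketch: Tolerations would be copied verbatim onto the
// worker pod so it can run on tainted nodes.
type PodCheckpointSpec struct {
	Pod         string              `json:"pod"`                   // target pod (illustrative)
	Namespace   string              `json:"namespace,omitempty"`   // target namespace (illustrative)
	Tolerations []corev1.Toleration `json:"tolerations,omitempty"` // proposed new field
}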

How to reproduce it (as minimally and precisely as possible):

  1. Add taints to all nodes.
  2. Run a pod with tolerations on one of the tainted nodes.
  3. Try to get a PodCheckpoint of that pod.

Other debugging information (if applicable):

Example:

  • I have an IG named appikstesting which has the following taints on all nodes:
   Taints:             ig/appikstesting:NoExecute
                       ig/appikstesting:NoSchedule

This results in the following:

kgp
NAME                                            READY   STATUS        RESTARTS   AGE    IP                NODE                                        NOMINATED NODE   READINESS GATES
forensics-controller-manager-77f778974f-7ng89   2/2     Running       0          150m   100.117.143.135   ip-10-83-9-41.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-2lpfc                        0/1     Terminating   0          59s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-4bkzc                        0/1     Terminating   0          61s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-4cmd8                        0/1     Terminating   0          61s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-5sjvd                        0/1     Terminating   0          54s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-67dwl                        0/1     Terminating   0          53s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-7p9qh                        0/1     Terminating   0          61s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-7zjdh                        0/1     Terminating   0          55s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-8zx6x                        0/1     Terminating   0          57s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-9rf7s                        0/1     Terminating   0          61s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-b2trw                        0/1     Terminating   0          61s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-bbdmw                        0/1     Terminating   0          60s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-bttjl                        0/1     Terminating   0          63s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-c56c5                        0/1     Terminating   0          62s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-dt697                        0/1     Terminating   0          63s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-fvrv2                        0/1     Terminating   0          56s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-hxx9f                        0/1     Terminating   0          56s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-j559x                        0/1     Terminating   0          63s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-j9zms                        0/1     Terminating   0          62s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-kbx6w                        0/1     Terminating   0          62s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-knddq                        0/1     Terminating   0          56s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-l2zkj                        0/1     Terminating   0          57s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-l5mq4                        0/1     Terminating   0          60s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-lp65r                        0/1     Terminating   0          59s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-m4d4z                        0/1     Terminating   0          61s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-mnslk                        0/1     Terminating   0          53s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-mvln4                        0/1     Terminating   0          55s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-p9z55                        0/1     Terminating   0          55s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-pmmjd                        0/1     Terminating   0          58s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-qjwkn                        0/1     Terminating   0          59s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-qlbxr                        0/1     Terminating   0          60s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-qr89n                        0/1     Terminating   0          61s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-r69vc                        0/1     Terminating   0          58s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-rpd96                        0/1     Terminating   0          59s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-rqvqw                        0/1     Terminating   0          64s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-rzhnc                        0/1     Terminating   0          54s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-s7fq8                        0/1     Terminating   0          62s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-v2fh6                        0/1     Terminating   0          64s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-vkc2s                        0/1     Terminating   0          61s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-vqkjv                        0/1     Terminating   0          64s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-wcp6v                        0/1     Terminating   0          63s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-wdnwg                        0/1     Terminating   0          59s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-wqmtz                        0/1     Terminating   0          55s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>
splunkpod-test-job-zhxmn                        0/1     Terminating   0          54s    <none>            ip-10-83-8-69.us-west-2.compute.internal    <none>           <none>

The above fight between the job controller and the kube-controller-manager managed to schedule and evict about 23,000 pods in 30 minutes.

After I removed the NoExecute taint from the node in question, the pods were scheduled and everything worked as expected:

kgp
NAME                                            READY   STATUS      RESTARTS   AGE     IP                NODE                                        NOMINATED NODE   READINESS GATES
forensics-controller-manager-77f778974f-7ng89   2/2     Running     0          156m    100.117.143.135   ip-10-83-9-41.us-west-2.compute.internal    <none>           <none>
kiampod-test-job-ctwlf                          0/1     Completed   0          43m     10.83.8.241       ip-10-83-8-241.us-west-2.compute.internal   <none>           <none>
workflowpod-test-job-7t9b4                      1/1     Running     0          3m56s   10.83.9.41        ip-10-83-9-41.us-west-2.compute.internal    <none>           <none>
tekenstam added the bug (Something isn't working) label on Aug 15, 2019
eytan-avisror (Collaborator) commented Aug 22, 2019

I think using nodeName is problematic here because it bypasses the scheduler by declaring which node you want the pod to run on.
Whereas if you went through the scheduler, the pod would have stayed Pending since it is missing the required toleration.

Maybe the fix should be to:

  • Affinitize the pod to the node instead of setting it in nodeName (or use a nodeSelector).
  • Get the node spec prior to scheduling the job, find any taints, and add matching tolerations to the job (see the sketch below).
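
A rough sketch of the second suggestion, assuming client-go (this is not the existing controller code; the function names are mine). It reads the target node's taints, builds matching tolerations for the worker pod, and pins the pod to the node through the scheduler via a nodeSelector on the hostname label instead of spec.nodeName:

package forensics

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tolerationsForNode fetches the node and builds one toleration per taint,
// so a worker pod targeted at that node is neither rejected nor evicted.
func tolerationsForNode(ctx context.Context, client kubernetes.Interface, nodeName string) ([]corev1.Toleration, error) {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	tolerations := make([]corev1.Toleration, 0, len(node.Spec.Taints))
	for _, taint := range node.Spec.Taints {
		tol := corev1.Toleration{Key: taint.Key, Effect: taint.Effect}
		if taint.Value == "" {
			// Taints like ig/appikstesting:NoExecute carry no value.
			tol.Operator = corev1.TolerationOpExists
		} else {
			tol.Operator = corev1.TolerationOpEqual
			tol.Value = taint.Value
		}
		tolerations = append(tolerations, tol)
	}
	return tolerations, nil
}

// applyNodeTargeting pins the worker pod to the node through the scheduler
// (nodeSelector on the well-known hostname label) instead of bypassing it
// with spec.nodeName, and attaches the matching tolerations.
func applyNodeTargeting(pod *corev1.Pod, nodeName string, tolerations []corev1.Toleration) {
	pod.Spec.NodeName = ""
	pod.Spec.NodeSelector = map[string]string{corev1.LabelHostname: nodeName}
	pod.Spec.Tolerations = append(pod.Spec.Tolerations, tolerations...)
}

One caveat: selecting on the kubernetes.io/hostname label assumes its value matches the node name, which is the default on most clusters but worth verifying; node affinity on metadata.name is an alternative.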
