
Reset Dask worker to use TCP even if it was configured to use TLS in yaml file #836

Open
weiwang217 opened this issue Oct 18, 2023 · 8 comments


@weiwang217

weiwang217 commented Oct 18, 2023

Describe the issue:
The Dask operator resets the worker's scheduler address to use TCP even if it was configured to use TLS in the YAML file. The worker environment ends up with both entries, and the tcp:// one is appended last:

- name: DASK_SCHEDULER_ADDRESS
value: tls://scheduler.join.svc.cluster.local:8786
- name: DASK_TEMPORARY_DIRECTORY
value: /tmp
- name: DASK_WORKER_NAME
value: default-worker-6a9c9e4f94
- name: DASK_SCHEDULER_ADDRESS
value: tcp://scheduler.join.svc.cluster.local:8786

The operator code that appends this config is here:

"name": "DASK_SCHEDULER_ADDRESS",

Minimal Complete Verifiable Example:

# Put your MCVE code here

Anything else we need to know?:

Environment:

  • Dask version:
  • Python version:
  • Operating System:
  • Install method (conda, pip, source):
@jacobtomlinson
Member

Thanks for raising this @weiwang217. I've opened #837 to resolve this. Would you mind testing that PR out and letting me know if it solves your problem?

@weiwang217
Author

weiwang217 commented Oct 19, 2023 via email

@jacobtomlinson
Member

We have documentation on how to do this here https://kubernetes.dask.org/en/latest/testing.html#testing-operator-controller-prs

@weiwang217
Author

weiwang217 commented Oct 20, 2023 via email

@kjleftin

Hi Jacob,

I suspect that change may have introduced a regression when running with replicas > 1. When I start a new DaskJob, all but one replica fails to connect to the scheduler because of duplicate names. Indeed, when I run
kubectl describe pod <worker_pod>

I see:
Worker 1:

    Environment:
      DASK_WORKER_NAME:        simple-job-default-worker-a10a25ac26
      DASK_SCHEDULER_ADDRESS:  tcp://simple-job-scheduler.join.svc.cluster.local:8786
      ...

Worker 2:

    Environment:
      DASK_WORKER_NAME:        simple-job-default-worker-00add84cde
      DASK_SCHEDULER_ADDRESS:  tcp://simple-job-scheduler.join.svc.cluster.local:8786
      DASK_WORKER_NAME:        simple-job-default-worker-a10a25ac26
      DASK_SCHEDULER_ADDRESS:  tcp://simple-job-scheduler.join.svc.cluster.local:8786

Because Kubernetes resolves duplicate environment variables to the last definition, and here that last definition carries the first replica's name, all replicas end up sharing the same name.
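
Here is a simplified sketch of the pattern I suspect (hypothetical code, not the operator's actual implementation): a single shared env list is appended to for every replica, so every pod spec ends up referencing the same growing list of DASK_WORKER_NAME entries.

    # Hypothetical sketch, not the operator's actual code.
    shared_env = [
        {"name": "DASK_SCHEDULER_ADDRESS",
         "value": "tcp://simple-job-scheduler.join.svc.cluster.local:8786"},
    ]

    def build_env(worker_name):
        shared_env.append({"name": "DASK_WORKER_NAME", "value": worker_name})
        return shared_env  # the same list object for every pod

    build_env("simple-job-default-worker-a10a25ac26")
    env_2 = build_env("simple-job-default-worker-00add84cde")
    print(env_2)
    # Both DASK_WORKER_NAME entries are present in every pod spec; since
    # Kubernetes keeps only the last duplicate, the replicas stop getting
    # unique names.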

Do you mind taking a look?

(Context: I'm on the same team as weiwang217 and we just noticed this change recently)

@jacobtomlinson
Member

Thanks for reporting this @kjleftin. Why are you setting the DASK_WORKER_NAME in your config?

@kjleftin

Hi Jacob,

I'm following the example code in https://kubernetes.dask.org/en/latest/operator_resources.html#daskjob

Specifically, passing the DASK_WORKER_NAME environment variable to the dask-worker CLI:

            - name: worker
              image: "ghcr.io/dask/dask:latest"
              imagePullPolicy: "IfNotPresent"
              args:
                - dask-worker
                - --name
                - $(DASK_WORKER_NAME)
                - --dashboard
                - --dashboard-address
                - "8788"

Note that I'm not setting DASK_WORKER_NAME explicitly. That is handled by the Dask Operator. (Before this change, each worker would have a different value for DASK_WORKER_NAME, but after this change, each worker has the same value).

@jacobtomlinson
Member

@kjleftin ok thanks for the clarification. I expect we may need to use copy to avoid this. I'll take a look at the PR and update it.
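
For illustration, a minimal sketch of that kind of fix (hypothetical helper names, not the actual PR): copy the shared defaults before appending the per-worker name, so each replica gets its own list.

    import copy

    # Hypothetical sketch, not the actual PR.
    base_env = [
        {"name": "DASK_SCHEDULER_ADDRESS",
         "value": "tcp://simple-job-scheduler.join.svc.cluster.local:8786"},
    ]

    def env_for_worker(worker_name):
        # Deep-copy the shared defaults so appending the per-worker name
        # never mutates the list seen by other replicas.
        env = copy.deepcopy(base_env)
        env.append({"name": "DASK_WORKER_NAME", "value": worker_name})
        return env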
