Dashboard loading slowly and hang #805

Closed
dbalabka opened this issue Aug 24, 2023 · 11 comments

@dbalabka
Contributor

dbalabka commented Aug 24, 2023

Describe the issue:

The dashboard loads very slowly. Port forwarding reports timeouts and broken pipe errors during dashboard loads:

❯ kubectl port-forward 'cluser-scheduler-d99f64dc8-hwkt5' 8787:8787
Forwarding from 127.0.0.1:8787 -> 8787
Forwarding from [::1]:8787 -> 8787
Handling connection for 8787
Handling connection for 8787
Handling connection for 8787
Handling connection for 8787
E0824 17:23:18.023748   11638 portforward.go:378] error copying from remote stream to local connection: readfrom tcp6 [::1]:8787->[::1]:35270: write tcp6 [::1]:8787->[::1]:35270: write: broken pipe
Handling connection for 8787
❯ kubectl port-forward 'cluser-scheduler-d99f64dc8-hwkt5' 8787:8787
Forwarding from 127.0.0.1:8787 -> 8787
Forwarding from [::1]:8787 -> 8787
Handling connection for 8787
Handling connection for 8787
Handling connection for 8787
Handling connection for 8787
Handling connection for 8787
Handling connection for 8787
E0824 17:15:27.312003   11363 portforward.go:368] error creating forwarding stream for port 8787 -> 8787: Timeout occurred
E0824 17:15:27.312033   11363 portforward.go:368] error creating forwarding stream for port 8787 -> 8787: Timeout occurred
E0824 17:15:27.312060   11363 portforward.go:368] error creating forwarding stream for port 8787 -> 8787: Timeout occurred
E0824 17:15:27.312063   11363 portforward.go:368] error creating forwarding stream for port 8787 -> 8787: Timeout occurred
Handling connection for 8787
Handling connection for 8787
Handling connection for 8787
E0824 17:15:57.375975   11363 portforward.go:346] error creating error stream for port 8787 -> 8787: Timeout occurred
E0824 17:15:57.392403   11363 portforward.go:346] error creating error stream for port 8787 -> 8787: Timeout occurred
Handling connection for 8787
E0824 17:16:20.522829   11363 portforward.go:346] error creating error stream for port 8787 -> 8787: Timeout occurred
E0824 17:16:27.573642   11363 portforward.go:346] error creating error stream for port 8787 -> 8787: Timeout occurred
Handling connection for 8787
Handling connection for 8787
E0824 17:16:58.658054   11363 portforward.go:346] error creating error stream for port 8787 -> 8787: Timeout occurred
E0824 17:16:58.658175   11363 portforward.go:346] error creating error stream for port 8787 -> 8787: Timeout occurred
Handling connection for 8787
E0824 17:17:28.665025   11363 portforward.go:346] error creating error stream for port 8787 -> 8787: Timeout occurred
Handling connection for 8787
Handling connection for 8787
E0824 17:18:03.699222   11363 portforward.go:346] error creating error stream for port 8787 -> 8787: Timeout occurred
E0824 17:18:03.699310   11363 portforward.go:346] error creating error stream for port 8787 -> 8787: Timeout occurred
Handling connection for 8787
E0824 17:18:33.708104   11363 portforward.go:346] error creating error stream for port 8787 -> 8787: Timeout occurred
Handling connection for 8787
Handling connection for 8787

Disabling JS solves the issue, so it looks like it is related to WebSockets. Adding an Ingress with an HTTPS certificate also solves the problem.

Minimal Complete Verifiable Example:

from dask_kubernetes.operator import KubeCluster, make_cluster_spec

spec = make_cluster_spec(
    name="dask-cluser", 
    image='ghcr.io/dask/dask:2023.8.1-py3.10',
    n_workers=1, 
    env={
        "EXTRA_PIP_PACKAGES": "s3fs",
    },
)
 
cluster = KubeCluster(
    namespace="dask-operator", 
    custom_cluster_spec=spec,
)
cluster = KubeCluster.from_name('dmitryb-cluser')

Anything else we need to know?:

Environment:

Required poetry packages:

[tool.poetry]
name = "test-bug"
version = "0.1.0"
description = ""
authors = ["Your Name <[email protected]>"]
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.10.10,<3.11"
pandas = "^2.0.0"
dask-kubernetes = "~2023.8.0"
dask = "~2023.8.0"
s3fs = "^2023.6.0"

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"

K8S version:

❯ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.16", GitCommit:"51e33fadff13065ae5518db94e84598293965939", GitTreeState:"clean", BuildDate:"2023-07-19T12:26:21Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.14+rke2r1", GitCommit:"0018fa8af8ffaff22f36d3bd86289753ca61da81", GitTreeState:"clean", BuildDate:"2023-05-18T17:22:39Z", GoVersion:"go1.19.9 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}
@dbalabka
Contributor Author

dbalabka commented Aug 24, 2023

I assume kubectl port-forward does not work properly with WebSockets. The solution might be to provide an option to create an Ingress attached to the existing service. The make_cluster_spec function could generate the following YAML:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: cluser-scheduler
  namespace: dask-operator
spec:
  rules:
  - host: <Ask user to provide a host because it is Kubernetes setup specific value>
    http:
      paths:
      - backend:
          service:
            name: cluser-scheduler
            port:
              number: 8787
        pathType: ImplementationSpecific

@dbalabka
Contributor Author

It seems that kubectl port-forward does not support WebSocket connections:
kubernetes/kubernetes#74884

@jacobtomlinson
Member

jacobtomlinson commented Aug 24, 2023

We have used kubectl port-forward successfully with the Dask Dashboard for many years, so I don't think it's that. WebSockets work just fine in my experience; however, the connection is known to close after some period of time (see #663).

It looks to me like the network connection between your machine and the Kubernetes API is not very stable given all the timeout errors in your output. Or the Kubernetes API nodes are too busy.

I agree that you might want to expose an Ingress instead of using port-forwards if your k8s cluster is set up for that. However, that is out of scope for our Python API because there are just so many different ways to do it. I would recommend defining your cluster in YAML along with an appropriate Ingress and then, optionally, using KubeCluster.from_name from Python to get an object pointer to it for convenience.
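
Attaching to an already-created cluster could look roughly like this (a minimal sketch; the cluster name reuses the one from the example above, and the client is obtained via the standard distributed API):

from dask.distributed import Client
from dask_kubernetes.operator import KubeCluster

# Attach to an existing DaskCluster resource (e.g. created with kubectl apply)
# instead of creating it from Python.
cluster = KubeCluster.from_name("dask-cluser")
client = Client(cluster)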

Alternatively, you could change the service type to LoadBalancer if you have LB provisioning configured on your cluster; in that case KubeCluster will wait for the LB to go into service and connect to that instead of performing a port-forward.
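
A sketch of what that could look like (the name, image and namespace are reused from the example above; scheduler_service_type is a make_cluster_spec option):

from dask_kubernetes.operator import KubeCluster, make_cluster_spec

# Expose the scheduler via a LoadBalancer service instead of the default
# ClusterIP, so no kubectl port-forward is needed (requires LB provisioning
# to be configured on the cluster).
spec = make_cluster_spec(
    name="dask-cluser",
    image="ghcr.io/dask/dask:2023.8.1-py3.10",
    n_workers=1,
    scheduler_service_type="LoadBalancer",
)
cluster = KubeCluster(
    namespace="dask-operator",
    custom_cluster_spec=spec,
)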

@dbalabka
Contributor Author

> It looks to me like the network connection between your machine and the Kubernetes API is not very stable given all the timeout errors in your output. Or the Kubernetes API nodes are too busy.

Given that the Ingress works fine, the network connection must be good. Also, the nodes are not performing any work at this time.

> I agree that you might want to expose an Ingress instead of using port-forwards if your k8s cluster is set up for that. However, that is out of scope for our Python API because there are just so many different ways to do it.

I agree that it is too specific to each cluster. I'm looking for an elegant solution for my team that avoids boilerplate code. Probably, creating an extra function that generates the cluster-specific YAML would solve the issue.

> Alternatively, you could change the service type to LoadBalancer if you have LB provisioning configured on your cluster; in that case KubeCluster will wait for the LB to go into service and connect to that instead of performing a port-forward.

Thanks, I will take a look into it.

@jacobtomlinson
Member

> Given that the Ingress works fine, the network connection must be good. Also, the nodes are not performing any work at this time.

I was talking about the control plane, not the nodes. Port forwarding requires the control plane to proxy the traffic, whereas an Ingress goes to the nodes directly.

> Probably, creating an extra function that generates the cluster-specific YAML would solve the issue.

If you want to do it in Python you could do something like this. But it's still a lot of boilerplate.

import kr8s

ingress = kr8s.objects.Ingress(
    {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": "cluser-scheduler", "namespace": "dask-operator"},
        "spec": {
            "rules": [
                {
                    "host": "<Ask user to provide a host because it is Kubernetes setup specific value>",
                    "http": {
                        "paths": [
                            {
                                "backend": {
                                    "service": {
                                        "name": "cluser-scheduler",
                                        "port": {"number": 8787},
                                    }
                                },
                                "pathType": "ImplementationSpecific",
                            }
                        ]
                    },
                }
            ]
        },
    }
)
ingress.create()

@dbalabka
Contributor Author

@jacobtomlinson, it seems the problem is connected with the containerd runtime. We recently migrated from Docker to containerd. You are probably not seeing the problem because you are on the Docker runtime.
kubernetes/kubernetes#74551 (comment)

@jacobtomlinson
Member

@dbalabka I'm not sure that's the problem. We regularly use GKE for testing, which uses containerd by default.

@dbalabka
Contributor Author

I tried to deploy it on the old cluster. My observations are as follows:

  1. Docker Engine 20.10.7 - the dashboard works
  2. containerd://1.7.1-k3s1 - the dashboard doesn't work

I probably have to use an Ingress or ask to upgrade to a containerd version that contains:
containerd/containerd#8418

@dbalabka
Contributor Author

Actually, I was wrong. The same issue occurs with the Docker engine: the dashboard worked for a short period of time and then started to hang. Probably, Ingress is not such a bad option. I suspect that upgrading containerd will not help. @jacobtomlinson, you are right, it is something environment-specific.

@dbalabka
Contributor Author

As suggested by @jacobtomlinson, a workaround for the problem is to use a NodePort service instead of ClusterIP:

spec = make_cluster_spec(
    ...
    scheduler_service_type="NodePort",
    ...
)
 
cluster = KubeCluster(
    ...
    custom_cluster_spec=spec,
)

@dbalabka
Contributor Author

@jacobtomlinson, it seems that with the latest version of dask-kubernetes or a newer version of k8s, the problem is gone. I will check it again once #851 is fixed.
