
[RayService][Bug] Partial Removal of Deployments in ray-service.sample.yaml's ServeConfigV2 Causes WaitForServeDeploymentReady State #2557

Open
CheyuWu opened this issue Nov 20, 2024 · 0 comments
Labels
1.3.0 bug Something isn't working rayservice serve

Comments


CheyuWu commented Nov 20, 2024

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator, Others

What happened + What you expected to happen

In ray-service.sample.yaml, the serveConfigV2 defines deployments for MangoStand, OrangeStand, and PearStand.

  • If two of these deployments are removed (e.g., keeping only MangoStand and FruitMarket), running kubectl get rayservice shows the state as WaitForServeDeploymentReady, and the service does not reach a ready state.
  • However, if only one deployment is removed (e.g., keeping two of the three), the service works as expected.

Reproduction script

1. Edit the ray-service.sample.yaml file to remove two of the three deployments in serveConfigV2 (e.g., keep only MangoStand).
2. Apply the updated file: kubectl apply -f ray-service.sample.yaml.
3. Run kubectl get rayservice and observe the status stuck in WaitForServeDeploymentReady.
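The steps above can be scripted as follows (a sketch, not part of the original report; it assumes kubectl is configured against a cluster with the KubeRay operator installed and that ray-service.sample.yaml has already been edited as described):

```shell
#!/bin/sh
# Sketch of the reproduction steps. Assumes ray-service.sample.yaml has been
# edited to keep only MangoStand and FruitMarket in fruit_app.
repro() {
  # Apply the edited RayService manifest.
  kubectl apply -f ray-service.sample.yaml
  # Print the service status; per the report it stays WaitForServeDeploymentReady.
  kubectl get rayservice rayservice-sample \
    -o jsonpath='{.status.serviceStatus}'
}

if command -v kubectl >/dev/null 2>&1; then
  repro
else
  # Allows the script to run harmlessly without a cluster.
  echo "kubectl not found; run against a cluster with the KubeRay operator installed"
fi
```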

# Make sure to increase resource requests and limits before using this example in production.
# For examples with more realistic resource configuration, see
# ray-cluster.complete.large.yaml and
# ray-cluster.autoscaler.large.yaml.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: rayservice-sample
spec:
  # serveConfigV2 takes a yaml multi-line scalar, which should be a Ray Serve multi-application config. See https://docs.ray.io/en/latest/serve/multi-app.html.
  serveConfigV2: |
    applications:
      - name: fruit_app
        import_path: fruit.deployment_graph
        route_prefix: /fruit
        runtime_env:
          working_dir: "https://github.com/ray-project/test_dag/archive/78b4a5da38796123d9f9ffff59bab2792a043e95.zip"
        deployments:
          - name: MangoStand
            num_replicas: 2
            max_replicas_per_node: 1
            user_config:
              price: 3
            ray_actor_options:
              num_cpus: 0.1
          - name: FruitMarket
            num_replicas: 1
            ray_actor_options:
              num_cpus: 0.1
      - name: math_app
        import_path: conditional_dag.serve_dag
        route_prefix: /calc
        runtime_env:
          working_dir: "https://github.com/ray-project/test_dag/archive/78b4a5da38796123d9f9ffff59bab2792a043e95.zip"
        deployments:
          - name: Adder
            num_replicas: 1
            user_config:
              increment: 3
            ray_actor_options:
              num_cpus: 0.1
          - name: Multiplier
            num_replicas: 1
            user_config:
              factor: 5
            ray_actor_options:
              num_cpus: 0.1
          - name: Router
            num_replicas: 1
  rayClusterConfig:
    rayVersion: '2.9.0' # should match the Ray version in the image of the containers
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
      #pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
              resources:
                limits:
                  cpu: 2
                  memory: 4Gi
                requests:
                  cpu: 2
                  memory: 4Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # logical group name, for this called small-group, also can be functional
        groupName: small-group
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams: {}
        #pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: rayproject/ray:2.9.0
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh","-c","ray stop"]
                resources:
                  limits:
                    cpu: "1"
                    memory: "2Gi"
                  requests:
                    cpu: "500m"
                    memory: "2Gi"

Anything else

We need to investigate why removing two deployments causes the issue while removing only one deployment does not. It seems like there might be a threshold or configuration issue in serveConfigV2.
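As an illustration of the two cases to investigate, here is a small sketch (plain Python, not KubeRay operator code; the dicts are hand-written to mirror the sample's serveConfigV2) that diffs the deployment names between the original and edited configs:

```python
# Illustrative only: hand-written dicts mirroring the sample serveConfigV2,
# not the operator's actual reconciliation logic.
original = {
    "fruit_app": ["MangoStand", "OrangeStand", "PearStand", "FruitMarket"],
    "math_app": ["Adder", "Multiplier", "Router"],
}
edited = {
    "fruit_app": ["MangoStand", "FruitMarket"],  # two deployments removed
    "math_app": ["Adder", "Multiplier", "Router"],
}

def removed_deployments(old, new):
    """Return {app: deployments present in old but missing from new}."""
    return {
        app: sorted(set(old[app]) - set(new.get(app, [])))
        for app in old
        if set(old[app]) - set(new.get(app, []))
    }

print(removed_deployments(original, edited))
# → {'fruit_app': ['OrangeStand', 'PearStand']}
```

Removing both OrangeStand and PearStand (two deployments from one application) is the case reported to hang in WaitForServeDeploymentReady, while removing only one of them works.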

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@CheyuWu CheyuWu added bug Something isn't working triage labels Nov 20, 2024
@CheyuWu CheyuWu changed the title [Bug] Partial Removal of Deployments in ray-service.sample.yaml's ServeConfigV2 Causes WaitForServeDeploymentReady State [RayService][Bug] Partial Removal of Deployments in ray-service.sample.yaml's ServeConfigV2 Causes WaitForServeDeploymentReady State Nov 20, 2024