
[RayService][Bug] Partial Removal of Deployments in ray-service.sample.yaml's ServeConfigV2 Causes WaitForServeDeploymentReady State #2557

Open
CheyuWu opened this issue Nov 20, 2024 · 0 comments
Labels
1.3.0 bug Something isn't working rayservice serve

Comments


CheyuWu commented Nov 20, 2024

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator, Others

What happened + What you expected to happen

In ray-service.sample.yaml, the serveConfigV2 defines deployments for MangoStand, OrangeStand, and PearStand.

  • If two of these deployments are removed (e.g., keeping only MangoStand and FruitMarket), running kubectl get rayservice shows the state as WaitForServeDeploymentReady, and the service does not reach a ready state.
  • However, if only one deployment is removed (e.g., keeping two of the three), the service works as expected.

Reproduction script

1. Edit the ray-service.sample.yaml file to remove two of the three deployments in serveConfigV2 (e.g., keep only MangoStand).
2. Apply the updated file: kubectl apply -f ray-service.sample.yaml.
3. Run kubectl get rayservice and observe the status stuck in WaitForServeDeploymentReady.
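The steps above can be scripted as follows (a sketch, not part of the original report; it assumes kubectl is configured against a cluster with the KubeRay operator installed and that ray-service.sample.yaml has already been edited as described):

```shell
#!/bin/sh
# Sketch of the reproduction steps. Assumes ray-service.sample.yaml has been
# edited to keep only MangoStand and FruitMarket in fruit_app.
repro() {
  # Apply the edited RayService manifest.
  kubectl apply -f ray-service.sample.yaml
  # Print the service status; per the report it stays WaitForServeDeploymentReady.
  kubectl get rayservice rayservice-sample \
    -o jsonpath='{.status.serviceStatus}'
}

if command -v kubectl >/dev/null 2>&1; then
  repro
else
  # Allows the script to run harmlessly without a cluster.
  echo "kubectl not found; run against a cluster with the KubeRay operator installed"
fi
```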

# Make sure to increase resource requests and limits before using this example in production.
# For examples with more realistic resource configuration, see
# ray-cluster.complete.large.yaml and
# ray-cluster.autoscaler.large.yaml.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: rayservice-sample
spec:
  # serveConfigV2 takes a yaml multi-line scalar, which should be a Ray Serve multi-application config. See https://docs.ray.io/en/latest/serve/multi-app.html.
  serveConfigV2: |
    applications:
      - name: fruit_app
        import_path: fruit.deployment_graph
        route_prefix: /fruit
        runtime_env:
          working_dir: "https://github.com/ray-project/test_dag/archive/78b4a5da38796123d9f9ffff59bab2792a043e95.zip"
        deployments:
          - name: MangoStand
            num_replicas: 2
            max_replicas_per_node: 1
            user_config:
              price: 3
            ray_actor_options:
              num_cpus: 0.1
          - name: FruitMarket
            num_replicas: 1
            ray_actor_options:
              num_cpus: 0.1
      - name: math_app
        import_path: conditional_dag.serve_dag
        route_prefix: /calc
        runtime_env:
          working_dir: "https://github.com/ray-project/test_dag/archive/78b4a5da38796123d9f9ffff59bab2792a043e95.zip"
        deployments:
          - name: Adder
            num_replicas: 1
            user_config:
              increment: 3
            ray_actor_options:
              num_cpus: 0.1
          - name: Multiplier
            num_replicas: 1
            user_config:
              factor: 5
            ray_actor_options:
              num_cpus: 0.1
          - name: Router
            num_replicas: 1
  rayClusterConfig:
    rayVersion: '2.9.0' # should match the Ray version in the image of the containers
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
      #pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
              resources:
                limits:
                  cpu: 2
                  memory: 4Gi
                requests:
                  cpu: 2
                  memory: 4Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # logical group name, for this called small-group, also can be functional
        groupName: small-group
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams: {}
        #pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: rayproject/ray:2.9.0
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh","-c","ray stop"]
                resources:
                  limits:
                    cpu: "1"
                    memory: "2Gi"
                  requests:
                    cpu: "500m"
                    memory: "2Gi"

Anything else

We need to investigate why removing two deployments causes the issue while removing only one deployment does not. It seems like there might be a threshold or configuration issue in serveConfigV2.
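As an illustration of the two cases to investigate, here is a small sketch (plain Python, not KubeRay operator code; the dicts are hand-written to mirror the sample's serveConfigV2) that diffs the deployment names between the original and edited configs:

```python
# Illustrative only: hand-written dicts mirroring the sample serveConfigV2,
# not the operator's actual reconciliation logic.
original = {
    "fruit_app": ["MangoStand", "OrangeStand", "PearStand", "FruitMarket"],
    "math_app": ["Adder", "Multiplier", "Router"],
}
edited = {
    "fruit_app": ["MangoStand", "FruitMarket"],  # two deployments removed
    "math_app": ["Adder", "Multiplier", "Router"],
}

def removed_deployments(old, new):
    """Return {app: deployments present in old but missing from new}."""
    return {
        app: sorted(set(old[app]) - set(new.get(app, [])))
        for app in old
        if set(old[app]) - set(new.get(app, []))
    }

print(removed_deployments(original, edited))
# → {'fruit_app': ['OrangeStand', 'PearStand']}
```

Removing both OrangeStand and PearStand (two deployments from one application) is the case reported to hang in WaitForServeDeploymentReady, while removing only one of them works.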

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@CheyuWu CheyuWu added bug Something isn't working triage labels Nov 20, 2024
@CheyuWu CheyuWu changed the title [Bug] Partial Removal of Deployments in ray-service.sample.yaml's ServeConfigV2 Causes WaitForServeDeploymentReady State [RayService][Bug] Partial Removal of Deployments in ray-service.sample.yaml's ServeConfigV2 Causes WaitForServeDeploymentReady State Nov 20, 2024