[Bug]: Service is terminated if scaling fails #1979

Open
jvstme opened this issue Nov 11, 2024 · 0 comments
Labels
bug Something isn't working

Comments

jvstme (Collaborator) commented Nov 11, 2024

Steps to reproduce

This can be reproduced both with auto-scaling and with manual in-place update. This example uses in-place update.
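
For reference, the auto-scaling variant would use a service configuration roughly like the sketch below (based on dstack's documented scaling options; shown only for context, since the steps that follow use the manual update):

type: service
name: my-service

commands:
  - python -m http.server
port: 8000

replicas: 1..2
scaling:
  metric: rps
  target: 10

spot_policy: auto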

  1. Provision a fleet with one instance.
> cat fleets/cloud.dstack.yml
type: fleet
name: cloud
nodes: 1
spot_policy: auto

> dstack apply -f fleets/cloud.dstack.yml -y
  2. Run a single-replica service with --reuse.
> cat services/.dstack.yml 
type: service
name: my-service

commands:
  - python -m http.server
port: 8000

replicas: 1

spot_policy: auto

> dstack apply -f services/.dstack.yml --reuse -y
  3. Try scaling the service to two replicas by updating and reapplying the configuration with --reuse.
> cat services/.dstack.yml 
type: service
name: my-service

commands:
  - python -m http.server
port: 8000

replicas: 2

spot_policy: auto

> dstack apply -f services/.dstack.yml --reuse -y
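
The run and replica states after the reapply can then be inspected, for example with:

> dstack ps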

Actual behaviour

The second replica fails because there are no idle instances to reuse. The run is terminated because the second replica failed to start.

 #  BACKEND  REGION    INSTANCE       RESOURCES                   SPOT  PRICE         
 1  gcp      us-west4  e2-standard-2  2xCPU, 8GB, 100.0GB (disk)  yes   $0.0095  busy 

Active run my-service already exists. Detected configuration changes that can be updated in-place: ['replicas']
my-service provisioning completed (running)
Service is published at http://localhost:3000/proxy/services/ilya/my-service/

Serving HTTP on 0.0.0.0 port 8000 (http://localhost:3000/proxy/services/ilya/my-service/) ...
Run failed with error code TERMINATED_BY_SERVER.
Check CLI, server, and run logs for more details

Expected behaviour

The second replica fails because there are no idle instances to reuse. The run remains running.

dstack version

0.18.24

Server logs

INFO     dstack._internal.server.services.runs:861 run(46bef6)my-service: scaling UP 1 replica(s)
[09:06:24] DEBUG    dstack._internal.server.background.tasks.process_submitted_jobs:99 job(c9251a)my-service-0-1: provisioning has started
[09:06:29] DEBUG    dstack._internal.server.background.tasks.process_submitted_jobs:99 job(c9251a)my-service-0-1: provisioning has started
           DEBUG    dstack._internal.server.background.tasks.process_submitted_jobs:222 job(c9251a)my-service-0-1: reuse instance failed
           INFO     dstack._internal.server.background.tasks.process_runs:330 run(46bef6)my-service: run status has changed RUNNING -> TERMINATING
           INFO     dstack._internal.server.services.jobs:283 job(c9251a)my-service-0-1: job status is FAILED, reason: FAILED_TO_START_DUE_TO_NO_CAPACITY
           DEBUG    dstack._internal.server.services.jobs:192 job(c1f616)my-service-0-0: stopping runner 34.16.243.7
           DEBUG    dstack._internal.server.services.jobs:234 job(c1f616)my-service-0-0: stopping container
[09:07:02] INFO     dstack._internal.server.services.jobs:268 job(c1f616)my-service-0-0: instance 'cloud-0' has been released, new status is IDLE
           INFO     dstack._internal.server.services.jobs:283 job(c1f616)my-service-0-0: job status is TERMINATED, reason: TERMINATED_BY_SERVER
           INFO     dstack._internal.server.services.runs:848 run(46bef6)my-service: run status has changed TERMINATING -> FAILED, reason: JOB_FAILED

Additional information

It is important to keep the existing replica running to avoid service downtime. One of the main reasons for running multiple replicas is fault tolerance, so the failure of one replica should not affect the others.
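
For illustration, a minimal sketch of the expected decision logic (hypothetical names and statuses, not dstack's internal API): only terminate the run when no replica is left alive; otherwise keep the run and drop or retry just the failed replica.

# Hypothetical sketch of the expected behaviour; names are illustrative
# and do not mirror dstack's internal run-processing code.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    status: str  # e.g. "RUNNING", "PROVISIONING", "FAILED"

def decide_run_action(jobs: list[Job]) -> str:
    failed = [j for j in jobs if j.status == "FAILED"]
    alive = [j for j in jobs if j.status in ("RUNNING", "PROVISIONING")]
    if failed and alive:
        # Expected: retire or retry only the failed replicas,
        # keep the run and its healthy replicas serving traffic.
        return "KEEP_RUNNING"
    if failed and not alive:
        # Only when no replica is left should the whole run fail.
        return "TERMINATE_RUN"
    return "NO_ACTION"

# The scenario from this issue: one running replica, one failed scale-up.
print(decide_run_action([Job("my-service-0-0", "RUNNING"),
                         Job("my-service-0-1", "FAILED")]))  # -> KEEP_RUNNING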

jvstme added the bug label Nov 11, 2024