prefect-gcp: Cloud Run v2 job definitions are not cleaned up after successful flow run #16007

Open
calebhskim opened this issue Nov 13, 2024 · 0 comments
Labels
bug Something isn't working

Comments


calebhskim commented Nov 13, 2024

Bug summary

We use a Cloud Run v2 pull work pool to execute deployed flows as Cloud Run v2 jobs. We have noticed that, even when both the container and the flow exit successfully, the Cloud Run job definition is occasionally not deleted. Over time this pushes us up against Cloud Run's limit of 1000 job definitions, which prevents other deployed flows from being submitted. The worker logs show that for most flow runs a delete request is submitted to Cloud Run.

Successful Delete logs

Prefect Worker log for deleted job definition

DEFAULT 2024-11-13T18:12:49.216484Z 18:12:49.216 | INFO | prefect.flow_runs.worker - Creating Cloud Run JobV2 warm-scorpion-<FLOW_ID>
DEFAULT 2024-11-13T18:12:59.592780Z 18:12:59.593 | INFO | prefect.flow_runs.worker - Submitting Cloud Run Job V2 warm-scorpion-<FLOW_ID> for execution...
DEFAULT 2024-11-13T18:12:59.784565Z 18:12:59.784 | INFO | prefect.flow_runs.worker - Cloud Run Job V2 warm-scorpion-<FLOW_ID> submitted for execution with command: p r e f e c t f l o w - r u n e x e c u t e
DEFAULT 2024-11-13T18:18:12.420228Z 18:18:12.420 | INFO | prefect.flow_runs.worker - Cloud Run Job V2 warm-scorpion-<FLOW_ID> succeeded
DEFAULT 2024-11-13T18:18:12.421823Z 18:18:12.421 | INFO | prefect.flow_runs.worker - Job run logs can be found on GCP at: https://console.cloud.google.com/logs/viewer?...
DEFAULT 2024-11-13T18:18:12.423120Z 18:18:12.423 | INFO | prefect.flow_runs.worker - Deleting completed Cloud Run Job 'warm-scorpion-<FLOW_ID>' from Google Cloud Run...

Prefect UI Job log for deleted job

Cloud Run Job V2 warm-scorpion-<FLOW_ID> submitted for execution with command: p r e f e c t   f l o w - r u n   e x e c u t e
10:12:59 AM
prefect.flow_runs.worker
Completed submission of flow run '<FLOW_RUN_ID>'
10:12:59 AM
prefect.flow_runs.worker
Opening process...
10:13:26 AM
prefect.flow_runs.runner
Uploading blob named <BLOB NAME> to the <BUCKET NAME> bucket
10:17:42 AM
prefect.flow_runs
Finished in state Completed()
10:17:42 AM
prefect.flow_runs
Process for flow run 'warm-scorpion' exited cleanly.
10:17:46 AM
prefect.flow_runs.runner
Cloud Run Job V2 warm-scorpion-<JOB_ID> succeeded
10:18:12 AM
prefect.flow_runs.worker
Job run logs can be found on GCP at: https://console.cloud.google.com/logs/viewer?...
10:18:12 AM
prefect.flow_runs.worker
Deleting completed Cloud Run Job 'warm-scorpion-<JOB_ID>' from Google Cloud Run...

Cloud Run API Audit log: Corresponding delete event for job definition

audit_log, method: "google.cloud.run.v2.Jobs.DeleteJob"

Unsuccessful Delete logs

For flow runs whose job definitions are not deleted, we observe that the container and flow exit successfully, but the worker logs contain no associated delete request:

Prefect worker logs for non-deleted job definition

DEFAULT 2024-11-07T23:59:13.787708Z 23:59:13.788 | INFO | prefect.flow_runs.worker - Creating Cloud Run JobV2 clay-wolverine-<JOB_ID>
DEFAULT 2024-11-07T23:59:24.232438Z 23:59:24.232 | INFO | prefect.flow_runs.worker - Submitting Cloud Run Job V2 clay-wolverine-<JOB_ID> for execution...
DEFAULT 2024-11-07T23:59:24.478259Z 23:59:24.478 | INFO | prefect.flow_runs.worker - Cloud Run Job V2 clay-wolverine-<JOB_ID> submitted for execution with command: p r e f e c t f l o w - r u n e x e c u t e
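
A minimal sketch of how affected runs could be picked out of exported worker logs: it pairs each "Creating Cloud Run JobV2 ..." line with a later "Deleting completed Cloud Run Job '...'" line and reports the job names that never got one. The message formats are taken from the log excerpts in this issue; the function name `jobs_without_delete` and the `-abc`/`-def` suffixes in the sample data are hypothetical.

```python
import re
from typing import Iterable

# Patterns based on the worker log messages quoted in this issue.
CREATE_RE = re.compile(r"Creating Cloud Run JobV2 (\S+)")
DELETE_RE = re.compile(r"Deleting completed Cloud Run Job '([^']+)'")


def jobs_without_delete(lines: Iterable[str]) -> list[str]:
    """Return job names that were created but never had a delete logged.

    Names are returned in the order in which they were first created.
    """
    created: dict[str, None] = {}  # dict keeps insertion order
    deleted: set[str] = set()
    for line in lines:
        if (m := CREATE_RE.search(line)):
            created.setdefault(m.group(1), None)
        elif (m := DELETE_RE.search(line)):
            deleted.add(m.group(1))
    return [name for name in created if name not in deleted]


if __name__ == "__main__":
    # Sample lines with made-up job suffixes, for illustration only.
    sample = [
        "| INFO | prefect.flow_runs.worker - Creating Cloud Run JobV2 warm-scorpion-abc",
        "| INFO | prefect.flow_runs.worker - Creating Cloud Run JobV2 clay-wolverine-def",
        "| INFO | prefect.flow_runs.worker - Deleting completed Cloud Run Job 'warm-scorpion-abc' from Google Cloud Run...",
    ]
    print(jobs_without_delete(sample))
```

Note that a job whose run is still in progress will also show up in the result, so the log window should end well after the last expected completion.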

Prefect UI job logs for non-deleted job definition

Cloud Run Job V2 clay-wolverine-<JOB_ID> submitted for execution with command: p r e f e c t   f l o w - r u n   e x e c u t e
03:59:24 PM
prefect.flow_runs.worker
Completed submission of flow run '<FLOW_RUN_ID>'
03:59:24 PM
prefect.flow_runs.worker
Opening process...
03:59:51 PM
prefect.flow_runs.runner
Uploading blob named <BLOB_NAME> to the <BUCKET_NAME> bucket
04:03:43 PM
prefect.flow_runs
Finished in state Completed()
04:03:43 PM
prefect.flow_runs
Process for flow run 'clay-wolverine' exited cleanly.

For our deployed flows we use the default job variable setting keep_job: false. Our current workaround is a scheduled job that cleans up "stale" job definitions. The same problem also occurs for failed job runs. Any help, or pointers to something potentially misconfigured on our end, would be appreciated.
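
A minimal sketch of what such a scheduled cleanup could look like, assuming the `google-cloud-run` Python client (`run_v2.JobsClient` with `list_jobs` and `delete_job`). The `JobInfo` helper, the function names, and the one-hour threshold are hypothetical, and the `Job` field names used (`create_time`, `latest_created_execution.completion_time`) should be checked against the client version in use. As written it considers every job in the project and location, so a name or label filter would be needed if non-Prefect jobs live there too.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class JobInfo:
    """Hypothetical, client-independent view of a Cloud Run job definition."""
    name: str
    create_time: datetime
    # None when the latest execution has not finished (or none was started).
    last_completion_time: Optional[datetime]


def select_stale_jobs(
    jobs: Iterable[JobInfo],
    now: datetime,
    min_age: timedelta = timedelta(hours=1),  # assumed threshold
) -> list[str]:
    """Return names of jobs whose latest execution finished and that are old enough."""
    stale = []
    for job in jobs:
        finished = job.last_completion_time is not None
        old_enough = now - job.create_time >= min_age
        if finished and old_enough:
            stale.append(job.name)
    return stale


def delete_stale_jobs(
    project: str, location: str, min_age: timedelta = timedelta(hours=1)
) -> list[str]:
    """Delete stale job definitions and return the names that were deleted."""
    # Imported lazily so the selection logic works without the GCP client installed.
    from google.cloud import run_v2

    client = run_v2.JobsClient()
    parent = f"projects/{project}/locations/{location}"
    infos = [
        JobInfo(
            name=job.name,
            create_time=job.create_time,
            last_completion_time=job.latest_created_execution.completion_time,
        )
        for job in client.list_jobs(parent=parent)
    ]
    stale = select_stale_jobs(infos, now=datetime.now(timezone.utc), min_age=min_age)
    for name in stale:
        # delete_job returns a long-running operation; wait for it to finish.
        client.delete_job(name=name).result()
    return stale
```

Keeping the selection logic separate from the API calls makes it possible to test the staleness rule without touching Cloud Run, and jobs that are still running or were never executed are left alone.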

Version info

Version:             3.0.10
API version:         0.8.4
Python version:      3.11.9
Git commit:          3aa2d893
Built:               Tue, Oct 15, 2024 1:31 PM
OS/Arch:             darwin/arm64
Profile:             ephemeral
Server type:         ephemeral
Pydantic version:    2.9.2
Server:
  Database:          sqlite
  SQLite version:    3.46.0
Integrations:
  prefect-gcp:       0.6.1
  prefect-docker:    0.6.1

Additional context

I did see issue #14525, which is also open and seems to be related.

@calebhskim calebhskim added the bug Something isn't working label Nov 13, 2024