Fix Flaky CI - Kill and Run #4455
base: master
Conversation
Hmm... Interestingly enough, now we're not getting any failures in a full 25 runs! (See the GHA runs after the PR was closed.) Very interesting...
The most recent run shows that it only fails when we run multiple tests together -- e.g. ...
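To reproduce the intermittency outside of CI, a simple repeat-runner can help. This is just a minimal sketch; the test command is an assumption and should be replaced with whatever invocation the GHA workflow actually uses:

```python
import subprocess

# Hypothetical invocation of the flaky kill/run tests -- substitute the real CI command.
TEST_CMD = ["python", "tests/cli/test_cli.py", "kill", "run"]

failures = 0
for i in range(25):  # mirror the 25-run experiment from the GHA runs
    result = subprocess.run(TEST_CMD, capture_output=True, text=True)
    if result.returncode != 0:
        failures += 1
        print(f"run {i}: FAILED\n{result.stdout[-2000:]}")
print(f"{failures}/25 runs failed")
```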
Looking at the logs of a test failure, here's what we see. At the bottom of the rest-server logs (where ...
So, in particular, we see that ...
What we see is that the ...
OK, now that I've added more logging, I see the following bundle metadata at the end of ...:

```json
{
"actions":[
"kill"
],
"allow_failed_dependencies":false,
"cpu_usage":0.0,
"created":1684306433,
"data_size":4096,
"description":"",
"docker_image":"python@sha256:b45ed695466dfbab748f4b2f73e3f9dc43236deda8c1a6f9eeb5fd3c861e5354",
"exclude_patterns":[
],
"failure_message":"Kill requested",
"last_updated":1684306448,
"memory_usage":0.0703125,
"name":"run-sleep",
"on_preemptible_worker":false,
"remote":"fv-az592-777",
"remote_history":[
"3f7310a8fd23"
],
"request_cpus":1,
"request_disk":"1m",
"request_docker_image":"python:3.6.10-slim-buster",
"request_gpus":0,
"request_memory":"10m",
"request_network":true,
"request_priority":0,
"request_queue":"",
"request_time":"",
"run_status":"Finalizing bundle.",
"staged_status":"Bundle's dependencies are all ready. Waiting for the bundle to be assigned to a worker to be run.",
"started":1684306433,
"store":"",
"tags":[
],
"time":3.096053,
"time_cleaning_up":0.025354385375976562,
"time_preparing":5.472713470458984,
"time_running":5.025737524032593,
"time_system":0,
"time_uploading_results":0.12300682067871094,
"time_user":0
}
```

And, at the beginning of ...:

```json
{
"actions":[
"kill"
],
"allow_failed_dependencies":false,
"created":1684306433,
"description":"",
"exclude_patterns":[
],
"name":"run-sleep",
"on_preemptible_worker":false,
"remote_history":[
"3f7310a8fd23"
],
"request_cpus":1,
"request_disk":"1m",
"request_docker_image":"python:3.6.10-slim-buster",
"request_gpus":0,
"request_memory":"10m",
"request_network":true,
"request_priority":0,
"request_queue":"",
"request_time":"",
"staged_status":"Bundle's dependencies are all ready. Waiting for the bundle to be assigned to a worker to be run.",
"started":1684306433,
"store":"",
"tags":[
]
}
```

Now, we're getting somewhere. The failure message is wiped from the metadata, and the metadata itself changes significantly. It looks like something else is updating the bundle prematurely, before ...
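To pin down exactly which fields get wiped or rewritten between these two snapshots, it's easiest to diff them directly. A small sketch, assuming the two JSON payloads above are saved to `before.json` and `after.json`:

```python
import json

with open("before.json") as f:
    before = json.load(f)
with open("after.json") as f:
    after = json.load(f)

removed = sorted(set(before) - set(after))   # fields present in the first snapshot but wiped in the second
added = sorted(set(after) - set(before))     # fields that only appear in the second snapshot
changed = sorted(k for k in set(before) & set(after) if before[k] != after[k])

print("removed:", removed)   # should include failure_message, run_status, the time_* fields, etc.
print("added:", added)
print("changed:", changed)
```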
Note: I thought maybe it was due to running ..., but that run's bundle metadata looks like this:

```json
{
"actions":[
"kill"
],
"allow_failed_dependencies":false,
"cpu_usage":0.0,
"created":1684309050,
"data_size":4096,
"description":"",
"docker_image":"python@sha256:b45ed695466dfbab748f4b2f73e3f9dc43236deda8c1a6f9eeb5fd3c861e5354",
"exclude_patterns":[
],
"last_updated":1684309071,
"memory_usage":0.069140625,
"name":"run-sleep",
"on_preemptible_worker":false,
"remote":"fv-az842-109",
"remote_history":[
"fd7b126df007"
],
"request_cpus":1,
"request_disk":"1m",
"request_docker_image":"python:3.6.10-slim-buster",
"request_gpus":0,
"request_memory":"10m",
"request_network":true,
"request_priority":0,
"request_queue":"",
"request_time":"",
"run_status":"Finalizing bundle.",
"staged_status":"Bundle's dependencies are all ready. Waiting for the bundle to be assigned to a worker to be run.",
"started":1684309050,
"store":"",
"tags":[
],
"time":3.109395,
"time_cleaning_up":0.022008895874023438,
"time_preparing":5.538380861282349,
"time_running":5.032529592514038,
"time_system":0,
"time_uploading_results":10.242918252944946,
"time_user":0
}
```

... which makes me think that's not the issue. Hmm... it's definitely a race condition, because it's failing intermittently and it's due to a bundle update happening somewhere else...
I don't see where that extra bundle update is occurring...
So, I'm a bit perplexed right now... Where else could a bundle metadata update arise?
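One way to flush out the hidden writer would be to log a stack trace on every bundle-metadata write. A rough sketch of that kind of instrumentation; `update_bundle` is my assumption for where the write happens and should be pointed at the actual model method:

```python
import logging
import traceback

def trace_bundle_updates(model):
    """Wrap the model's bundle-update method so every write logs its call stack."""
    original = model.update_bundle  # assumed method name; point this at the real write path

    def wrapped(bundle, update, *args, **kwargs):
        logging.warning(
            "bundle %s updated with %s\n%s",
            getattr(bundle, "uuid", "?"),
            update,
            "".join(traceback.format_stack(limit=10)),
        )
        return original(bundle, update, *args, **kwargs)

    model.update_bundle = wrapped
```

With that in place, the CI logs would show the stack of whichever code path issues the extra update that wipes the failure message.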
Note: For a correct run, we have the following. Bundle metadata at the end of ...:

```json
{
"actions":[
"kill"
],
"allow_failed_dependencies":false,
"cpu_usage":0.0,
"created":1684308916,
"data_size":4096,
"description":"",
"docker_image":"python@sha256:b45ed695466dfbab748f4b2f73e3f9dc43236deda8c1a6f9eeb5fd3c861e5354",
"exclude_patterns":[
],
"failure_message":"Kill requested",
"last_updated":1684308936,
"memory_usage":0.069921875,
"name":"run-sleep",
"on_preemptible_worker":false,
"remote":"fv-az397-676",
"remote_history":[
"3246b8975815"
],
"request_cpus":1,
"request_disk":"1m",
"request_docker_image":"python:3.6.10-slim-buster",
"request_gpus":0,
"request_memory":"10m",
"request_network":true,
"request_priority":0,
"request_queue":"",
"request_time":"",
"run_status":"Finalizing bundle.",
"staged_status":"Bundle's dependencies are all ready. Waiting for the bundle to be assigned to a worker to be run.",
"started":1684308916,
"store":"",
"tags":[
],
"time":7.116782,
"time_cleaning_up":5.061987400054932,
"time_preparing":0.8744297027587891,
"time_running":9.032560110092163,
"time_system":0,
"time_uploading_results":0.1380014419555664,
"time_user":0
}
```

Bundle metadata at the end of ...:

```json
{
"actions":[
"kill"
],
"allow_failed_dependencies":false,
"cpu_usage":0.0,
"created":1684308916,
"data_size":4096,
"description":"",
"docker_image":"python@sha256:b45ed695466dfbab748f4b2f73e3f9dc43236deda8c1a6f9eeb5fd3c861e5354",
"exclude_patterns":[
],
"failure_message":"Kill requested",
"last_updated":1684308936,
"memory_usage":0.069921875,
"name":"run-sleep",
"on_preemptible_worker":false,
"remote":"fv-az397-676",
"remote_history":[
"3246b8975815"
],
"request_cpus":1,
"request_disk":"1m",
"request_docker_image":"python:3.6.10-slim-buster",
"request_gpus":0,
"request_memory":"10m",
"request_network":true,
"request_priority":0,
"request_queue":"",
"request_time":"",
"run_status":"Finalizing bundle.",
"staged_status":"Bundle's dependencies are all ready. Waiting for the bundle to be assigned to a worker to be run.",
"started":1684308916,
"store":"",
"tags":[
],
"time":7.116782,
"time_cleaning_up":5.061987400054932,
"time_preparing":0.8744297027587891,
"time_running":9.032560110092163,
"time_system":0.0,
"time_uploading_results":0.1380014419555664,
"time_user":0.0
}
```
OK, one observation: the bundle manager logs for successful and failed runs have a difference just before the run that is killed (and that ultimately fails to be registered as killed in the failed runs). For successful runs we get the following logs:
...
and for failed runs we get the following logs:
...
All the failed runs I've seen have at least one ...
…eets into fix-flaky-ci-KILL
…eets into fix-flaky-ci-KILL
…updates and I want to see exactly what's being updated
I was hoping that #4506 would fix this, but it appears that it doesn't, since that's been merged to master and, after merging master in here, we still get errors. As such, I'm still looking for the culprit... The fact that we fixed that error makes me more confident that this issue isn't due to a race condition resulting from DB transaction ordering, and is more likely due to an update happening to the bundle somewhere I haven't seen yet... I just added some more logging. One other thing I'm trying: I added in a sleep to try to induce the ...
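For reference, the sleep experiment amounts to temporarily widening the race window; something like the following, where the placement and the environment-variable name are both assumptions made purely for illustration:

```python
import os
import time

# Temporary instrumentation: delay the suspected write (e.g. in the finalization path)
# so that, if a competing bundle update exists, it reliably lands first and the test
# fails deterministically instead of intermittently.
if os.environ.get("CODALAB_INDUCE_KILL_RACE"):
    time.sleep(5)
```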
…issue. Also clean up logging a little bit
Fix flaky `kill` and `run` test (see #4433). Still parsing the issue... it appears to be related to bundle metadata being altered before the `bundle-manager` can acknowledge a run bundle as finished...