Fix Flaky CI - Kill and Run #4455
base: master
Conversation
Hmm... Interestingly enough, now we're not getting any failures in a full 25 runs! (See the GHA runs after the PR was closed.) Very interesting...
The most recent run shows that it only fails when we run multiple tests together -- e.g. ...
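To reproduce the intermittency outside of CI, a simple repeat-runner can help. This is just a minimal sketch; the test command is an assumption and should be replaced with whatever invocation the GHA workflow actually uses:

```python
import subprocess

# Hypothetical invocation of the flaky kill/run tests -- substitute the real CI command.
TEST_CMD = ["python", "tests/cli/test_cli.py", "kill", "run"]

failures = 0
for i in range(25):  # mirror the 25-run experiment from the GHA runs
    result = subprocess.run(TEST_CMD, capture_output=True, text=True)
    if result.returncode != 0:
        failures += 1
        print(f"run {i}: FAILED\n{result.stdout[-2000:]}")
print(f"{failures}/25 runs failed")
```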
Looking at the logs of a test failure, here's what we see. At the bottom of the rest-server logs (where ...
So, in particular, we see that ...
What we see is that the ...
OK, now that I've added more logging, I see the following bundle metadata at the end of ...:

```json
{
"actions":[
"kill"
],
"allow_failed_dependencies":false,
"cpu_usage":0.0,
"created":1684306433,
"data_size":4096,
"description":"",
"docker_image":"python@sha256:b45ed695466dfbab748f4b2f73e3f9dc43236deda8c1a6f9eeb5fd3c861e5354",
"exclude_patterns":[
],
"failure_message":"Kill requested",
"last_updated":1684306448,
"memory_usage":0.0703125,
"name":"run-sleep",
"on_preemptible_worker":false,
"remote":"fv-az592-777",
"remote_history":[
"3f7310a8fd23"
],
"request_cpus":1,
"request_disk":"1m",
"request_docker_image":"python:3.6.10-slim-buster",
"request_gpus":0,
"request_memory":"10m",
"request_network":true,
"request_priority":0,
"request_queue":"",
"request_time":"",
"run_status":"Finalizing bundle.",
"staged_status":"Bundle's dependencies are all ready. Waiting for the bundle to be assigned to a worker to be run.",
"started":1684306433,
"store":"",
"tags":[
],
"time":3.096053,
"time_cleaning_up":0.025354385375976562,
"time_preparing":5.472713470458984,
"time_running":5.025737524032593,
"time_system":0,
"time_uploading_results":0.12300682067871094,
"time_user":0
}
```

And, at the beginning of ...:

```json
{
"actions":[
"kill"
],
"allow_failed_dependencies":false,
"created":1684306433,
"description":"",
"exclude_patterns":[
],
"name":"run-sleep",
"on_preemptible_worker":false,
"remote_history":[
"3f7310a8fd23"
],
"request_cpus":1,
"request_disk":"1m",
"request_docker_image":"python:3.6.10-slim-buster",
"request_gpus":0,
"request_memory":"10m",
"request_network":true,
"request_priority":0,
"request_queue":"",
"request_time":"",
"staged_status":"Bundle's dependencies are all ready. Waiting for the bundle to be assigned to a worker to be run.",
"started":1684306433,
"store":"",
"tags":[
]
}
```

Now, we're getting somewhere. The failure message is wiped from the metadata, and the metadata itself changes significantly. It looks like something else is updating the bundle prematurely, before ...
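To pin down exactly which fields get wiped or rewritten between these two snapshots, it's easiest to diff them directly. A small sketch, assuming the two JSON payloads above are saved to `before.json` and `after.json`:

```python
import json

with open("before.json") as f:
    before = json.load(f)
with open("after.json") as f:
    after = json.load(f)

removed = sorted(set(before) - set(after))   # fields present in the first snapshot but wiped in the second
added = sorted(set(after) - set(before))     # fields that only appear in the second snapshot
changed = sorted(k for k in set(before) & set(after) if before[k] != after[k])

print("removed:", removed)   # should include failure_message, run_status, the time_* fields, etc.
print("added:", added)
print("changed:", changed)
```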
Note: I thought maybe it was due to running ..., but that run's bundle metadata looks like this:

```json
{
"actions":[
"kill"
],
"allow_failed_dependencies":false,
"cpu_usage":0.0,
"created":1684309050,
"data_size":4096,
"description":"",
"docker_image":"python@sha256:b45ed695466dfbab748f4b2f73e3f9dc43236deda8c1a6f9eeb5fd3c861e5354",
"exclude_patterns":[
],
"last_updated":1684309071,
"memory_usage":0.069140625,
"name":"run-sleep",
"on_preemptible_worker":false,
"remote":"fv-az842-109",
"remote_history":[
"fd7b126df007"
],
"request_cpus":1,
"request_disk":"1m",
"request_docker_image":"python:3.6.10-slim-buster",
"request_gpus":0,
"request_memory":"10m",
"request_network":true,
"request_priority":0,
"request_queue":"",
"request_time":"",
"run_status":"Finalizing bundle.",
"staged_status":"Bundle's dependencies are all ready. Waiting for the bundle to be assigned to a worker to be run.",
"started":1684309050,
"store":"",
"tags":[
],
"time":3.109395,
"time_cleaning_up":0.022008895874023438,
"time_preparing":5.538380861282349,
"time_running":5.032529592514038,
"time_system":0,
"time_uploading_results":10.242918252944946,
"time_user":0
}
```

... which makes me think that's not the issue. Hmm... it's definitely a race condition, because it's failing intermittently and it's due to a bundle update happening somewhere else...
I don't see where that extra bundle update is occurring...
So, I'm a bit perplexed right now... Where else could a bundle metadata update arise?
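One way to flush out the hidden writer would be to log a stack trace on every bundle-metadata write. A rough sketch of that kind of instrumentation; `update_bundle` is my assumption for where the write happens and should be pointed at the actual model method:

```python
import logging
import traceback

def trace_bundle_updates(model):
    """Wrap the model's bundle-update method so every write logs its call stack."""
    original = model.update_bundle  # assumed method name; point this at the real write path

    def wrapped(bundle, update, *args, **kwargs):
        logging.warning(
            "bundle %s updated with %s\n%s",
            getattr(bundle, "uuid", "?"),
            update,
            "".join(traceback.format_stack(limit=10)),
        )
        return original(bundle, update, *args, **kwargs)

    model.update_bundle = wrapped
```

With that in place, the CI logs would show the stack of whichever code path issues the extra update that wipes the failure message.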
Note: For a correct run, we have the following. Bundle metadata at the end of ...:

```json
{
"actions":[
"kill"
],
"allow_failed_dependencies":false,
"cpu_usage":0.0,
"created":1684308916,
"data_size":4096,
"description":"",
"docker_image":"python@sha256:b45ed695466dfbab748f4b2f73e3f9dc43236deda8c1a6f9eeb5fd3c861e5354",
"exclude_patterns":[
],
"failure_message":"Kill requested",
"last_updated":1684308936,
"memory_usage":0.069921875,
"name":"run-sleep",
"on_preemptible_worker":false,
"remote":"fv-az397-676",
"remote_history":[
"3246b8975815"
],
"request_cpus":1,
"request_disk":"1m",
"request_docker_image":"python:3.6.10-slim-buster",
"request_gpus":0,
"request_memory":"10m",
"request_network":true,
"request_priority":0,
"request_queue":"",
"request_time":"",
"run_status":"Finalizing bundle.",
"staged_status":"Bundle's dependencies are all ready. Waiting for the bundle to be assigned to a worker to be run.",
"started":1684308916,
"store":"",
"tags":[
],
"time":7.116782,
"time_cleaning_up":5.061987400054932,
"time_preparing":0.8744297027587891,
"time_running":9.032560110092163,
"time_system":0,
"time_uploading_results":0.1380014419555664,
"time_user":0
}
```

Bundle metadata at the end of ...:

```json
{
"actions":[
"kill"
],
"allow_failed_dependencies":false,
"cpu_usage":0.0,
"created":1684308916,
"data_size":4096,
"description":"",
"docker_image":"python@sha256:b45ed695466dfbab748f4b2f73e3f9dc43236deda8c1a6f9eeb5fd3c861e5354",
"exclude_patterns":[
],
"failure_message":"Kill requested",
"last_updated":1684308936,
"memory_usage":0.069921875,
"name":"run-sleep",
"on_preemptible_worker":false,
"remote":"fv-az397-676",
"remote_history":[
"3246b8975815"
],
"request_cpus":1,
"request_disk":"1m",
"request_docker_image":"python:3.6.10-slim-buster",
"request_gpus":0,
"request_memory":"10m",
"request_network":true,
"request_priority":0,
"request_queue":"",
"request_time":"",
"run_status":"Finalizing bundle.",
"staged_status":"Bundle's dependencies are all ready. Waiting for the bundle to be assigned to a worker to be run.",
"started":1684308916,
"store":"",
"tags":[
],
"time":7.116782,
"time_cleaning_up":5.061987400054932,
"time_preparing":0.8744297027587891,
"time_running":9.032560110092163,
"time_system":0.0,
"time_uploading_results":0.1380014419555664,
"time_user":0.0
}
```
OK, one observation: the bundle manager logs for successful and failed runs have a difference just before the run that is killed (and that ultimately fails to be registered as killed in the failed runs). For successful runs we get the following logs:
...
and for failed runs we get the following logs:
...
All the failed runs I've seen have at least one ...
…eets into fix-flaky-ci-KILL
…eets into fix-flaky-ci-KILL
…updates and I want to see exactly what's being updated
I was hoping that #4506 would fix this, but it appears that it doesn't, since that's been merged to master and, after merging master in here, we still get errors. As such, I'm still looking for the culprit... The fact that we fixed that error makes me more confident that this issue isn't due to a race condition resulting from DB transaction ordering, and is more likely due to an update happening to the bundle somewhere I haven't seen yet... I just added some more logging. One other thing I'm trying: I added in a sleep to try to induce the ...
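For reference, the sleep experiment amounts to temporarily widening the race window; something like the following, where the placement and the environment-variable name are both assumptions made purely for illustration:

```python
import os
import time

# Temporary instrumentation: delay the suspected write (e.g. in the finalization path)
# so that, if a competing bundle update exists, it reliably lands first and the test
# fails deterministically instead of intermittently.
if os.environ.get("CODALAB_INDUCE_KILL_RACE"):
    time.sleep(5)
```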
…issue. Also clean up logging a little bit
Fix flaky `kill` and `run` test (see #4433). Still parsing the issue... it appears to be related to bundle metadata being altered before the `bundle-manager` can acknowledge a run bundle as finished...