Remove assign_cpu_and_gpu_sets #4412
base: master
Conversation
        % (request_gpus, len(gpuset), len(self.gpuset))
    )

def propose_set(resource_set, request_count):
Is there a chance that the user will actually get fewer CPUs / GPUs than they request? If so, then we might need to keep this function.
We should discuss this in a meeting.
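For context, here is a minimal sketch of what `propose_set` plausibly does; its body is not shown in this hunk, so the details (error message, selection order) are assumptions, not the real implementation. The point relevant to the question above is whether it fails loudly or silently grants fewer resources than requested:

```python
def propose_set(resource_set, request_count):
    # Hedged sketch: pick request_count resources from the worker's free
    # set. As written it raises rather than returning a smaller set, so
    # the user would never silently get fewer CPUs / GPUs than requested.
    if len(resource_set) < request_count:
        raise ValueError(
            'Requested %d resources, but only %d are free'
            % (request_count, len(resource_set))
        )
    # Deterministically take the first request_count entries.
    return set(sorted(resource_set)[:request_count])
```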
@@ -237,19 +237,7 @@ def mount_dependency(dependency, shared_file_system):
        )
        return run_state._replace(stage=RunStage.CLEANING_UP)

    # Check CPU and GPU availability
    try:
        cpuset, gpuset = self.assign_cpu_and_gpu_sets_fn(
Add an if statement -- consider just not doing this check when running with Kubernetes.
@AndrewJGaut any thoughts on this?
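For concreteness, a minimal sketch of the guard being suggested; the `self.runtime` attribute and the argument names passed to `assign_cpu_and_gpu_sets_fn` are hypothetical, since neither appears in this hunk:

```python
# Hedged sketch: skip the local CPU/GPU set check on Kubernetes, where
# the cluster scheduler allocates resources from the pod spec instead.
if self.runtime != 'kubernetes':  # hypothetical attribute name
    cpuset, gpuset = self.assign_cpu_and_gpu_sets_fn(
        run_state.resources.cpus, run_state.resources.gpus
    )
```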
I believe you're correct @epicfaace, and here's why: we do check whether the worker has enough resources in the bundle-manager, and we keep a tally of the worker's CPUs and GPUs (e.g. see here). Now, we could run into issues if, for instance, the bundle_manager were starting bundles in a multi-threaded fashion (and sent a start message to the same worker multiple times before decrementing worker['cpus'] and/or worker['gpus'] in that function). However, the function in which those resources are decremented does not return until the rest-server has received the message and forwarded it to the worker (or it returns False, in which case no bundle is started anyway). Therefore, as long as the bundle_manager is single-threaded, there can never be a race condition on the worker CPU and GPU counts. Since the bundle-manager will never schedule a run on a worker unless it has sufficient resources for all of its running bundles, it should never be the case that there aren't enough CPUs and/or GPUs on the worker to run a bundle. Therefore, the check can be removed.
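To make that invariant concrete, here is a hedged sketch of the bookkeeping described above; `send_start_message` and the `worker` dict fields stand in for the real bundle-manager API, whose exact names may differ:

```python
# Sketch of the single-threaded scheduling invariant described above.
# send_start_message stands in for the real rest-server call; per the
# comment, it does not return until the worker has received the message
# (or it returns False and no bundle is started).
def try_start_bundle(worker, requested_cpus, requested_gpus, send_start_message):
    # Only schedule on a worker with enough free resources for all of
    # its running bundles plus this one.
    if worker['cpus'] < requested_cpus or worker['gpus'] < requested_gpus:
        return False
    if not send_start_message(worker):
        return False  # start failed; nothing to decrement
    # Because this whole function runs on a single thread, no second
    # start message can be sent before the tally is decremented here,
    # so the worker can never be double-booked.
    worker['cpus'] -= requested_cpus
    worker['gpus'] -= requested_gpus
    return True
```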
Some extra interesting tidbits:
Reasons for making this change
Remove assign_cpu_and_gpu_sets, as I don't think this check is needed. This is important because it simplifies #4385: workers won't need to do checks on their own CPU / GPU sets (they can just request pods with specific GPUs). I could have just bypassed this code path for the Kubernetes runtime, but it seems to be an unnecessary check we can do away with entirely, because the bundle manager already ensures the worker has enough CPUs / GPUs available to run a bundle. What do you think @AndrewJGaut @percyliang ?
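As a sketch of the alternative mentioned above (requesting pods with specific GPU counts rather than pinning device sets locally), using the official `kubernetes` Python client; the container name, image, and resource counts below are placeholders, not values from this codebase:

```python
from kubernetes import client

# Hedged sketch: declare resource counts in the pod spec and let the
# Kubernetes scheduler place the pod, instead of the worker tracking
# its own cpuset/gpuset.
container = client.V1Container(
    name='bundle-run',           # placeholder name
    image='codalab/worker-gpu',  # placeholder image
    resources=client.V1ResourceRequirements(
        limits={'cpu': '4', 'nvidia.com/gpu': '1'},
    ),
)
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name='bundle-run'),
    spec=client.V1PodSpec(containers=[container], restart_policy='Never'),
)
```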