vine: measure and report worker sandbox used #3889
Conversation
Isn't the measurement already done in the function …?
@btovar It is not guaranteed that the measurements from the call in the loop will be up to date or present at all. I tested with some of the examples before the change, and the sandbox_size field was never populated before the task completed. I'm not sure if it was a sequencing issue or if the process spent very little time in the procs_running list.
I'd caution against measuring more often. Measuring the disk greatly increases the overhead; we even added a check for that in …. Further, by the time you are doing this measurement, the task is already ready to be sent back. I worry about regular Python tasks that ran for a few seconds and then have to wait …. Another thing to consider is that there …. Given all that, I'm not sure what benefit we get by measuring the sandboxes in this way. Once the task is done, we only care about the size of the cache, right?
I agree the extra overhead is concerning. The idea was to get a more accurate picture of sandbox usage in order to provide a more accurate task disk allocation. The cache/input files are considered separately, since they are linked into the sandbox and not counted toward the size used by the task; some of them might already exist in the cache and not require any more disk. Since it is difficult to individually track and measure the intermediate products and outputs of a task, measuring the amount of sandbox used at the end would let us continue to schedule with a check like: can_schedule = worker_disk_available >= required_input_size + category_sandbox_minimum. A rough sketch of that check is below.
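As an illustration only, that feasibility check could look like the following sketch. The names (`can_schedule`, `category_sandbox_minimum_mb`, etc.) are hypothetical and this is not the actual TaskVine scheduler code:

```c
/* Sketch (hypothetical names): only dispatch a task to a worker if the
   worker's available disk can hold the task's input files plus the minimum
   sandbox size observed for its category. */
#include <stdint.h>

int can_schedule(int64_t worker_disk_available_mb,
                 int64_t task_input_size_mb,
                 int64_t category_sandbox_minimum_mb)
{
	/* inputs are counted separately from the sandbox because cached files
	   are hard-linked in and may already be present on the worker */
	return worker_disk_available_mb >=
	       task_input_size_mb + category_sandbox_minimum_mb;
}
```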
I'm concerned that this measurement is going to create a train effect where tasks that finish around the same time have to wait for the measurement of the others. Is there a way you can move the measurement to when the process finishes so that it can be done in parallel? Also, I would have it disabled by default; my strong hunch is that this will increase the time tasks wait to be retrieved. If the problem is that tasks get little disk at the end, what about dividing up the disk of the worker? Say, reserve x% of the disk for task sandboxes?
It appears there was a bug in enforce_process_limits where we were looking at a disk field attached to the process rather than the individual task. I am not sure of the purpose of process->disk; it doesn't seem to get populated. With this change, we seem to measure disk often enough in the loop that it will report accurate values back to the manager; e.g., I removed the disk measurement after each completed task.
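A hedged sketch of the kind of check being described: the real enforce_process_limits uses TaskVine's internal structures, and the struct and field names here (`sandbox_used_mb`, `disk_requested_mb`) are assumptions, not the actual code.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the task and process structures. */
struct vine_task_like {
	int64_t disk_requested_mb; /* disk limit the manager assigned to this task */
};

struct vine_process_like {
	struct vine_task_like *task;
	int64_t sandbox_used_mb;   /* most recent periodic sandbox measurement */
};

/* The fix discussed above: compare the measured sandbox against the limit on
   the task, rather than a disk field on the process that never gets populated. */
int exceeds_disk_limit(const struct vine_process_like *p)
{
	if (p->task->disk_requested_mb <= 0)
		return 0; /* no limit set for this task */
	return p->sandbox_used_mb > p->task->disk_requested_mb;
}
```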
There is a function vine_process_compute_disk_needed that populates it but is never called.
Nice catch!
So does the sandbox measurement match our current understanding by skipping over linked input items?
It seems we hard-link from the cache unless otherwise specified, and the measurement will not skip over hard links, so not yet. Looking at this unused code, though, I am wondering about the paradigm where the worker is responsible for reducing the sandbox allocation by subtracting the input files. It does not seem like a bad idea.
This is one possible solution for accurately measuring the sandbox while avoiding recursive stats of hard-linked input files/directories. I have tested that it successfully measures only task outputs in the sandbox (declared or undeclared) and reports the disk consumed back to the manager in the complete message. Since the sandbox measurement is only periodic, there can be inaccurate reporting for very short-running tasks or some other specific scenarios. I am relying on the assumption that tasks in danger of filling up their sandbox are probably long-running.
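For illustration, one way to walk a sandbox while ignoring hard-linked inputs is sketched below. This is not the worker's actual implementation: it assumes hard-linked cache inputs are the only entries with a link count greater than one, whereas the PR itself tracks an explicit set of excluded input paths.

```c
/* Sketch: sum the sizes of regular files in a sandbox directory, skipping
   files that are hard links (assumed to be inputs linked from the cache). */
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdint.h>
#include <sys/stat.h>

static int64_t sandbox_bytes = 0;

static int add_entry(const char *path, const struct stat *st, int typeflag, struct FTW *ftwbuf)
{
	(void)path;
	(void)ftwbuf;
	/* count only regular files that are not hard-linked into the cache */
	if (typeflag == FTW_F && st->st_nlink == 1)
		sandbox_bytes += st->st_size;
	return 0;
}

int64_t measure_sandbox(const char *sandbox_dir)
{
	sandbox_bytes = 0;
	/* FTW_PHYS: do not follow symbolic links out of the sandbox */
	if (nftw(sandbox_dir, add_entry, 64, FTW_PHYS) < 0)
		return -1;
	return sandbox_bytes;
}
```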
If this is RTM, please check the appropriate boxes in the PR...
@btovar does that change work for you?
Ok, per our discussion today, go ahead and add controls for disk management separately so it makes sense to you. Once that's all working, we will ask Ben for some advice on how to roll it into the existing resource management framework. I think what we have learned is that disk is a critical resource (things fail if it is exhausted), and so managing it cannot be optional. The worker has to be measuring the cache and the sandboxes even if …
I confirmed that this is working to update the minimum allocation described in the category as tasks complete. Even without the resource monitor and other category features, we still have a handle to the default category attached to the tasks, and the field I added can be referenced as needed. So the proportional resources still operate as normal. But if disk is constrained, it will keep allocating portions of the disk until it reaches a minimum, which is the largest sandbox size we have seen for the category. This is all in the default case, where no resources are requested by the user and the tasks are not assigned a specific category. It would be interesting to have a setting to try to pack the allocations as close to the minimum as possible, but I'm not sure how useful it would be to try to budget sandbox space on running tasks, unless there is a use case for heavy overcommit on cores.
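The bookkeeping being described could be sketched roughly as follows. The names (`category_like`, `min_sandbox_mb`) are hypothetical and only illustrate the idea of clamping allocations by the largest sandbox observed:

```c
#include <stdint.h>

/* Hypothetical stand-in for a task category. */
struct category_like {
	int64_t min_sandbox_mb; /* largest sandbox measured so far for this category */
};

/* As tasks complete, remember the largest sandbox seen. */
void category_update_min_sandbox(struct category_like *c, int64_t sandbox_used_mb)
{
	if (sandbox_used_mb > c->min_sandbox_mb)
		c->min_sandbox_mb = sandbox_used_mb;
}

/* Proportional allocation still applies, but it is clamped below by the
   observed category minimum. */
int64_t category_disk_allocation(const struct category_like *c, int64_t proportional_share_mb)
{
	return proportional_share_mb > c->min_sandbox_mb
	       ? proportional_share_mb
	       : c->min_sandbox_mb;
}
```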
@btovar can you help us out on this PR? After some long discussions, I think that Colin has the right policy here: … The technical piece I am not sure about is …. Can you suggest a way to resolve the latter bit?
One small amendment to the policy you described:
The frequency of disk measurement has not changed; we currently measure at most every 5 seconds. So short-running tasks may finish before we measure them, or we may get one measurement from a task before it generates some output file and therefore under-report. We hold onto the max value seen, with the hope that we will eventually measure at the right time. It is imperfect, but in a previous conversation we ruled out the idea of a disk measurement at the end of each task. Perhaps we could bump up the frequency without too much trouble.
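A minimal sketch of this "measure at most every few seconds, keep the max" idea, with hypothetical names (the real interval and plumbing live inside the worker loop):

```c
#include <stdint.h>
#include <time.h>

struct sandbox_monitor {
	time_t last_measurement;
	int64_t max_observed_mb;
};

/* measure_fn is whatever actually walks the sandbox; it is called at most
   every interval_sec seconds, and only the largest value seen is retained. */
void sandbox_monitor_poll(struct sandbox_monitor *m, int interval_sec,
                          int64_t (*measure_fn)(void))
{
	time_t now = time(NULL);
	if (now - m->last_measurement < interval_sec)
		return;
	m->last_measurement = now;

	int64_t used = measure_fn();
	if (used > m->max_observed_mb)
		m->max_observed_mb = used;
}
```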
I think this one is ready. Colin, does it work for you?
Looks good to me!
…ab#3889)
* measure sandbox on completion and report to manager
* fix disk check bug
* remove task inputs from sandbox measurement
* remove debug
* update measurement strategy
* format
* refer to sandbox usage for category minimum
* format
* move max_disk_use to vine_stats->max_sandbox
* use hash_table to encode excluded paths
* minimum disk requested is always at least input size
* do not add one to sandbox_used
* pad with at least bucket_size/2 as we do for first allocations
* update wq
* treat sandbox as any other disk allocation
* add VINE_RESULT_SANDBOX_EXHAUSTION
* use padded sandbox as min
* treat sandboxes differently than regular allocations; otherwise FIXED allocations that did not specify disk are not retried
* format
* add API call to enable/disable proportional resources
* api
* consider min resources for proportion
* use user values
* disable whole task proportion if specific resource was given
* rename max_sandbox -> min_sandbox
* do not double measure disk
* set minimum disk usage from sandbox
* update comments
* disk only specified output and ephemeral files
* proportion with available disk - cached
* use already inuse_cache value in count_worker_resources
* use available_disk with t->input_files_size
* turn off prop whole tasks only for mem and disk
* check for user resources not specified
* fix conflict, input_size
* format
* macros to convert bytes to megabytes, etc.
* correctly account for inuse_cache units
* add DISK to vine_worker test
* format

Co-authored-by: Benjamin Tovar <[email protected]>
Proposed Changes
At the worker, measure the sandbox of completed processes and send this information back to the manager.
This also includes changes to the method the worker uses to measure the sandbox: previously it included hard-linked input files, which do not belong in the sandbox measurement.
Merge Checklist
The following items must be completed before PRs can be merged.
Check these off to verify you have completed all steps.
- make test: Run local tests prior to pushing.
- make format: Format source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python).
- make lint: Run lint on source code prior to pushing.