
vine: measure and report worker sandbox used #3889

Merged

Conversation

colinthomas-z80
Contributor

@colinthomas-z80 colinthomas-z80 commented Jul 23, 2024

Proposed Changes

At the worker, measure sandbox on completed processes and send this info back to manager.

This also includes work on the method the worker uses to measure the sandbox. Before, it was including hard-linked input files, which do not belong in the sandbox measurement.
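As a rough illustration of the idea, hard-linked inputs can be skipped during the directory walk by checking the link count. This is only a sketch under the assumption that cached inputs are hard-linked into the sandbox; the function name is hypothetical, and the actual change may exclude known input paths rather than rely on `st_nlink`.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical sketch: sum the sizes of files under `path`, skipping
 * regular files with more than one hard link (i.e., inputs linked in
 * from the cache), so that only task-generated data is counted. */
static long long sandbox_size(const char *path)
{
	long long total = 0;
	DIR *dir = opendir(path);
	if (!dir)
		return 0;
	struct dirent *entry;
	while ((entry = readdir(dir))) {
		if (!strcmp(entry->d_name, ".") || !strcmp(entry->d_name, ".."))
			continue;
		char sub[4096];
		snprintf(sub, sizeof(sub), "%s/%s", path, entry->d_name);
		struct stat st;
		if (lstat(sub, &st) < 0)
			continue;
		if (S_ISDIR(st.st_mode))
			total += sandbox_size(sub); /* recurse into subdirectories */
		else if (st.st_nlink <= 1)
			total += st.st_size; /* skip hard-linked inputs */
	}
	closedir(dir);
	return total;
}
```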

Merge Checklist

The following items must be completed before PRs can be merged.
Check these off to verify you have completed all steps.

  • make test Run local tests prior to pushing.
  • make format Format source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python)
  • make lint Run lint on source code prior to pushing.
  • Manual Update Update the manual to reflect user-visible changes.
  • Type Labels Select a GitHub label for the type: bugfix, enhancement, etc.
  • Product Labels Select a GitHub label for the product: TaskVine, Makeflow, etc.
  • PR RTM Mark your PR as ready to merge.

@btovar
Member

btovar commented Jul 23, 2024

Isn't the measurement already done in the function enforce_process_limits?

@colinthomas-z80
Contributor Author

@btovar It is not guaranteed that the measurements from the call in the loop will be up to date or present at all.

I tested with some of the examples before the change, and the sandbox_size field was never populated before the task completed. I'm not sure if it was a sequencing issue or if the process spent very little time in the procs_running list.

@btovar
Member

btovar commented Jul 23, 2024

I'd caution against measuring more often. Measuring the disk greatly increases the overhead. We even added a check for that in enforce_process_limits.

Further, by the time you are doing this measurement the task is already ready to be sent back. I worry about regular Python tasks that ran for a few seconds and then have to wait up to max_time_on_measurement for their sandboxes to be measured.

Another thing to consider is that vine_process_measure_disk(p, options->max_time_on_measurement) will only give you correct results for sandboxes with a small number of files. If there are many files, the measurement will quit once it hits max_time_on_measurement. This function was meant to be used for long-running tasks, where the disk could be measured progressively.

Given all that, I'm not sure what benefit we get by measuring the sandboxes in this way. Once the task is done, we only care about the size of the cache, right?

@colinthomas-z80
Contributor Author

I agree the extra overhead is concerning. The idea was to get a more accurate picture of sandbox usage in order to provide a more accurate task disk allocation.

The cache/input files are to be considered separately since they are linked into the sandbox and not counted towards the size used by the task. Plus some of them might already exist in the cache and not require any more disk.

Since it is difficult to individually track and measure intermediate products and outputs of a task, measuring the amount of sandbox used at the end would let us continue to schedule something like this:

can_schedule = worker_disk_available >= required_input_size + category_sandbox_minimum
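Rendered as a predicate, the check above might look like this; the identifiers mirror the pseudocode, not actual cctools names.

```c
/* Sketch of the scheduling check: a worker is eligible only if its
 * available disk covers the task's input size plus the minimum
 * sandbox size observed for the category. Names are illustrative. */
static int can_schedule(long long worker_disk_available,
                        long long required_input_size,
                        long long category_sandbox_minimum)
{
	return worker_disk_available >= required_input_size + category_sandbox_minimum;
}
```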

@btovar
Member

btovar commented Jul 23, 2024

I'm concerned that this measurement is going to create a train effect where tasks that finish around the same time have to wait for the measurement of others. Is there a way you can move the measurement to when the process finishes, so that it can be done in parallel? Also, I would have it disabled by default. My strong hunch is that this will increase the time tasks wait to be retrieved.

If the problem is that tasks get little disk at the end, what about dividing the disk of the worker? Say, reserve x% of the disk for task sandboxes?

@colinthomas-z80
Contributor Author

colinthomas-z80 commented Jul 24, 2024

It appears there was a bug in enforce_process_limits where we were looking at a disk field attached to the process rather than the individual task. I am not sure what process->disk is for; it doesn't seem to get populated.

With this change, we seem to measure disk often enough in the loop that it will report accurate values back to the manager.

e.g., I removed the disk measurement after each completed task.

@colinthomas-z80
Contributor Author

There is a function vine_process_compute_disk_needed that populates it, but it is never called.

@btovar
Member

btovar commented Jul 24, 2024

nice catch!

@dthain
Member

dthain commented Jul 24, 2024

So does the sandbox measurement match our current understanding by skipping over linked input items?

@colinthomas-z80
Contributor Author

It seems we hard-link from the cache unless otherwise specified, and the measurement will not skip over hard links, so not yet.

Looking at this unused code, though, I am wondering about the paradigm where the worker is responsible for reducing the sandbox allocation by subtracting the input files. It does not seem like a bad idea.

@colinthomas-z80
Contributor Author

This is one possible solution for accurately measuring the sandbox while avoiding recursive stats of hard-linked input files/directories.

I have tested that this successfully measures only task outputs in the sandbox (declared or undeclared) and reports the disk consumed back to the manager in the complete message.

Since the sandbox measurement is only periodic, there can be inaccurate reporting for very short-running tasks or some other specific scenarios. I am relying on the assumption that tasks in danger of filling up their sandbox are probably long-running.

@dthain
Member

dthain commented Aug 7, 2024

If this is RTM, please check the appropriate boxes in the PR...

@colinthomas-z80
Contributor Author

@btovar does that change work for you?

@dthain
Member

dthain commented Aug 15, 2024

Ok, per our discussion today, go ahead and add controls for disk management separately so it makes sense to you.

Once that's all working, we will ask Ben for some advice on how to roll it into the existing resource management framework.

I think what we have learned is that disk is a critical resource (things fail if it is exhausted), and so managing it cannot be optional. The worker has to be measuring the cache and the sandboxes even if q->monitoring_enabled is false.

@colinthomas-z80
Contributor Author

colinthomas-z80 commented Aug 16, 2024

I confirmed that this works to update the minimum allocation described in the category as tasks complete. Even without the resource monitor and other category features, we still have a handle to the default category attached to the tasks, and the field I added can be referenced as needed.

So the proportional resources still operate as normal. But if disk is constrained, the worker will keep allocating portions of the disk until it reaches a minimum, which is the largest sandbox size we have seen for the category. This is all in the default case, where no resources are requested by the user and the tasks are not assigned a specific category.

It would be interesting to have a setting to try and pack the allocations as close to the minimum as possible, but I'm not sure how useful it would be to try and budget sandbox space on running tasks, unless there is a use case for a lot of overcommit on cores.
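The category-minimum update described above can be sketched as follows; the struct and field names are assumptions, not the actual cctools definitions.

```c
/* Hypothetical category record: tracks the largest sandbox size
 * observed across completed tasks of this category. */
struct category_info {
	long long min_sandbox;
};

/* On task completion, raise the category's sandbox minimum to the
 * largest sandbox usage seen so far, so future tasks of the same
 * category are allocated at least that much. */
static void update_category_min(struct category_info *c, long long sandbox_used)
{
	if (sandbox_used > c->min_sandbox)
		c->min_sandbox = sandbox_used;
}
```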

@dthain
Member

dthain commented Aug 30, 2024

@btovar can you help us out on this PR?

After some long discussions, I think that Colin has the right policy here:
0 - The disk usage of a task is defined as the size of intermediate and output files in the sandbox. (excluding input files)
1 - When disk usage is unknown, allocate half of what's remaining, rather than allocating all of it.
2 - Always measure disk usage at task completion.
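Policy (1) might look like this in outline; the function name is hypothetical, and units are whatever the caller uses consistently.

```c
/* Sketch of policy 1: when a task's disk need is unknown, allocate
 * half of the worker's remaining disk rather than all of it, leaving
 * room for concurrent tasks. Once a measured need is known (policy 2),
 * use that instead. */
static long long allocate_disk(long long disk_remaining, long long known_need)
{
	if (known_need > 0)
		return known_need; /* measured at a previous task's completion */
	return disk_remaining / 2; /* unknown: take half of what's left */
}
```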

The technical piece I am not sure about is t->max_disk_use, which seems to me to alias t->resources->disk, and I am concerned that confusion will result down the road.

Can you suggest a way to resolve the latter bit?

@colinthomas-z80
Contributor Author

One small amendment to the policy you described:

2 - Always measure disk usage at task completion.

The frequency of disk measurement has not changed: we currently do it at most every 5 seconds. So short-running tasks may finish before we measure them, or we may get one measurement from a task before it generates some output file and therefore under-report.

We hold onto the max value seen, with the hope that we will eventually measure at the right time. It is imperfect, but in a previous conversation we ruled out the idea of a disk measurement occurring at the end of each task. Perhaps we could bump up the frequency without too much trouble.
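The hold-the-max strategy can be sketched like this, assuming the 5-second interval mentioned above; the struct and function names are hypothetical.

```c
#include <time.h>

#define MEASURE_INTERVAL 5 /* seconds between sandbox measurements */

/* Hypothetical per-process tracker for periodic sandbox measurement. */
struct sandbox_tracker {
	time_t last_measure;  /* time of the last measurement */
	long long max_seen;   /* largest sandbox size observed so far */
};

/* Record a measurement at most once per MEASURE_INTERVAL, keeping the
 * maximum value ever seen so a late peak is not forgotten. The
 * `current_size` argument stands in for an actual disk measurement. */
static void maybe_record(struct sandbox_tracker *t, time_t now, long long current_size)
{
	if (now - t->last_measure < MEASURE_INTERVAL)
		return; /* too soon since the last sample */
	t->last_measure = now;
	if (current_size > t->max_seen)
		t->max_seen = current_size;
}
```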

@btovar
Member

btovar commented Sep 27, 2024

I think this one is ready. Colin, does it work for you?

@colinthomas-z80
Contributor Author

Looks good to me!

@btovar btovar merged commit 24520c9 into cooperative-computing-lab:master Sep 30, 2024
8 checks passed
colinthomas-z80 added a commit to colinthomas-z80/cctools that referenced this pull request Oct 15, 2024
…ab#3889)

* measure sandbox on completion and report to manager

* fix disk check bug

* remove task inputs from sandbox measurement

* remove debug

* update measurement strategy

* format

* refer to sandbox usage for category minimum

* format

* move max_disk_use to vine_stats->max_sandbox

* use hash_table to encode excluded paths

* minimum disk requested is always at least input size

* do not add one to sandbox_used

* pad with at least bucket_size/2 as we do for first allocations

* update wq

* treat sandbox as any other disk allocation

* add VINE_RESULT_SANDBOX_EXHAUSTION

* use padded sandbox as min

* treat sandboxes differently than regular allocations

otherwise FIXED allocations that did not specify disk are not retried.

* format

* add API call to enable/disable proportional resources

* api

* consider min resources for proportion

* use user values

* disable whole task proportion if specific resource was given

* rename max_sandbox -> min_sandbox

* do not double measure disk

* set minimum disk usage from sandbox

* update comments

* disk only specified output and ephemeral files

* proportion with available disk - cached

* use already inuse_cache value in count_worker_resources

* use available_disk with t->input_files_size

* turn off prop whole tasks only for mem and disk

* check for user resources not specified

* fix conflict, input_size

* format

* macros to convert bytes to megabytes, etc.

* correctly account for inuse_cache units

* add DISK to vine_worker test

* format

---------

Co-authored-by: Benjamin Tovar <[email protected]>