Allow for finer-grained control of concurrency in file transfers #750
Thanks so much for this @moradology. Testing this currently in leap-stc/climsim_feedstock#7 |
While I am testing this I wanted to start a discussion about how this would be used/documented (and furthermore how/if injection would work here). So my main question is: Once this works, should we entirely replace the caching option in Whether we go ahead with this as an optional stage or a replacement for the |
Also testing this over at leap-stc/cmip6-leap-feedstock#170 |
The only reason I can see to keep caching around in As for recipes not containing information about the specific storage location, I think your intuition is correct. There is some discussion here which may be of interest. I'm not sure there's strong consensus about how to approach things, though. For my part, I'd prefer to see the dependency injection either abandoned (with people just forking/updating recipes as needed) or facilitated via the (to my mind, at least) easier-to-reason-about mechanism of Python functions |
That seems like a great way forward. Maybe start a deprecation cycle by implementing a warning when cache is not None in
Curious to discuss this further in the coming weeks. I have to admit that I do not 100% understand that issue, but I am curious to learn more. |
Here is a suggestion to further improve the performance of this stage for recipes with a ton of files. Curious to hear if this would add too much complexity.
The motivation here is that for the transfer we are limited by external factors (at what point does the server think we are DDoSing it?), but for checking whether files exist we can probably have a much higher concurrency. So currently we are doing (in an abstract way) this in every executor:
Could we modify it to something like:
|
I'm thinking a simple transform to filter down already-transferred URLs is sensible ahead of more costly file transfers. Here's a flag that should pre-filter as desired: https://github.com/pangeo-forge/pangeo-forge-recipes/pull/750/files#diff-8bac120398898793cd4f9daf94551b1f3d3f1867bed8a68b14cceed49d6dc30fR220 |
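To make the pre-filtering idea concrete, here is a minimal sketch of the two-phase approach discussed above: a cheap, highly concurrent existence check followed by a more conservative transfer. It is not the code from this PR; `prefilter_missing`, `transfer_missing`, and the `exists`/`transfer` callables, as well as the default concurrency values, are hypothetical names and numbers chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def prefilter_missing(urls, exists, check_concurrency=32):
    """Return the URLs for which exists(url) is False.

    Existence checks are cheap, so they run with a high thread count.
    `exists` is a hypothetical callable, e.g. a lookup against the cache.
    """
    with ThreadPoolExecutor(max_workers=check_concurrency) as pool:
        present = list(pool.map(exists, urls))  # order matches `urls`
    return [url for url, is_present in zip(urls, present) if not is_present]


def transfer_missing(urls, exists, transfer,
                     transfer_concurrency=4, check_concurrency=32):
    """Filter out already-transferred URLs, then transfer the rest.

    Transfers are limited by the remote server, so they use a much
    smaller thread pool than the existence checks.
    """
    todo = prefilter_missing(urls, exists, check_concurrency)
    with ThreadPoolExecutor(max_workers=transfer_concurrency) as pool:
        list(pool.map(transfer, todo))  # consume to surface any exceptions
    return todo
```

Keeping the two pools separate means the existence check can be tuned independently of whatever rate the source server tolerates for downloads.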
This PR enables transfer of files with two knobs to manage concurrency.
- `max_executors` tells the transform how many groups of URLs to create (which should generally set an upper bound on the number of required worker nodes in the cluster).
- `concurrency_per_executor` tells the system how many URLs per group can be concurrently opened for file transfer at maximum.

Total concurrency across the cluster will, of course, be a function of these two values, with a theoretical maximum of `max_executors * concurrency_per_executor`.
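As a rough illustration of how the two knobs interact, the sketch below splits URLs into at most `max_executors` groups and bounds in-flight transfers within one group using a semaphore. This is an assumption about the general technique, not the implementation in this PR; `group_urls`, `transfer_group`, and the `transfer_one` coroutine are hypothetical names.

```python
import asyncio


def group_urls(urls, max_executors):
    """Round-robin URLs into at most `max_executors` groups."""
    n = max(1, min(max_executors, len(urls)))
    return [urls[i::n] for i in range(n)]


async def transfer_group(urls, concurrency_per_executor, transfer_one):
    """Transfer one group, with at most `concurrency_per_executor` in flight.

    `transfer_one` is a hypothetical coroutine function taking a URL.
    """
    sem = asyncio.Semaphore(concurrency_per_executor)

    async def bounded(url):
        async with sem:  # blocks while the per-group limit is reached
            return await transfer_one(url)

    # gather returns results in the same order as `urls`
    return await asyncio.gather(*(bounded(url) for url in urls))
```

If each group runs on its own worker, no more than `max_executors * concurrency_per_executor` transfers can be open at once, which matches the theoretical maximum described above.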