Allow for finer-grained control of concurrency in file transfers #750

moradology · 2024-05-30T19:40:32Z

This PR enables transfer of files with two knobs to manage concurrency.
max_executors tells the transform how many groups of URLs to create (which should generally set an upper bound on the number of worker nodes required in the cluster).
concurrency_per_executor tells the system how many URLs per group can be open for file transfer at the same time.

Total concurrency across the cluster will, of course, be a function of these two values, with a theoretical maximum of max_executors * concurrency_per_executor.
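
A minimal sketch of how the two knobs could interact, not the code in this PR. The helper names (assign_group, group_urls, transfer_group) and the transfer callable are hypothetical. Each URL is hashed to one of max_executors groups with hashlib so the assignment is stable across processes, and each group is then transferred by a thread pool capped at concurrency_per_executor.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def assign_group(url: str, max_executors: int) -> int:
    """Map a URL to a stable group index in [0, max_executors)."""
    # hashlib is used instead of the built-in hash(), which is salted per process.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % max_executors


def group_urls(urls, max_executors):
    """Bucket URLs into at most max_executors groups."""
    groups = {}
    for url in urls:
        groups.setdefault(assign_group(url, max_executors), []).append(url)
    return groups


def transfer_group(urls, transfer, concurrency_per_executor):
    """Run `transfer` (a hypothetical callable) on one group with bounded concurrency."""
    with ThreadPoolExecutor(max_workers=concurrency_per_executor) as pool:
        # pool.map yields results in the same order as the input URLs.
        return list(pool.map(transfer, urls))
```

With this layout, at most max_executors groups exist and each runs at most concurrency_per_executor transfers at once, which gives the theoretical ceiling described above.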

jbusecke · 2024-06-03T15:39:05Z

Thanks so much for this @moradology. Testing this currently in leap-stc/climsim_feedstock#7

jbusecke · 2024-06-04T13:46:50Z

While I am testing this, I wanted to start a discussion about how this would be used and documented (and, furthermore, whether and how injection would work here).

So my main question is: once this works, should we entirely replace the caching option in OpenURLWithFSSpec? I wonder if there is any advantage in using that instead of this stage. Maybe somebody here has a use case that would break? From my POV a separate stage is much more explicit and easier to understand. If we decide to go this route I would opt for a more intuitive name, though. How about simply 'CacheFiles'?

Whether we go ahead with this as an optional stage or as a replacement for the cache=... option in OpenURLWithFSSpec, the way this is currently set up we are kind of breaking the paradigm of 'the recipe should not contain any information about the storage', which is usually passed as config to the runner.
Can we have a chat about whether we want to adapt the injection logic, so that I can define my cache target in the config like before? (In my testing I needed to move the specification of the cache target into the recipe; see leap-stc/climsim_feedstock#7.)

jbusecke · 2024-06-04T17:36:07Z

Also testing this over at leap-stc/cmip6-leap-feedstock#170

moradology · 2024-06-04T18:52:42Z

The only reason I can see to keep caching around in OpenURLWithFSSpec is to maintain API compatibility - and then, only for a limited time. I personally favor being explicit about things like this to avoid painful contortions when expectations/needs change in ways we don't anticipate.

As for recipes not containing information about the specific storage location, I think your intuition is correct. There is some discussion here which may be of interest. I'm not sure there's strong consensus about how to approach things, though. For my part, I'd prefer to see the dependency injection either abandoned (with people just forking/updating recipes as needed) or facilitated via the (to my mind, at least) easier-to-reason-about mechanism of Python functions.

jbusecke · 2024-06-05T15:43:26Z

> The only reason I can see to keep caching around in OpenURLWithFSSpec is to maintain API compatibility - and then, only for a limited time. I personally favor being explicit about things like this to avoid painful contortions when expectations/needs change in ways we don't anticipate.

That seems like a great way forward. Maybe start a deprecation cycle by emitting a warning when cache is not None in OpenURLWithFSSpec?
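
A minimal sketch of what such a warning could look like, assuming it is raised wherever the cache argument is first received. The helper name warn_if_cache_set and the message text are hypothetical, not part of pangeo-forge-recipes.

```python
import warnings


def warn_if_cache_set(cache):
    """Hypothetical helper: emit a DeprecationWarning when a cache is passed."""
    if cache is not None:
        warnings.warn(
            "Passing `cache` to OpenURLWithFSSpec is deprecated; "
            "consider using a separate file transfer stage instead.",
            DeprecationWarning,
            # stacklevel=2 attributes the warning to the caller, not this helper.
            stacklevel=2,
        )
```

Note that Python hides DeprecationWarning by default outside of `__main__` and test runners, so a FutureWarning may be the better choice if recipe authors should always see it.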

> As for recipes not containing information about the specific storage location, I think your intuition is correct. There is some discussion here which may be of interest. I'm not sure there's strong consensus about how to approach things though, for my part, I'd prefer to see the dependency injection either abandoned (and people just fork/update recipes as needed) or facilitated via the (for my mind, at least) easier to reason about mechanism of python functions

Curious to discuss this further in the coming weeks. I have to admit that I do not 100% understand that issue, but I am keen to learn more.

jbusecke · 2024-06-05T16:17:58Z

Here is a suggestion to further improve the performance of this stage for recipes with a ton of files. Curious to hear if this would add too much complexity.

Can we decouple the 'check cache' and actual 'transfer/download to cache'?

The motivation here is that the transfer is limited by external factors (at what point does the server think we are DDoSing it?), but for checking whether files exist we can probably use a much higher concurrency.

So currently we are doing (in an abstract way) this in every executor:

cached_files = []
# limit to external concurrency
for file in file_group:
    cached_url = check_and_cache(file)
    cached_files.append(cached_url)

Could we modify it to something like:

already_cached_files = []
needs_caching_files = []

# existence checks are cheap, so this loop can run at much higher concurrency
for file in file_group:
    status = check(file)
    if status == 'cached':
        already_cached_files.append(file)
    elif status == 'not_cached':
        needs_caching_files.append(file)

# transfers stay limited to the external concurrency
cached_files = []
for file in needs_caching_files:
    cached_url = cache(file)
    cached_files.append(cached_url)

return already_cached_files + cached_files  # this might mess up order, not sure how we would handle that.
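
A runnable sketch of the two-phase idea above, assuming thread-based concurrency. The function check_then_transfer and its check and transfer callables are hypothetical: check(url) is assumed to return the cached location or None, and transfer(url) is assumed to copy the file and return its cached location. Results are rebuilt from the original URL list, which is one way to handle the ordering concern.

```python
from concurrent.futures import ThreadPoolExecutor


def check_then_transfer(urls, check, transfer,
                        check_concurrency=64, transfer_concurrency=4):
    """Check the cache with high concurrency, then transfer only what is missing."""
    # Phase 1: cheap existence checks, many at once.
    with ThreadPoolExecutor(max_workers=check_concurrency) as pool:
        cached = list(pool.map(check, urls))

    # Phase 2: costly transfers, limited to what the remote server tolerates.
    missing = [url for url, loc in zip(urls, cached) if loc is None]
    with ThreadPoolExecutor(max_workers=transfer_concurrency) as pool:
        transferred = dict(zip(missing, pool.map(transfer, missing)))

    # Rebuild the output in the original input order.
    return [loc if loc is not None else transferred[url]
            for url, loc in zip(urls, cached)]
```

The default concurrency values here are placeholders; sensible limits depend on the cache store and on the server being downloaded from.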

moradology · 2024-06-10T15:06:48Z

I'm thinking a simple transform to filter down already-transferred URLs is sensible ahead of more costly file transfers.

Here's a flag that should pre-filter as desired: https://github.com/pangeo-forge/pangeo-forge-recipes/pull/750/files#diff-8bac120398898793cd4f9daf94551b1f3d3f1867bed8a68b14cceed49d6dc30fR220

jbusecke mentioned this pull request Jun 3, 2024

Test CheckpointFileTransfer from recipes PR leap-stc/climsim_feedstock#7

Open

3 tasks

moradology added 5 commits June 4, 2024 11:52

Allow for finer-grained control of concurrency in file transfers

e760b38

Update formatting of docstrings

e262057

Bit by the snake

86e8f39

Unkey after grouping to keep code cleaner

6fa56ea

Enable sync patching fsspec for file transfers

7167f96

moradology force-pushed the feature/concurrency-control branch from fa048f0 to 7167f96 Compare June 4, 2024 16:57

jbusecke mentioned this pull request Jun 4, 2024

Test new httpfs-sync and concurrent download leap-stc/cmip6-leap-feedstock#170

Merged

moradology added 2 commits June 4, 2024 14:09

Add exponential backoff to avoid rate limiting issues

e7b60e0

Linter demands unreachable exception. OK, fine

487f039

moradology force-pushed the feature/concurrency-control branch from 54a0b10 to 487f039 Compare June 4, 2024 19:21

Don't check size to avoid rate limiting issues

1bceb26

moradology mentioned this pull request Jun 10, 2024

Add verify_existing option to cache #630

Open

moradology added 3 commits June 10, 2024 10:46

Add flag to filter out already-transferred files

1b691cc

Use hashlib for determinacy of keyed groups

7bb3509

Lint

a851610

moradology force-pushed the feature/concurrency-control branch from 91a217e to a851610 Compare June 10, 2024 17:09

moradology mentioned this pull request Jun 12, 2024

Stage to check if URLS are alive? #719

Closed

moradology added 3 commits June 12, 2024 13:18

Attempt to log the file that is failing during xarray open

c306c95

Add error handling and more logging for file transfer

f78ce0d

lint

1c9321a

moradology force-pushed the feature/concurrency-control branch from 59c4579 to 1c9321a Compare June 12, 2024 19:05

Add verification of copy

5ee286d

moradology added 2 commits June 13, 2024 13:20

Add logging to understand keyerror during index

71f4061

Ensure that all URLs are returned

55a6580
