Expose column optimization more explicitly? #559
Comments
So, this isn't something like:

where the final step doesn't exist yet as a user function, but is done internally in the column optimization.
In principle yes, but

the typetracing and necessary_columns steps are the only expensive ones; making the graph is essentially free. The user probably does

We need a way, though, to trace a function that could return anything; I'd say it's a limitation for physics analyses to only be able to return a single awkward array. It's not uncommon to track metadata in

Apart from that: I think I can write
I suppose I could use:

```python
io = uproot.dask({"https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root": "Events"})

dak.report_necessary_columns(dask.delayed(lambda io: io.nJet)(io.partitions[0]))
# {'from-uproot-138b384738005b2a7a7eefbb600ca6c2': frozenset({'nJet'})}
```

That's great! However, this doesn't work when returning additional metadata (here `None`):

```python
dak.report_necessary_columns(dask.delayed(lambda io: (io.nJet, None))(io.partitions[0]))
# {'from-uproot-138b384738005b2a7a7eefbb600ca6c2': frozenset()}

# but it does compute correctly:
dask.delayed(lambda io: (io.nJet, None))(io.partitions[0]).compute()
# (<Array [5, 8, 5, 3, 5, 8, 4, 4, ..., 4, 9, 3, 2, 3, 1, 6, 2] type='40 * uint32'>, None)
```

If that worked, I'd like to use this for tracing "manually". You've got a good point that "making the graph is essentially free", and there's already

Update:

```python
dak.report_necessary_columns(dask.delayed(lambda io: (io.nJet + 1, None))(io.partitions[0]))
# {'from-uproot-587894178ac0719cd1a04ae5f523da00': frozenset({'nJet'})}
```

In the single-return case the touching was forced because of the explicit single return; something to be aware of.
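The single- vs. wrapped-return difference can be mimicked with a toy model of lazy touching (a self-contained sketch, not dask-awkward internals; all names here are made up): a column only counts as touched once it is actually consumed, e.g. by an operation on it, or by being forced as the sole return value.

```python
class LazyColumn:
    """Toy stand-in for a typetracer array: records a touch only when used."""
    def __init__(self, name, touched):
        self.name = name
        self.touched = touched

    def touch(self):
        self.touched.add(self.name)
        return self

    def __add__(self, other):
        # any real operation on the column forces a touch
        return self.touch()


class Tracer:
    """Toy stand-in for a typetraced input: attribute access alone is free."""
    def __init__(self):
        self.touched = set()

    def __getattr__(self, name):
        return LazyColumn(name, self.touched)


def report_touched_columns(fn):
    """Run fn on a fresh tracer; force a touch only for a bare single return,
    mimicking the behaviour observed above."""
    tracer = Tracer()
    out = fn(tracer)
    if isinstance(out, LazyColumn):
        out.touch()
    return frozenset(tracer.touched)


report_touched_columns(lambda io: io.nJet)              # frozenset({'nJet'})
report_touched_columns(lambda io: (io.nJet, None))      # frozenset()
report_touched_columns(lambda io: (io.nJet + 1, None))  # frozenset({'nJet'})
```

In this toy the tuple case reports nothing because the column is created but never consumed, which matches the observation that an operation (`+ 1`) or an explicit single return is what forces the touch.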
Tagging the issue to remind myself to think about this when I start cleaning up/reorganizing the column-joining code.
This would also allow for an improvement of the optimization process by:

This improvement would eliminate the need to build repetitive graphs for similar datasets, by just cloning them instead; and it would eliminate repetitive typetracer runs. In many cases for HEP analyses I could even see that only very small differences are needed for analyzing This currently assumes that
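The cloning idea can be sketched with plain dict graphs (a toy in the dask task-graph style; `read`, `analysis`, and the key layout are made-up stand-ins, not dask-awkward internals): the structure and the already-chosen columns are reused for a second dataset without a second typetracer run.

```python
def read(dataset, columns):
    # stand-in for a column-pruned partition read
    return {c: [1, 2] for c in columns}

def analysis(events):
    return sum(events["nJet"])

def get(graph, key):
    """Tiny recursive scheduler: tasks are (func, *args) tuples, and
    arguments that are themselves graph keys are resolved first."""
    func, *args = graph[key]
    args = [get(graph, a) if (isinstance(a, tuple) and a in graph) else a
            for a in args]
    return func(*args)

def clone_graph(graph, old_ds, new_ds):
    """Rewrite keys and literal dataset names, reusing the structure (and
    the already-chosen columns) without re-running any typetracing."""
    def sub(x):
        if isinstance(x, tuple):
            return tuple(sub(v) for v in x)
        return new_ds if x == old_ds else x
    return {sub(k): tuple(sub(v) for v in task) for k, task in graph.items()}

# graph for dataset "A", with columns already optimized down to ("nJet",)
graph_a = {
    ("read", "A"): (read, "A", ("nJet",)),
    ("analysis", "A"): (analysis, ("read", "A")),
}
graph_b = clone_graph(graph_a, "A", "B")  # free: no second typetracer run
```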
Totally agree, @pfackeldey. We probably need a flag on the IO layer to say "this has already picked explicit columns, don't optimize", so that you can do this run-once, apply-many-times model with clone. Probably we would do the initial column choosing before clone, but it should be fast and not matter.
Hi,

it would be nice if there were a way to use the column optimization without having to "buy into" building (large) compute graphs. One thing that could be useful is to expose an API for the column optimization specifically, enabled through a new namespace in dask-awkward called `manual` (because one would be doing the column optimization manually instead of automatically as part of the usual dask graph optimization).

Consider the following example:

Why is this useful? One can now continue with a non-dask compute model but still benefit from dask-awkward's partitioning and column optimization, e.g.:
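A minimal sketch of what such a run-the-trace-yourself workflow could look like (everything here is a hypothetical stand-in; `trace_columns` and `read_partition` are not actual dask-awkward API):

```python
def trace_columns(analysis):
    """Stand-in for something like dak.manual.typetrace: run the analysis
    on a recording proxy and return the columns it touched."""
    touched = set()

    class Proxy:
        def __getattr__(self, name):
            touched.add(name)
            return [0.0]  # dummy column data

    analysis(Proxy())
    return frozenset(touched)

def read_partition(dataset, partition, columns):
    """Stand-in for a column-pruned partition read (e.g. via uproot)."""
    return type("Events", (), {c: [1.0, 2.0] for c in columns})()

def my_analysis(events):
    return sum(events.nJet)

# step 1: trace once (cheap; no graph beyond the one input layer)
columns = trace_columns(my_analysis)

# step 2: plain per-partition loop; no dask graph is built here, so it
# could equally run via multiprocessing or as classic batch jobs
results = [my_analysis(read_partition("nano_dy.root", i, columns))
           for i in range(3)]
```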
This pattern allows running this loop locally, e.g. with multiprocessing, or on batch systems as classic batch jobs; or you can fall back to working with `dask-awkward` on `events` (the normal use case) with e.g. `map_partitions`/`mapfilter`/`dak` operations.

Some additional benefits: tracing failures might be easier to debug, because there would be a simple way to rerun only the tracing step, `dak.manual.typetrace(my_analysis, tracer)`. In addition, one would be able to reuse the `touched_columns` of one tracing step for other input datasets (where appropriate, of course), inspect the `touched_columns` (I suppose there's already `dak.report_necessary_columns` for this), or even manipulate `touched_columns` as needed.

While there's still the standard way of doing things, i.e. using `dask-awkward` operations and its optimization procedures, the `manual` namespace would allow people to benefit from `dask-awkward`'s IO features (column optimization + partitioning) without having to build dask graphs (except for the one AwkwardInputLayer) for the computation of `my_analysis`.

What do you think about this?