Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unclear how to manually do column projection with uproot.dask (and API differences with dask-awkward) #1349

Open
pfackeldey opened this issue Dec 12, 2024 · 6 comments
Labels
question Open-ended questions from users

Comments

@pfackeldey
Copy link
Collaborator

pfackeldey commented Dec 12, 2024

I'm currently looking into adjusting the dask graph layer for the IO to only read a given list of provided columns.

With uproot.dask this looks as follows:

import uproot

io = uproot.dask({"https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root": "Events"})
print(io.dask)
# HighLevelGraph with 1 layers.
# <dask.highlevelgraph.HighLevelGraph object at 0x1670b73d0>
#  0. from-uproot-138b384738005b2a7a7eefbb600ca6c2

# update the IO layer in-place, i.e. do column projection
for lay in io.dask.layers.values():
  lay.io_func = lay.io_func.project_keys(frozenset(["nJet"]))

# now compute, should only load `nJet`
io.compute()
# ... TypeError: PlaceholderArray supports only trivial slices, not int

(I have the impression that the underlying form is not updated accordingly here, or I'm using the projection interface wrongly?)

If I do this with parquet instead though, it works:

import dask_awkward as dak

io = dak.from_parquet("https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.parquet")
print(io.dask)
# HighLevelGraph with 1 layers.
# <dask.highlevelgraph.HighLevelGraph object at 0x15c4ce1a0>
# 0. from-parquet-150809c2f6f63708200b7f130d3a395d

# update the IO layer in-place, i.e. do column projection
for lay in io.dask.layers.values():
  lay.io_func = lay.io_func.project_columns(frozenset(["nJet"]))
  
# now compute, should only load `nJet`
io.compute()
# <Array [{nJet: 5}, {nJet: 8}, ..., {...}, {nJet: 2}] type='40 * {nJet: uint32}'>

I don't understand why the above code example works for dak.from_parquet, but not for uproot.dask, there seems to be a real difference in how the column projection is implemented for the io_func of the dask layer.

Apart from that, the APIs are very similar but also a bit misaligned between uproot vs dask-awkward (probably due to historic reasons), e.g.:

There are probably some more that I've not yet encountered.

In principle, it would be nice if uproot.dask would adhere to the protocols defined here: https://github.com/dask-contrib/dask-awkward/blob/main/src/dask_awkward/lib/io/columnar.py, to eliminate these differences. Some of this seems to be duplicated code in uproot._dask aswell.

I'm currently trying to find a way to unify the APIs and to find the reason of this difference here.
I'd appreciate any input how this should work/behave and how we can ensure that the APIs won't diverge in the future.

(If this API would be unified it would be rather easy to make dak.project_columns possible for all AwkwardInputLayer kinds.)

@pfackeldey pfackeldey added docs Improvements or additions to documentation question Open-ended questions from users and removed docs Improvements or additions to documentation labels Dec 12, 2024
@pfackeldey
Copy link
Collaborator Author

(pinging @lgray @agoose77 @kkothari2001 for ideas, input and help 🙏 )

@martindurant
Copy link

There are different conventions for how the columns are named, and uproot encodes extra things in these names (because some columns are always required even when not in the output).

The correct place to call project is probably on the layer, not the IO function, which provides a place to override the "awkward" column name to the "io convention" names.

(This is something that the one-pass PR explicitly worked around, removing a lot of protocol classes and code in the process)

@pfackeldey
Copy link
Collaborator Author

pfackeldey commented Dec 13, 2024

Thanks for the clarification.
In current dask-awkward one would need to pass a report and state to layer.project(), which is far from user-friendly, especially since state is different for uproot.dask and dak.from_*.
So that's a really good thing about one-pass optimization that it can accept directly the column names inferred by dak.necessary_columns!

@agoose77
Copy link
Collaborator

agoose77 commented Dec 13, 2024

@pfackeldey not much time answer in full here, but the different column projection conventions were intentional -- it reflects the different concepts of "column" between uproot, parquet, and form remapping!

I would suggest not trying to remove that separation; it is a problem with the one-pass PR that tried to do so.

Ultimately, column optimisation is really "Buffer Optimisation", and is a black-box for each array source.

Will try to get to this.

@pfackeldey
Copy link
Collaborator Author

Thank you @agoose77 for your reply!
Ok, I understand that the concept of a "column" is different for uproot, parquet, etc., which is a good reason for the API differences.

I'd argue though that:

  1. my example written in this issue should not fail for uproot
  2. there should be a way to do the column projection with a list of strings (list of columns) for any kind of projectable IO layer (any format-specific complexity can be hidden inside the io_func/io_layer)

@agoose77
Copy link
Collaborator

agoose77 commented Dec 13, 2024

Apologies for terse replies: I'm in a meeting!

(1) -- on the face of it, the Parquet example surprises me -- it's actually changing the type -- it should fail for uproot because the repr is trying to view missing arrays. n.b., we talked about making placeholder reprs not throw errors.
(2) -- I think there's a naming problem -- the superset of optimisations is "buffer optimisation", so that should be the core API. If we want to support some notion of "column optimisation", it would need to sit on that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Open-ended questions from users
Projects
None yet
Development

No branches or pull requests

3 participants