Unclear how to manually do column projection with `uproot.dask` (and API differences with `dask-awkward`) #1349
Comments
(pinging @lgray @agoose77 @kkothari2001 for ideas, input and help 🙏)
There are different conventions for how the columns are named, and uproot encodes extra things in these names (because some columns are always required even when not in the output). The correct place to call project is probably on the layer, not the IO function; the layer provides a place to translate the "awkward" column names into the "io convention" names. (This is something that the one-pass PR explicitly worked around, removing a lot of protocol classes and code in the process.)
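To make the "project on the layer, not the IO function" idea concrete, here is a minimal pure-Python sketch. Everything in it is hypothetical: `ToyIOFunc`, `ToyInputLayer`, the `to_io_name` translation and the column names are illustrative stand-ins, not the actual uproot or dask-awkward classes. It only shows a layer translating "awkward" names into "io convention" names and adding always-required columns before delegating to the reader.

```python
from dataclasses import dataclass, replace
from typing import Callable, FrozenSet, Iterable


@dataclass(frozen=True)
class ToyIOFunc:
    """Hypothetical reader that only understands its own ("io convention") names."""

    available: FrozenSet[str]
    selected: FrozenSet[str]

    def project(self, io_columns: Iterable[str]) -> "ToyIOFunc":
        wanted = frozenset(io_columns)
        unknown = wanted - self.available
        if unknown:
            raise KeyError(f"unknown io columns: {sorted(unknown)}")
        # Return a new reader restricted to the requested columns.
        return replace(self, selected=wanted)


@dataclass(frozen=True)
class ToyInputLayer:
    """Hypothetical layer that owns the name translation (an assumption of this sketch)."""

    io_func: ToyIOFunc
    to_io_name: Callable[[str], str]
    always_required: FrozenSet[str] = frozenset()

    def project(self, awkward_columns: Iterable[str]) -> "ToyInputLayer":
        # Translate "awkward" names to the reader's convention ...
        io_columns = {self.to_io_name(c) for c in awkward_columns}
        # ... and add columns the reader needs even if they are not in the output.
        io_columns |= self.always_required
        return replace(self, io_func=self.io_func.project(io_columns))


# Made-up column names and a made-up translation rule, for illustration only.
all_columns = frozenset({"Muon_pt", "Muon_eta", "nMuon"})
layer = ToyInputLayer(
    io_func=ToyIOFunc(available=all_columns, selected=all_columns),
    to_io_name=lambda name: name.replace(".", "_"),
    always_required=frozenset({"nMuon"}),
)
projected = layer.project(["Muon.pt"])
print(sorted(projected.io_func.selected))
```

In this sketch the reader never sees an "awkward" name, so each source could keep its own convention while the layer exposes a single `project` entry point.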
Thanks for the clarification.
@pfackeldey not much time to answer in full here, but the different column projection conventions were intentional -- they reflect the different concepts of "column" between uproot, parquet, and form remapping! I would suggest not trying to remove that separation; it is a problem with the one-pass PR, which tried to do so. Ultimately, column optimisation is really "Buffer Optimisation", and is a black box for each array source. Will try to get to this.
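As a rough illustration of why "column" can mean different things per source, the sketch below resolves the same requested field to different sets of buffers depending on the source kind. The source kinds, the buffer naming scheme and the `necessary_buffers` helper are all invented for this example; real sources decide this internally.

```python
from typing import Callable, Dict, FrozenSet, Iterable

BufferResolver = Callable[[Iterable[str]], FrozenSet[str]]


def flat_source_buffers(fields: Iterable[str]) -> FrozenSet[str]:
    """Toy source where one field corresponds to exactly one data buffer."""
    return frozenset(f"{field}-data" for field in fields)


def jagged_source_buffers(fields: Iterable[str]) -> FrozenSet[str]:
    """Toy source where one field needs an offsets buffer plus a data buffer."""
    buffers = set()
    for field in fields:
        buffers.update({f"{field}-offsets", f"{field}-data"})
    return frozenset(buffers)


# Each source kind is a black box: callers only ask "which buffers do I need?".
RESOLVERS: Dict[str, BufferResolver] = {
    "flat": flat_source_buffers,
    "jagged": jagged_source_buffers,
}


def necessary_buffers(source_kind: str, fields: Iterable[str]) -> FrozenSet[str]:
    return RESOLVERS[source_kind](fields)


print(sorted(necessary_buffers("flat", ["pt"])))
print(sorted(necessary_buffers("jagged", ["pt"])))
```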
Thank you @agoose77 for your reply! I'd argue though that:
Apologies for terse replies: I'm in a meeting! (1) -- on the face of it, the Parquet example surprises me -- it's actually changing the type -- it should fail for
I'm currently looking into adjusting the dask graph layer for the IO to only read a given list of provided columns.
With uproot.dask this looks as follows:
(I have the impression that the underlying form is not updated accordingly here, or I'm using the projection interface wrongly?)
If I do this with parquet instead though, it works:
I don't understand why the above code example works for `dak.from_parquet`, but not for `uproot.dask`; there seems to be a real difference in how the column projection is implemented for the `io_func` of the dask layer.

Apart from that, the APIs are very similar but also a bit misaligned between uproot vs dask-awkward (probably due to historic reasons), e.g.:
- `.project_keys()` vs `.project_columns()`
- `form_with_unique_keys` argument `'<root>'` vs `'@'`
- the `state` that holds the information of the trace is constructed differently: https://github.com/scikit-hep/uproot5/blob/main/src/uproot/_dask.py#L1082-L1084 vs https://github.com/dask-contrib/dask-awkward/blob/main/src/dask_awkward/lib/io/columnar.py#L104

There are probably some more that I've not yet encountered.
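One way such a naming mismatch could be bridged is a thin adapter behind a shared protocol. The sketch below is only an illustration under stated assumptions: `ColumnProjectable`, `KeysStyleReader`, `ProjectKeysAdapter` and the `project` helper are hypothetical names, and the stand-in reader merely mimics a class that exposes `.project_keys()`; none of this is the real uproot or dask-awkward code.

```python
from typing import Iterable, Protocol, runtime_checkable


@runtime_checkable
class ColumnProjectable(Protocol):
    """Hypothetical shared protocol: anything offering project_columns()."""

    def project_columns(self, columns: Iterable[str]) -> "ColumnProjectable": ...


class KeysStyleReader:
    """Stand-in for a reader that only exposes project_keys()."""

    def __init__(self, keys: Iterable[str]) -> None:
        self.keys = frozenset(keys)

    def project_keys(self, keys: Iterable[str]) -> "KeysStyleReader":
        # Keep only the keys that this reader actually has.
        return KeysStyleReader(self.keys & frozenset(keys))


class ProjectKeysAdapter:
    """Expose a project_keys()-style object through project_columns()."""

    def __init__(self, wrapped: KeysStyleReader) -> None:
        self.wrapped = wrapped

    def project_columns(self, columns: Iterable[str]) -> "ProjectKeysAdapter":
        return ProjectKeysAdapter(self.wrapped.project_keys(columns))


def project(source: object, columns: Iterable[str]) -> ColumnProjectable:
    """Single entry point that works for both naming styles."""
    if isinstance(source, ColumnProjectable):
        return source.project_columns(columns)
    if hasattr(source, "project_keys"):
        return ProjectKeysAdapter(source).project_columns(columns)
    raise TypeError(f"{type(source).__name__} supports no known projection method")


adapted = project(KeysStyleReader(["a", "b", "c"]), ["a", "c"])
print(sorted(adapted.wrapped.keys))
```

An adapter like this only papers over the method name; differences in column naming conventions or in how the trace `state` is built would still need to be handled separately.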
In principle, it would be nice if `uproot.dask` would adhere to the protocols defined here: https://github.com/dask-contrib/dask-awkward/blob/main/src/dask_awkward/lib/io/columnar.py, to eliminate these differences. Some of this seems to be duplicated code in `uproot._dask` as well.

I'm currently trying to find a way to unify the APIs and to find the reason for this difference.
I'd appreciate any input on how this should work/behave and how we can ensure that the APIs won't diverge in the future.
(If this API were unified, it would be rather easy to make `dak.project_columns` possible for all `AwkwardInputLayer` kinds.)