Description
I'm currently looking into adjusting the dask graph layer for the IO to only read a given list of provided columns.
With uproot.dask this looks as follows:
import uproot
io = uproot.dask({"https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root": "Events"})
print(io.dask)
# HighLevelGraph with 1 layers.
# <dask.highlevelgraph.HighLevelGraph object at 0x1670b73d0>
# 0. from-uproot-138b384738005b2a7a7eefbb600ca6c2
# update the IO layer in-place, i.e. do column projection
for lay in io.dask.layers.values():
    lay.io_func = lay.io_func.project_keys(frozenset(["nJet"]))
# now compute, should only load `nJet`
io.compute()
# ... TypeError: PlaceholderArray supports only trivial slices, not int
(I have the impression that the underlying form is not updated accordingly here, or am I using the projection interface incorrectly?)
If I do this with parquet instead though, it works:
import dask_awkward as dak
io = dak.from_parquet("https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.parquet")
print(io.dask)
# HighLevelGraph with 1 layers.
# <dask.highlevelgraph.HighLevelGraph object at 0x15c4ce1a0>
# 0. from-parquet-150809c2f6f63708200b7f130d3a395d
# update the IO layer in-place, i.e. do column projection
for lay in io.dask.layers.values():
    lay.io_func = lay.io_func.project_columns(frozenset(["nJet"]))
# now compute, should only load `nJet`
io.compute()
# <Array [{nJet: 5}, {nJet: 8}, ..., {...}, {nJet: 2}] type='40 * {nJet: uint32}'>
I don't understand why the above code example works for dak.from_parquet but not for uproot.dask. There seems to be a real difference in how column projection is implemented for the io_func of the dask layer.
Apart from that, the APIs are very similar but also a bit misaligned between uproot vs dask-awkward (probably due to historic reasons), e.g.:
- .project_keys() vs .project_columns()
- form_with_unique_keys argument: '<root>' vs '@'
- the state that holds the information of the trace is constructed differently: https://github.com/scikit-hep/uproot5/blob/main/src/uproot/_dask.py#L1082-L1084 vs https://github.com/dask-contrib/dask-awkward/blob/main/src/dask_awkward/lib/io/columnar.py#L104
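To illustrate the naming mismatch in the first bullet, here is a minimal sketch of a shim that calls whichever projection method an io_func happens to expose. The helper `project_io_func` and the two stand-in classes are hypothetical and exist only for this illustration; they are not part of uproot or dask-awkward. The shim only hides the method-name difference and would not by itself fix the TypeError shown above.

```python
from typing import Any, Iterable


def project_io_func(io_func: Any, columns: Iterable[str]) -> Any:
    """Call project_columns() if present, otherwise project_keys().

    Hypothetical helper: it only papers over the method-name difference.
    """
    cols = frozenset(columns)
    for name in ("project_columns", "project_keys"):
        method = getattr(io_func, name, None)
        if callable(method):
            return method(cols)
    raise TypeError(f"{type(io_func).__name__} has no projection method")


class FakeParquetIO:
    """Stand-in for an io_func that follows the project_columns naming."""

    def __init__(self, selected=frozenset()):
        self.selected = selected

    def project_columns(self, columns):
        return FakeParquetIO(frozenset(columns))


class FakeRootIO:
    """Stand-in for an io_func that follows the project_keys naming."""

    def __init__(self, selected=frozenset()):
        self.selected = selected

    def project_keys(self, keys):
        return FakeRootIO(frozenset(keys))


# Both stand-ins can now be projected through the same call.
for fake in (FakeParquetIO(), FakeRootIO()):
    projected = project_io_func(fake, ["nJet"])
    print(type(projected).__name__, sorted(projected.selected))
```

In the loops above, the same helper could in principle replace the direct `.project_keys(...)` / `.project_columns(...)` calls, assuming each layer exposes an io_func attribute as in the examples.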
There are probably some more that I've not yet encountered.
In principle, it would be nice if uproot.dask adhered to the protocols defined here: https://github.com/dask-contrib/dask-awkward/blob/main/src/dask_awkward/lib/io/columnar.py, to eliminate these differences. Some of this also seems to be duplicated code in uproot._dask.
I'm currently trying to find a way to unify the APIs and to find the reason for this difference.
I'd appreciate any input on how this should work/behave and how we can ensure that the APIs won't diverge in the future.
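One possible way to guard against future divergence is a shared conformance check that both projects run in their test suites. The sketch below uses a runtime-checkable `typing.Protocol`. The names `SupportsColumnProjection`, `ConformingIO`, `DivergingIO`, and `conforms` are hypothetical and do not mirror the actual protocol classes in dask-awkward. Note that `isinstance` against a runtime-checkable protocol only verifies that the method exists, not its signature or behavior.

```python
from typing import FrozenSet, Protocol, runtime_checkable


@runtime_checkable
class SupportsColumnProjection(Protocol):
    """Hypothetical shared contract for projectable io_funcs."""

    def project_columns(
        self, columns: FrozenSet[str]
    ) -> "SupportsColumnProjection":
        ...


class ConformingIO:
    """Stand-in io_func that follows the contract."""

    def __init__(self, columns=frozenset()):
        self.columns = columns

    def project_columns(self, columns):
        return ConformingIO(frozenset(columns))


class DivergingIO:
    """Stand-in io_func that uses a different method name."""

    def project_keys(self, keys):
        return self


def conforms(io_func: object) -> bool:
    # Structural check: only the presence of project_columns is verified.
    return isinstance(io_func, SupportsColumnProjection)


print(conforms(ConformingIO()))
print(conforms(DivergingIO()))
```

A test in either repository could assert `conforms(...)` for every io_func class it ships, so that a renamed or missing method fails early rather than at compute time.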
(If this API were unified, it would be rather easy to make dak.project_columns work for all AwkwardInputLayer kinds.)