Unclear how to manually do column projection with uproot.dask (and API differences with dask-awkward) #1349

@pfackeldey

I'm currently looking into adjusting the IO layer of the dask graph so that it reads only a given list of columns.

With uproot.dask this looks as follows:

import uproot

io = uproot.dask({"https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root": "Events"})
print(io.dask)
# HighLevelGraph with 1 layers.
# <dask.highlevelgraph.HighLevelGraph object at 0x1670b73d0>
#  0. from-uproot-138b384738005b2a7a7eefbb600ca6c2

# update the IO layer in-place, i.e. do column projection
for lay in io.dask.layers.values():
  lay.io_func = lay.io_func.project_keys(frozenset(["nJet"]))

# now compute, should only load `nJet`
io.compute()
# ... TypeError: PlaceholderArray supports only trivial slices, not int

(I have the impression that the underlying form is not updated accordingly here. Or am I using the projection interface incorrectly?)

If I do this with parquet instead though, it works:

import dask_awkward as dak

io = dak.from_parquet("https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.parquet")
print(io.dask)
# HighLevelGraph with 1 layers.
# <dask.highlevelgraph.HighLevelGraph object at 0x15c4ce1a0>
# 0. from-parquet-150809c2f6f63708200b7f130d3a395d

# update the IO layer in-place, i.e. do column projection
for lay in io.dask.layers.values():
  lay.io_func = lay.io_func.project_columns(frozenset(["nJet"]))
  
# now compute, should only load `nJet`
io.compute()
# <Array [{nJet: 5}, {nJet: 8}, ..., {...}, {nJet: 2}] type='40 * {nJet: uint32}'>

I don't understand why the above code example works for dak.from_parquet but not for uproot.dask. There seems to be a real difference in how column projection is implemented for the io_func of the dask layer.

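Apart from that, the APIs are very similar but slightly misaligned between uproot and dask-awkward (probably for historical reasons), e.g. the projection method on the io_func is project_keys for uproot.dask but project_columns for dak.from_parquet, as in the two snippets above.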

There are probably some more that I've not yet encountered.

In principle, it would be nice if uproot.dask adhered to the protocols defined in https://github.com/dask-contrib/dask-awkward/blob/main/src/dask_awkward/lib/io/columnar.py, which would eliminate these differences. Some of that code also seems to be duplicated in uproot._dask.
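To make the idea of adhering to a shared protocol concrete, here is a minimal, self-contained sketch. It does not reproduce the protocols in dask-awkward's columnar.py; SupportsProjectColumns, KeysToColumnsAdapter and FakeKeyProjector are hypothetical names invented for illustration, and the fake class only stands in for an io_func that exposes project_keys.

```python
from __future__ import annotations

from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsProjectColumns(Protocol):
    """Hypothetical minimal protocol: an io_func that offers `project_columns`."""

    def project_columns(self, columns: frozenset[str]) -> SupportsProjectColumns: ...


class KeysToColumnsAdapter:
    """Hypothetical wrapper that exposes `project_keys` as `project_columns`."""

    def __init__(self, io_func):
        self._io_func = io_func

    def project_columns(self, columns):
        # Delegate to the wrapped object's `project_keys` and re-wrap the result.
        return KeysToColumnsAdapter(self._io_func.project_keys(frozenset(columns)))

    def __call__(self, *args, **kwargs):
        # Reading a partition is forwarded unchanged.
        return self._io_func(*args, **kwargs)


class FakeKeyProjector:
    """Stand-in for an io_func with `project_keys`; NOT the real uproot class."""

    def __init__(self, keys=None):
        self.keys = keys

    def project_keys(self, keys):
        return FakeKeyProjector(keys)

    def __call__(self, partition):
        # Pretend to read a partition, keeping only the projected keys.
        if self.keys is None:
            return dict(partition)
        return {key: partition[key] for key in self.keys}


wrapped = KeysToColumnsAdapter(FakeKeyProjector())
projected = wrapped.project_columns(["nJet"])
result = projected({"nJet": 5, "nMuon": 1})
```

An adapter like this only aligns the method names. It would not by itself fix a form that is left un-projected, which is the suspected cause of the TypeError above.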

I'm currently trying to find a way to unify the APIs and to find the reason for this difference.
I'd appreciate any input on how this should work/behave and how we can ensure that the APIs won't diverge in the future.

(If this API were unified, it would be rather easy to make dak.project_columns possible for all AwkwardInputLayer kinds.)
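As a rough sketch of what a layer-agnostic projection helper could look like, the function below walks a mapping of layers and projects every io_func that offers either of the two method names used in the snippets above. project_layers and the two fake io_func classes are hypothetical and exist only for this illustration; the layers here are plain SimpleNamespace objects, not real dask layers.

```python
from types import SimpleNamespace


def project_layers(layers, columns):
    """Project each layer's io_func in place; return the names of updated layers.

    Hypothetical helper: tries `project_columns` first, then `project_keys`.
    """
    columns = frozenset(columns)
    updated = []
    for name, layer in layers.items():
        io_func = getattr(layer, "io_func", None)
        if io_func is None:
            continue  # not an IO layer, nothing to project
        for method_name in ("project_columns", "project_keys"):
            method = getattr(io_func, method_name, None)
            if callable(method):
                layer.io_func = method(columns)
                updated.append(name)
                break
    return updated


class FakeParquetFunc:
    """Stand-in for an io_func with `project_columns`; not the real class."""

    def __init__(self, columns=None):
        self.columns = columns

    def project_columns(self, columns):
        return FakeParquetFunc(columns)


class FakeUprootFunc:
    """Stand-in for an io_func with `project_keys`; not the real class."""

    def __init__(self, keys=None):
        self.keys = keys

    def project_keys(self, keys):
        return FakeUprootFunc(keys)


layers = {
    "from-parquet": SimpleNamespace(io_func=FakeParquetFunc()),
    "from-uproot": SimpleNamespace(io_func=FakeUprootFunc()),
    "not-io": SimpleNamespace(),
}
updated = project_layers(layers, ["nJet"])
```

With a single shared method name, the inner loop over method names would disappear. The helper only swaps the io_func, so for uproot.dask it would presumably still run into the same TypeError until the underlying difference is resolved.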

Labels: question (open-ended questions from users)
