Unclear how to manually do column projection with `uproot.dask` (and API differences with `dask-awkward`)

I'm currently looking into adjusting the IO layer of the dask graph so that it reads only a given list of columns.

With `uproot.dask` this looks as follows:
```python
import uproot

io = uproot.dask({"https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root": "Events"})
print(io.dask)
# HighLevelGraph with 1 layers.
# <dask.highlevelgraph.HighLevelGraph object at 0x1670b73d0>
#  0. from-uproot-138b384738005b2a7a7eefbb600ca6c2

# update the IO layer in-place, i.e. do column projection
for lay in io.dask.layers.values():
    lay.io_func = lay.io_func.project_keys(frozenset(["nJet"]))

# now compute, should only load `nJet`
io.compute()
# ... TypeError: PlaceholderArray supports only trivial slices, not int
```
(My impression is that the underlying form is not updated to match the projection here, or am I using the projection interface incorrectly?)

If I do this with parquet instead though, it works:
```python
import dask_awkward as dak

io = dak.from_parquet("https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.parquet")
print(io.dask)
# HighLevelGraph with 1 layers.
# <dask.highlevelgraph.HighLevelGraph object at 0x15c4ce1a0>
# 0. from-parquet-150809c2f6f63708200b7f130d3a395d

# update the IO layer in-place, i.e. do column projection
for lay in io.dask.layers.values():
    lay.io_func = lay.io_func.project_columns(frozenset(["nJet"]))

# now compute, should only load `nJet`
io.compute()
# <Array [{nJet: 5}, {nJet: 8}, ..., {...}, {nJet: 2}] type='40 * {nJet: uint32}'>
```
I don't understand why the above code example works for `dak.from_parquet` but not for `uproot.dask`. There seems to be a real difference in how column projection is implemented for the `io_func` of the dask layer.

Apart from that, the APIs are very similar but slightly misaligned between uproot and dask-awkward (probably for historical reasons), e.g.:
- `.project_keys()` vs `.project_columns()`
- `form_with_unique_keys` argument `'<root>'` vs `'@'`
- the `state` that holds the information of the trace is constructed differently: https://github.com/scikit-hep/uproot5/blob/main/src/uproot/_dask.py#L1082-L1084 vs https://github.com/dask-contrib/dask-awkward/blob/main/src/dask_awkward/lib/io/columnar.py#L104

There are probably some more that I've not yet encountered.
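
To illustrate the first mismatch in the list above, here is a minimal sketch of an adapter that calls whichever projection method an `io_func` exposes. The helper `project_io_func` and the two stand-in classes are hypothetical and exist only for illustration; they are not part of uproot or dask-awkward. The sketch only bridges the naming difference; it does not address the `form_with_unique_keys` or `state` differences, and it would not by itself resolve the `TypeError` shown for `uproot.dask`.

```python
from typing import Any, FrozenSet, Iterable


def project_io_func(io_func: Any, columns: Iterable[str]) -> Any:
    """Return a projected copy of ``io_func``, whichever method name it exposes.

    Hypothetical helper: it only papers over the naming difference between
    ``project_columns`` and ``project_keys``.
    """
    wanted: FrozenSet[str] = frozenset(columns)
    for name in ("project_columns", "project_keys"):
        method = getattr(io_func, name, None)
        if callable(method):
            return method(wanted)
    raise TypeError(f"{type(io_func).__name__} has no known projection method")


# Stand-in classes so the sketch runs without uproot or dask-awkward installed.
class FakeParquetFunc:
    def __init__(self, columns=None):
        self.columns = columns

    def project_columns(self, columns):
        return FakeParquetFunc(columns)


class FakeUprootFunc:
    def __init__(self, keys=None):
        self.keys = keys

    def project_keys(self, keys):
        return FakeUprootFunc(keys)


parquet_projected = project_io_func(FakeParquetFunc(), ["nJet"])
uproot_projected = project_io_func(FakeUprootFunc(), ["nJet"])
print(parquet_projected.columns, uproot_projected.keys)
```

With real layers, the loop from the examples above would presumably become `lay.io_func = project_io_func(lay.io_func, ["nJet"])`, though I have not verified this against either library.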

In principle, it would be nice if `uproot.dask` adhered to the protocols defined here: https://github.com/dask-contrib/dask-awkward/blob/main/src/dask_awkward/lib/io/columnar.py, to eliminate these differences. Some of that code also seems to be duplicated in `uproot._dask`.

I'm currently trying to find a way to unify the APIs and to find the reason for this difference.
I'd appreciate any input on how this should work/behave and on how we can ensure that the APIs don't diverge in the future.
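
One possible way to guard against divergence would be a conformance check in each project's test suite. The sketch below uses a runtime-checkable `typing.Protocol`; the protocol `SupportsProjectColumns` and both example classes are hypothetical and are not the actual protocols from `dask_awkward.lib.io.columnar`.

```python
from typing import FrozenSet, Protocol, runtime_checkable


@runtime_checkable
class SupportsProjectColumns(Protocol):
    """Hypothetical shared interface; not the actual dask-awkward protocol."""

    def project_columns(self, columns: FrozenSet[str]) -> "SupportsProjectColumns":
        ...


# Stand-in for an io_func that follows the shared naming.
class ConformingFunc:
    def project_columns(self, columns):
        return self


# Stand-in for an io_func that uses a different method name.
class DivergingFunc:
    def project_keys(self, keys):
        return self


def conforms(io_func: object) -> bool:
    # isinstance() on a runtime-checkable Protocol only checks that the
    # method exists; it does not validate the signature or return type.
    return isinstance(io_func, SupportsProjectColumns)


print(conforms(ConformingFunc()), conforms(DivergingFunc()))  # True False
```

Because the runtime check is structural and shallow, a static type checker would still be needed to catch signature mismatches.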

(If this API were unified, it would be rather easy to make [`dak.project_columns`](https://github.com/dask-contrib/dask-awkward/issues/559) possible for all `AwkwardInputLayer` kinds.)
