(UDF) Simplified multi-input multi-output (ala HuggingFace Datasets, Ray, ..) #3567

Open
Lundez opened this issue Dec 13, 2024 · 8 comments
Assignees
Labels
enhancement New feature or request help wanted Extra attention is needed p2 Nice to have features

Comments

@Lundez

Lundez commented Dec 13, 2024

Is your feature request related to a problem?

Hi,

I'd like to see a simplified way to work with multiple columns in, multiple columns out.
One of the more pythonic approaches I've seen is to use dict[str, np.ndarray] -> dict[str, np.ndarray] (alternatively dict[str, Any]).

This approach is taken by Ray (map_batches) and HuggingFace Datasets (map)
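A minimal sketch of what this dict-in, dict-out shape looks like. Plain Python lists stand in for np.ndarray so it runs without dependencies, and `map_batches` / `area_and_aspect` are hypothetical helpers written for illustration, not the Ray, HuggingFace, or Daft API:

```python
from typing import Any, Callable, Iterator

# Assumption: a "batch" is a dict of equal-length columns (lists instead of np.ndarray).
Batch = dict[str, list[Any]]


def iter_batches(table: Batch, batch_size: int) -> Iterator[Batch]:
    """Yield consecutive row slices of a columnar table."""
    num_rows = len(next(iter(table.values()))) if table else 0
    for start in range(0, num_rows, batch_size):
        yield {name: col[start:start + batch_size] for name, col in table.items()}


def map_batches(table: Batch, fn: Callable[[Batch], Batch], batch_size: int) -> Batch:
    """Apply a dict -> dict function batch by batch and concatenate the outputs."""
    out: Batch = {}
    for batch in iter_batches(table, batch_size):
        for name, col in fn(batch).items():
            out.setdefault(name, []).extend(col)
    return out


def area_and_aspect(batch: Batch) -> Batch:
    # Two columns in, two columns out: every input is addressed by name.
    pairs = list(zip(batch["width"], batch["height"]))
    return {
        "area": [w * h for w, h in pairs],
        "aspect": [w / h for w, h in pairs],
    }


table = {"width": [4, 6, 8], "height": [2, 3, 4]}
result = map_batches(table, area_and_aspect, batch_size=2)
print(result)  # {'area': [8, 18, 32], 'aspect': [2.0, 2.0, 2.0]}
```

The function never needs to know how many columns the table has; it just picks what it needs and returns whatever it produces.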

Why is this important for Deep Learning?
When working with tasks such as Object Detection you need to transform the Bounding Box and Image the same way. Transforming them "in parallel" is cumbersome but possible. It turns into a big problem when it comes to augmenting data: augmentation is commonly applied with a probability p, and what is applied is also random (e.g. RandomCrop, RandomRescale, MixUp, ...). This means that the augmentation has to be applied exactly the same way to both BBox and Image. The only way I see to do this now is by building a struct, which is possible but not pythonic.

P.S. It's great that batch_size is already enabled as batched transforms are excellent for certain augmentations, e.g. MixUp.
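To make the "same random decision for both columns" point concrete, here is a rough sketch of a horizontal flip that draws one random number per row and applies the result to the image and its boxes together. The function name `random_hflip`, the column names, and the nested-list image are assumptions for illustration only; a real pipeline would use a library such as Albumentations:

```python
import random
from typing import Any

Batch = dict[str, list[Any]]


def random_hflip(batch: Batch, p: float, rng: random.Random) -> Batch:
    """Flip image and boxes together, drawing ONE random decision per row."""
    images, boxes = [], []
    for image, row_boxes in zip(batch["image"], batch["bboxes"]):
        width = len(image[0])  # image is a list of pixel rows
        if rng.random() < p:
            image = [row[::-1] for row in image]
            # pascal_voc boxes are (x_min, y_min, x_max, y_max) in pixels
            row_boxes = [
                (width - x_max, y_min, width - x_min, y_max)
                for x_min, y_min, x_max, y_max in row_boxes
            ]
        images.append(image)
        boxes.append(row_boxes)
    return {"image": images, "bboxes": boxes}


batch = {
    "image": [[[1, 2, 3, 4], [5, 6, 7, 8]]],  # one 2x4 image
    "bboxes": [[(0, 0, 1, 2)]],               # one box in the left column
}
flipped = random_hflip(batch, p=1.0, rng=random.Random(0))
untouched = random_hflip(batch, p=0.0, rng=random.Random(0))
print(flipped["image"][0])   # [[4, 3, 2, 1], [8, 7, 6, 5]]
print(flipped["bboxes"][0])  # [(3, 0, 4, 2)]
```

If the image and the boxes went through two separate single-column UDFs, each would draw its own random number and the two could disagree, which is the problem described above.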

Describe the solution you'd like

A multi-input, multi-output API for UDFs

Describe alternatives you've considered

I've thought of using struct but it's not as smooth as the more "pythonic" approach of using dict.

Wondering what your idea is.

Additional Context

import albumentations as A

transforms = A.Compose([
    A.RandomResizedCrop(size=(224, 224), antialias=True),
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
], bbox_params=A.BboxParams(format="pascal_voc", label_fields=["category_id"]))

# sample is a dict with keys such as "image", "bboxes" and "category_id"
transforms(**sample)

This is how Albumentations is applied: sample is a dict of values, and transforms takes them as keyword arguments.
There is a guide on how to use Albumentations with HF Datasets.

Would you like to implement a fix?

Maybe, if you guide me I could try to get it done during the weekend.

@Lundez Lundez added enhancement New feature or request needs triage labels Dec 13, 2024
@andrewgazelka andrewgazelka added the p2 Nice to have features label Dec 13, 2024
@andrewgazelka
Contributor

Do you have any thoughts @kevinzwang ?

@universalmind303
Contributor

I think this is a good idea. I'm thinking we could use struct under the hood, but provide some nice abstractions over it to make the udf experience as seamless as possible.

@kevinzwang
Member

Hi @Lundez, thanks for bringing this up. I have a few questions:

  1. you should already be able to construct UDFs with multiple inputs by simply adding more arguments to your UDF. Does that work for you?
  2. it's true that UDFs don't have a great mechanism for outputting multiple values at the moment. Is there an interface that you would like to propose for this? The workaround at the moment that we recommend is returning a struct dtype as a list of dictionaries in your UDF. Then, you can expand the struct fields with col("struct_col.*").

Here's a quick example of doing multi-input multi-output with the things I mentioned above:

>>> import daft
>>> @daft.udf(return_dtype=daft.DataType.struct({
...     "x": daft.DataType.int64(),
...     "y": daft.DataType.int64(),
... }))
... def my_udf(a, b):
...     # simple UDF that just returns the two inputs as a struct column
...     result = []
...     for a_elem, b_elem in zip(a.to_pylist(), b.to_pylist()):
...         result.append({"x": a_elem, "y": b_elem})
...     return result
... 
>>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> # call UDF
>>> df = df.select(my_udf(df["a"], df["b"]).alias("udf_result"))
>>> # unnest struct fields
>>> df = df.select("udf_result.*")
>>> df.show()
╭───────┬───────╮
│ x     ┆ y     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     │
╰───────┴───────╯

(Showing first 3 of 3 rows)

@kevinzwang
Member

What we could maybe do is also support returning a dictionary of lists instead of a list of dictionaries for struct type columns.
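For reference, the two shapes carry the same information and convert into each other mechanically. A small sketch in plain Python; `columns_to_rows` and `rows_to_columns` are hypothetical helper names, not part of Daft:

```python
from typing import Any


def columns_to_rows(columns: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Dict of lists -> list of dicts (the shape a struct-returning UDF uses today)."""
    lengths = {len(col) for col in columns.values()}
    if len(lengths) > 1:
        raise ValueError(f"columns have different lengths: {sorted(lengths)}")
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


def rows_to_columns(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """List of dicts -> dict of lists, taking field names from the first row."""
    if not rows:
        return {}
    return {name: [row[name] for row in rows] for name in rows[0]}


columns = {"x": [1, 2, 3], "y": [4, 5, 6]}
rows = columns_to_rows(columns)
print(rows)  # [{'x': 1, 'y': 4}, {'x': 2, 'y': 5}, {'x': 3, 'y': 6}]
```

Batched code (NumPy, tensors) tends to produce the dict-of-lists shape naturally, so accepting it would save the row-by-row loop in the UDF body.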

@Lundez
Author

Lundez commented Dec 13, 2024

Hi,

I know it's technically possible to do right now (as I noted with my comment regarding struct). If it's how you prefer the DX to be I'm fine.

I'm merely suggesting adding another way that feels easier to work with, which could potentially help adoption.

The col("struct.*") syntax was quite cool, though the ".unnest()" approach seems clearer (IMO).

Feel free to close issue if you're happy with the state of today 👍

@kevinzwang
Member

Ah I see, thanks for the feedback. I do want to get around to improving the ergonomics of UDFs, and I think we'll have some time after the new year to flesh it out. Will keep this issue open for others in the community to voice their thoughts too.

@kevinzwang
Member

Here's my proposal:

  • Add something like an unnest_output parameter in @daft.udf that tells Daft to automatically convert a struct output into columns
  • more ways to return struct type arrays (in particular, dict of list)
  • a way to configure a UDF to take in an arbitrary number of input columns + something like selectors in Polars to allow users to easily pass in specific sets of columns.
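A rough sketch of how the first two bullets could behave, written in plain Python over a dict-of-lists table. Everything here is hypothetical: `multi_output_udf`, its `unnest_output` and `output_name` parameters, and the calling convention are illustrations of the idea, not Daft's actual or planned API:

```python
from typing import Any, Callable

Columns = dict[str, list[Any]]


def multi_output_udf(unnest_output: bool = False, output_name: str = "udf_result"):
    """Hypothetical decorator: accepts dict-of-lists or list-of-dicts returns."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Columns]:
        def apply(table: Columns, *input_names: str) -> Columns:
            result = fn(*(table[name] for name in input_names))
            if isinstance(result, list):
                # Normalize list of dicts -> dict of lists.
                result = (
                    {key: [row[key] for row in result] for key in result[0]}
                    if result
                    else {}
                )
            if unnest_output:
                # One output column per struct field.
                return dict(result)
            # Otherwise keep a single struct-like column of row dicts.
            rows = [dict(zip(result, values)) for values in zip(*result.values())]
            return {output_name: rows}

        return apply

    return decorator


@multi_output_udf(unnest_output=True)
def sum_and_diff(a, b):
    # Returns a dict of lists.
    return {
        "sum": [x + y for x, y in zip(a, b)],
        "diff": [x - y for x, y in zip(a, b)],
    }


@multi_output_udf()
def pair(a, b):
    # Returns a list of dicts, as the struct workaround does today.
    return [{"x": x, "y": y} for x, y in zip(a, b)]


table = {"a": [1, 2, 3], "b": [4, 5, 6]}
unnested = sum_and_diff(table, "a", "b")
nested = pair(table, "a", "b")
print(unnested)  # {'sum': [5, 7, 9], 'diff': [-3, -3, -3]}
print(nested)    # {'udf_result': [{'x': 1, 'y': 4}, {'x': 2, 'y': 5}, {'x': 3, 'y': 6}]}
```

The point of the flag is that the user's function stays the same either way; only the shape handed back to the dataframe changes.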

@jaychia do you have any thoughts?

@Lundez
Author

Lundez commented Dec 13, 2024

> Here's my proposal:
>
>   • Add something like an unnest_output parameter in @daft.udf that tells Daft to automatically convert a struct output into columns
>   • more ways to return struct type arrays (in particular, dict of list)
>   • a way to configure a UDF to take in an arbitrary number of input columns + something like selectors in Polars to allow users to easily pass in specific sets of columns.
>
> @jaychia do you have any thoughts?

I like this.
And regarding selectors from polars, those are exceptional. Great idea to add!

@kevinzwang kevinzwang self-assigned this Dec 13, 2024
@ccmao1130 ccmao1130 added the help wanted Extra attention is needed label Dec 16, 2024