Describe the bug
When a literal column is passed into a UDF, the UDF receives it as a Series containing a single value. That may be acceptable on its own, but when batch_size is also set on the UDF, the UDF fails to run altogether.
To Reproduce
>>> import daft
>>> @daft.udf(return_dtype=daft.DataType.int64(), batch_size=1)
... def my_sum(a, b):
...     return a.sum(b)
...
>>> df = daft.from_pydict({"a": [1, 2, 3]})
>>> df = df.with_column("b", daft.lit(0))
>>> df.select(my_sum(df["a"], df["b"])).show()
Error when running pipeline node ProjectOperator
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[8], line 1
----> 1 df.select(my_sum(df["a"], df["b"])).show()

File ~/Desktop/Daft/daft/api_annotations.py:26, in DataframePublicAPI.<locals>._wrap(*args, **kwargs)
     24 type_check_function(func, *args, **kwargs)
     25 timed_method = time_df_method(func)
---> 26 return timed_method(*args, **kwargs)

File ~/Desktop/Daft/daft/analytics.py:199, in time_df_method.<locals>.tracked_method(*args, **kwargs)
    197 start = time.time()
    198 try:
--> 199     result = method(*args, **kwargs)
    200 except Exception as e:
    201     _ANALYTICS_CLIENT.track_df_method_call(
    202         method_name=method.__name__, duration_seconds=time.time() - start, error=str(type(e).__name__)
    203     )

File ~/Desktop/Daft/daft/dataframe/dataframe.py:2641, in DataFrame.show(self, n)
   2628 @DataframePublicAPI
   2629 def show(self, n: int = 8) -> None:
   2630     """Executes enough of the DataFrame in order to display the first ``n`` rows.
   2631
   2632     If IPython is installed, this will use IPython's `display` utility to pretty-print in a
   (...)
   2639         n: number of rows to show. Defaults to 8.
   2640     """
-> 2641     dataframe_display = self._construct_show_display(n)
   2642     try:
   2643         from IPython.display import display

File ~/Desktop/Daft/daft/dataframe/dataframe.py:2598, in DataFrame._construct_show_display(self, n)
   2596 tables = []
   2597 seen = 0
-> 2598 for table in get_context().get_or_create_runner().run_iter_tables(builder, results_buffer_size=1):
   2599     tables.append(table)
   2600     seen += len(table)

File ~/Desktop/Daft/daft/runners/native_runner.py:89, in NativeRunner.run_iter_tables(self, builder, results_buffer_size)
     86 def run_iter_tables(
     87     self, builder: LogicalPlanBuilder, results_buffer_size: int | None = None
     88 ) -> Iterator[MicroPartition]:
---> 89     for result in self.run_iter(builder, results_buffer_size=results_buffer_size):
     90         yield result.partition()

File ~/Desktop/Daft/daft/runners/native_runner.py:84, in NativeRunner.run_iter(self, builder, results_buffer_size)
     78 executor = NativeExecutor.from_logical_plan_builder(builder)
     79 results_gen = executor.run(
     80     {k: v.values() for k, v in self._part_set_cache.get_all_partition_sets().items()},
     81     daft_execution_config,
     82     results_buffer_size,
     83 )
---> 84 yield from results_gen

File ~/Desktop/Daft/daft/execution/native_executor.py:40, in <genexpr>(.0)
     35 from daft.runners.partitioning import LocalMaterializedResult
     37 psets_mp = {
     38     part_id: [part.micropartition()._micropartition for part in parts] for part_id, parts in psets.items()
     39 }
---> 40 return (
     41     LocalMaterializedResult(MicroPartition._from_pymicropartition(part))
     42     for part in self._executor.run(psets_mp, daft_execution_config, results_buffer_size)
     43 )

File ~/Desktop/Daft/daft/udf.py:140, in run_udf(func, bound_args, evaluated_expressions, py_return_dtype, batch_size)
    136 else:
    137     # all inputs must have the same lengths for batching
    138     # not sure this error can possibly be triggered but it's here
    139     if len(set(len(s) for s in evaluated_expressions)) != 1:
--> 140         raise RuntimeError(
    141             f"User-defined function `{func}` failed: cannot run in batches when inputs are different lengths: {tuple(len(series) for series in evaluated_expressions)}"
    142         )
    144 results = []
    145 for i in range(0, len(evaluated_expressions[0]), batch_size):

RuntimeError: User-defined function `<function my_sum at 0x14adc5440>` failed: cannot run in batches when inputs are different lengths: (3, 1)
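For reference, one way to sidestep the failing length check today is to pass the constant as a plain Python value instead of a literal expression. This is only a sketch: it assumes that non-Expression arguments to a Daft UDF are forwarded to the function unchanged and that returning a Python list from the UDF is accepted.

import daft

@daft.udf(return_dtype=daft.DataType.int64(), batch_size=1)
def my_sum_workaround(a, offset):
    # `offset` is an ordinary Python int here, not a Series, so the batching
    # length check only sees the single Series input `a`.
    return [x + offset for x in a.to_pylist()]

df = daft.from_pydict({"a": [1, 2, 3]})
df.select(my_sum_workaround(df["a"], 0)).show()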
Expected behavior
The UDF should work with a literal column even when batch_size is set. We may also want to broadcast the literal value so that the UDF always receives Series arguments of equal length.
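One possible shape for that broadcasting step, applied to the evaluated inputs before they are sliced into batches. This is a sketch only, not the actual run_udf implementation; it assumes Series.to_pylist(), Series.from_pylist(), and Series.name() behave as in the current public API, and the helper name is hypothetical.

from daft import Series

def broadcast_length_one_inputs(series_inputs: list[Series]) -> list[Series]:
    # If some inputs are length-1 (e.g. they came from a literal) while others
    # are full columns, repeat the single value so every Series has the same
    # length before the batching length check runs.
    target_len = max(len(s) for s in series_inputs)
    broadcasted = []
    for s in series_inputs:
        if len(s) == 1 and target_len > 1:
            broadcasted.append(Series.from_pylist(s.to_pylist() * target_len, name=s.name()))
        else:
            broadcasted.append(s)
    return broadcasted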
Component(s)
Expressions, Other
Additional context
No response